Go-Geo!

Title: Data Distribution Study
Synopsis: This document reports on an investigation into data distribution systems and the requirements of data creators for a system within Go-Geo!
Author: Julie Missen, UK Data Archive
Date: 05 August July 2004
Version: 1.c.a
Status: Final
Authorised: Dr David Medyckyj-Scott, EDINA

Contents

1.0 Background
2.0 Introduction
3.0 Conducting the Data Distribution Study
4.0 Requirements Study
  4.1 Results
  4.2 Summary
5.0 Literature Review
  5.1 Self Archiving
    5.1.1 Copyright/IPR
    5.1.2 Preservation
    5.1.3 Peer review
    5.1.4 Cost
    5.1.5 Software
    5.1.6 Case Studies
  5.2 Peer to Peer Archiving (P2P)
    5.2.1 How Does P2P Work?
    5.2.2 The Centralised Model of P2P File-Sharing
    5.2.3 The Decentralised Model of P2P File-Sharing
    5.2.4 Advantages and Limitations of using the Peer-to-Peer Network
    5.2.5 Copyright/IPR
    5.2.6 Preservation
    5.2.7 Cost
    5.2.8 Security
    5.2.9 Peer-to-Peer Systems
    5.2.10 Case Studies
  5.3 Distribution from a Traditional (Centralised) Archive
    5.3.1 Copyright/IPR
    5.3.2 Preservation
    5.3.3 Cost
    5.3.4 Case Studies
  5.4 Data Distribution Systems Summary
6.0 Conclusions
7.0 Recommendations for Data Distribution for the Go-Geo! Portal
Bibliography
Glossary
Appendix A – Data Distribution Survey

1.0 Background

The creation and submission of geo-spatial metadata records to the Go-Geo! portal is not the end of the data collection process. Datasets are at risk of becoming forgotten once a project ends and staff move on. Unless preserved for further use, data which have been collected at significant expense and with considerable expertise and effort may later exist in only a small number of reports which analyse only a fraction of the research potential of the data. Within a very short space of time, the data files can become lost or obsolete as the technology of the collecting institution changes. The metadata records in the Go-Geo! portal could rapidly become useless, as they will describe datasets to which users have no access. There is a need to ensure that data are preserved against technological obsolescence and physical damage, and that means are provided to supply them in an appropriate form to users.

Individuals, research centres and departments are generally not organised in such a way as to be able to administer distribution of data to people who approach them. While a researcher might be willing to burn data onto a CD as a one-off request, they would be less willing to do this on a regular basis. One of the hopes of the Go-Geo! project is that researchers and research centres will provide metadata to Go-Geo! about smaller, more specialist datasets they have created. One of the benefits to individuals and centres of doing this is that demonstrating continued usage of data after the original research is completed can influence funders to provide further research money. However, feedback has indicated that provision of metadata from these sorts of data providers might be limited because of concerns about how they would distribute and store their data.

For larger datasets hosted and made available through national data centres and other large service providers, the technology already exists through which the Go-Geo! portal could support online data mining, visualisation, exploitation and analysis of geospatial data. However, issues of licence to use data in this way, and funding to establish the technical infrastructure, need to be resolved before further developments can take place. More investigation is required into the best means by which individuals and smaller organisations, such as research groups/centres, can share data with others. The combination of compulsion and reward proposed to encourage metadata creation could also be applied to data archiving.
The Research Assessment Exercise could reward institutions for depositing high-quality geo-spatial datasets with a suitable archival body, and funding councils and research councils could follow the example of the NERC, the ESRC and the AHRB, which make it a condition of funding that all geo-spatial datasets should be archived.

It seems clear that what is required is a cost-effective and easy way for those holding geospatial data to share their data with others within UK tertiary education. A data sharing mechanism is seen as critical to the project. Without it, the amount of geo-spatial data available for re-use may be limited. The need for long-term preservation, along with cataloguing the existence of data, was identified during the phase one feasibility study, and a recommendation was made to the JISC to consider establishing a repository for geospatial datasets which fall outside the collecting scope of the UK Data Archive and the Arts and Humanities Data Service. This may be something one or more existing data centres could take on, or it could become an activity within the operation of the Go-Geo! portal.

Before this investigation was undertaken, three scenarios of potential data distribution systems for the Go-Geo! project were identified:

- self-archiving service. It was envisaged that one or more self-archiving services could be established where data producers/holders could publish data for use by others. The service would need to provide mechanisms for users to submit data, metadata and accompanying documentation (PDF, Word files etc.). Metadata would also be published, possibly using OAI, and therefore harvested and stored in the Go-Geo! catalogue;
- peer-to-peer (P2P) application. Data holders/custodians would set up a P2P server on institutional machines and store data in them (probably at an institutional or departmental level). Metadata would be published announcing the existence of servers and geospatial data. Metadata could also be published to the Go-Geo! catalogue. Users would use a P2P client or the Go-Geo! portal to search for data and then, having located a copy of the data, download it to their machine;
- depositing copies of data with archive organisations, such as the UK Data Archive. The archive would maintain controls over the data on behalf of the owner and ensure the long-term safekeeping of the data. The archive would take over the administrative tasks associated with external users and their queries. Potential users of the data would typically find data through an online catalogue provided by the archive. Popular or large datasets may be made available online; otherwise, online ordering systems are provided through which copies of datasets can be ordered.

If researchers and research centres did deposit their data with an archive, it would be important that the metadata records displayed by Go-Geo! recorded this fact and how to contact the archive.

2.0 Introduction

The aim of this study was to investigate a cost-effective way of distributing geo-spatial data held by individuals, research teams and departments. Three approaches were considered in this investigation, which were felt to reflect the resources available to the data creator/custodian for data distribution: P2P, self-archiving and traditional (centralised) archiving. The first part of this report concentrates on a survey which looked at data creators' and depositors' requirements for a data distribution mechanism.
Key researchers and faculty within the geographic information community in UK academia were contacted to assess their requirements and constraints for data distribution. The survey was also posted on mailing lists and on both the portal and project web sites. Technical options for both P2P and self-archiving were investigated through a literature review and by contacting experts in their respective fields. Existing software solutions were identified and evaluated; of particular importance was to determine how well existing software solutions could either meet, or be modified to meet, the particular requirements of geo-spatial data distribution. The two approaches were compared against traditional (centralised) data archiving services.

3.0 Conducting the Data Distribution Study

The data distribution study began later than anticipated because of an overrun from another work package, and there was a further delay due to communication difficulties with City University. Towards the end of the project this relationship was terminated and their allocated work was undertaken by staff at the UK Data Archive.

At the start of the study, a meeting was held at the UK Data Archive to initiate ideas on the subject of data distribution and to draft a list of stakeholders. A list of stakeholders was drawn up by the UKDA and EDINA, including key researchers from the geographic information (GI) community. A requirements survey was then developed and distributed to stakeholders to assess their needs on data distribution issues. The survey (see Appendix A) attempted to discover how organisations and individuals would like to see data distributed in the future. The survey looked at:

- access conditions;
- technical issues;
- copyright/IPR;
- funding;
- licences.

The questionnaire was posted on the Go-Geo! web site and the project web site, and was distributed at GISRUK and at two workshops undertaken by the Go-Geo! metadata project.

An investigation was then undertaken to compare self-archiving and peer-to-peer with traditional (centralised) archiving. The use of OAI (Open Archives Initiative) was also considered within this study. The peer-to-peer study should have been undertaken by City University but, as this relationship was later terminated, their allocated work was undertaken by the UK Data Archive, with some consultation with an expert in the field. As this change in workload occurred at a very late stage of the project, it left less time than anticipated to complete the study, and therefore a slightly scaled-down version of the literature review was decided upon. Areas of investigation included:

- IPR/copyright. Copyright is an intellectual property right (IPR), covering an output of human intellect. Copyright protects the labour, skill and judgement that someone has expended in the creation of an original work. Usually copyright is retained by the author, which can be defined as, and include, the individual, organisation or institution. If a piece of work is completed as part of employment, the employer will retain copyright in that work. If the author is commissioned to create a piece of work on behalf of someone else, then the author will retain copyright in that work;
- preservation. The saving and storing of data (in digital and/or paper format) for future use, either in a short-term repository or as long-term preservation of material in a format which is transferable and will not become obsolete. The cost and effort of longer term preservation may outweigh the benefits;
- cost. This should be considered in terms of finances, expertise and resources;
- software. This should include software for both storage and distribution;
- data format/standards. The data formats supported by software;
- service level definition. The definition of what a service will provide, for example, user support;
- access control/security. Access and security control for the data, metadata and repository/storage facility;
- user support/training. Support and training should be considered for data creators, depositors and users;
- Open Archives Initiative. The OAI, based at Cornell University, provides the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which runs over web servers and clients to connect data providers to data services. The data provider is somebody who archives information on their site, while the service provider runs the OAI protocol to access the metadata. The OAI provides web pages for data providers and service providers to register so that they can know of each other's existence, and thereby bring about interoperable access. The OAI protocol was originally designed for e-prints, although the OAI acknowledges that it needs to be extended to cover other forms of digital information. The OAI protocol demands that archives, at a minimum, use the Dublin Core metadata format, although parallel sets of metadata in other formats are not prohibited. The main commercial and public domain alternative to the OAI protocol is Z39.50, which is widely used by large archives. The contrast between the two is that Z39.50 supports greater functionality than OAI and is therefore more complex to implement in the HTTP server and the client. OAI ensures access to a freely available repository of research information, but does not address long-term archiving issues.

A report was produced, setting out user requirements, a comparison of technical options and recommendations, including costs. The report was then peer reviewed by subject experts and by a sample of those individuals who were asked to provide input during the requirements analysis stage.

4.0 Requirements Study

A requirements study was undertaken to investigate how services and organisations currently store and distribute data and what they might consider using in the future. The investigation was survey-based and focused on access, licensing, funding and policies. The requirements survey was carried out electronically due to time constraints for completing this work package. Ideally, if time had allowed, face-to-face interviews or focus group discussion sessions would have been conducted, as these could have produced a greater response rate. The survey was conducted via the web, workshops, mailing lists and by emailing UK GIS lecturers and other key academics. The questionnaire was sent to:

- the gogeogeoxwalk mailing list (http://www.jiscmail.ac.uk./lists/gogeogeoxwalk.html);
- ESDS site reps and website (http://www.esds.ac.uk);
- the IBG Quantitative Methods Research Group (http://www.ncl.ac.uk/geps/research/geography/sarg/qmrg.htm);
- the Geographical Information Science Research Group (GIScRG) (http://www.giscience.info/);
- GIS-UK (http://www.jiscmail.ac.uk/lists/GIS-UK.html).

The questionnaire was posted on both the project web pages and the Go-Geo! portal. The first 100 respondents who fully completed the survey received a book token and were entered into the evaluation prize draw to win an Amazon voucher. A copy of the survey questionnaire can be found in Appendix A.
4.1 Results

Fewer responses were gained than initially hoped (ten responses). This was thought to be due to the time of year (exam pressure, marking etc.) as well as survey fatigue amongst respondents, particularly as many of the contacts had also been invited to evaluate the Go-Geo! portal earlier in the year. Those responses we did receive, however, were from experts and were therefore regarded as being of high quality. Percentages have predominantly been used within the following section because some questions were not answered by respondents or, in some cases, multiple answers were given; the number of responses was therefore not always ten.

From the questionnaires submitted, it is clear that the majority of respondents (63%) would like to see a centralised archive and distribution service put into place for the portal. A number of respondents (27%) would like to see a distributed service based on a number of self-archives located around the country. None of the respondents wanted to see a peer-to-peer network set up. The reasons behind these choices (see Table 1) included ease of use, cost, the need to ensure long-term availability (preservation) and the provision of a user support system.

Reason for choice                                       Number of responses
ease of use                                             8
cost                                                    8
need to ensure long-term availability (preservation)    8
copyright issues                                        4
data security                                           3
user support                                            5
depositor support                                       5
other (please specify)                                  0

Table 1. Factors affecting choice of data distribution system

There was a requirement by 66% of respondents for direct contact with depositors to be allowed, although respondents would want trivial questions to be dealt with through good documentation, help facilities and the service provider. Half of the respondents felt that access to the system should be restricted to gatekeepers, e.g. data librarians or service providers. Most of the respondents (85%) would like to see original and new datasets in the archive, and would like them to be made available online (66%). All respondents would like datasets stored in formats such as CSV, shape files and GML, and 62% of respondents would like to see the provision of data with complete supporting material, such as guidance notes and copyright/IPR statements. The majority of respondents (60%) thought that usage of the data should be tracked. Most of the respondents (80%) also thought that data should persist indefinitely, subject to the perceived utility of the data to others.

The respondents felt strongly that data quality and accuracy were very important, as was the implementation of standards for spatial data, data transfer and data documentation. As far as depositing and using data are concerned, the respondents' largest concerns were data quality and liability. Data quality was considered to be a key issue because, if the data are of poor quality, any analysis performed on them will also be less robust. In terms of data deposition, the respondents were least concerned about confidentiality, IPR and the need for specialist support. In the context of using data, they were least concerned about confidentiality, data security and protecting the integrity of the data creator. These scores were slightly contradicted by some comments which showed that data creators do have concerns about IPR: if the rights of the depositor are not protected, they are less likely to wish to deposit their data. In terms of submission of data, the most important factors were considered to be ease of use and the speed of the deposition process.
The table below shows the scores gained for each issue. The results were ranked from 1 to 7, with 1 being the most important; a lower score therefore indicates that an issue is considered more important to the respondents than a higher score.

Issue of concern                                  Depositing data    Using data
IPR                                               43                 35
Confidentiality                                   53                 55
Liability                                         33                 29
Data quality/provenance                           23                 12
Need for specialist support                       47                 34
Data security                                     33                 45
Protecting the integrity of the data creator      33                 44

Table 2. The respondents' concerns for data deposition and use

Responsibility for any conversion or other work required to meet data transfer and preservation standards, and for the active curation of the dataset, was thought to lie with data depositors, archivists and service providers. When considering who should be responsible for maintaining the dataset once a team disbands, there was a mixed response, with the answers including archivists, service providers, data creators, data librarians and a designated team member.

The majority of the respondents (58%) held their own data, but also jointly held data with another organisation and held Crown data. Of those that held their own data, 81% would like to make their data freely accessible to academics. Those who do not hold their own data expressed a preference (60%) for having copyright managed by a central specialist organisation. The majority (82%) of the respondents felt that data creators should not incur royalties, but should also not be charged by service providers for depositing data. Almost half of the respondents (44%) felt that users should be charged to cover service costs, while 56% felt that the service should be free. Most of the respondents (80%) felt that funders should cover researchers' costs of preparing the data for sharing. There was a mixed response as to who or which organisation should be responsible for promoting, facilitating and funding data sharing (see Table 3).

Organisation                                        Promoting   Facilitating   Funding
Funders of data creation awards                     7           5              7
The creators themselves                             4           3              0
The institution within which the creator works      3           6              5
National repositories (where they exist)            0           3              7
The JISC                                            6           0              0
Other (please state who)                            7           6              0

Table 3. Number of responses for who should be responsible for promoting data sharing, facilitating data sharing, and funding a service which facilitates data sharing

4.2 Summary

The results from the survey provide guidance on the direction in which data creators would like to move with regard to a data distribution system. The majority of the respondents would like to see the provision of:

- a centralised archive;
- online access to data;
- copyright agreements and data integrity provided for;
- a service where user and depositor support is provided;
- preservation and archiving facilities provided by the archive;
- a system which is both easy and fast to use;
- supporting material made available;
- a service which is free to deposit and possibly free to use.

From this list, it can be seen that the most obvious recommendation would be to use or set up a traditional type of archive. Second to this choice would be the use of a number of distributed archives. This could include well-established archives such as the UKDA and the Archaeology Data Service (ADS), or could see the setting up of institutional archives.
5.0 Literature Review

A literature review was conducted to investigate three methods of storing and distributing data: self-archiving, peer-to-peer and traditional archiving. The study considered issues such as copyright, cost and software, as well as the advantages and disadvantages of each type of data distribution system. Case studies have also been included, with examples of both software and service usage.

5.1 Self Archiving

Self-archiving allows for the free distribution of data across the web, a medium which provides wide and rapid dissemination of information. The purpose of self-archiving is to make full text documents visible, accessible, harvestable, searchable and useable by any potential user with access to the internet.

To self-archive is to deposit a digital document in a publicly accessible website, for example an OAI-compliant eprint archive (an eprint archive is a collection of digital documents of peer-reviewed research articles, before and after refereeing; before refereeing and publication the draft is called a "preprint", while the refereed, published final draft is called a "postprint" or "eprint"). Depositing involves a simple web interface where the depositor copies or pastes in the metadata (date, author name, title, journal name, etc.) and then attaches the full-text document. Software is also being developed to allow documents to be self-archived in bulk, rather than just one by one.

Self-archiving systems can be either centralised or distributed. There is little difference to users between self-archiving documents in one central archive or in many distributed archives, as users need not know where documents are located in order to find, browse and retrieve them, and the full texts are all retrievable. The standards used for metadata may also not be made apparent to users.

There are two types of self-archive: subject based and institution based. Several subject-based archives are in use at present, including ArXiv and CogPrints. There are also institution-based archives in existence, such as ePrints, based at Southampton University. Distributed, institution-based self-archiving benefits research institutions by:

- maximising the visibility and impact of their own refereed research output;
- maximising researchers' access to the full refereed research output of all other institutions;
- potentially reducing the library's annual serials expenditure to around 10% of its current level (in the form of fees paid to journal publishers for the quality control of their own research output, instead of tolls for accessing other researchers' output).

An institutional library can help researchers to do self-archiving and can maintain the institution's own refereed eprint archives as an outgoing collection for external use, in place of the old incoming collection, bought in via journal costs, for internal use. Institutional library consortial power can also be used to provide leveraged support for journal publishers who commit themselves to a timetable of downsizing to become pure quality-control service providers (Harnad 2001).

The advantages of subject-based archives are that they are specific to the needs and requirements of a discipline. Many of the repositories use OAI to facilitate interoperability between repository servers (Pinfield Sept 2003). Potential problems of self-archiving include quality control, copyright issues and a potential lack of preservation and/or access control.
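Before turning to these problems, the deposit step itself can be made concrete. As a purely illustrative sketch (not taken from any of the systems reviewed here; the dataset, field values and helper function are invented), the following Python fragment shows roughly what a minimal Dublin Core (oai_dc) record for a deposited geospatial dataset might contain; it is this kind of record that an OAI service such as the Go-Geo! catalogue could later harvest, with richer geospatial metadata carried in a parallel format where Dublin Core is too coarse.

    # A minimal sketch of the Dublin Core metadata a self-archiving deposit
    # form might collect for a geospatial dataset. The dataset and field
    # values are hypothetical; the oai_dc and dc namespaces are those
    # required by the OAI protocol.
    import xml.etree.ElementTree as ET

    OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)

    def dublin_core_record(fields):
        """Build an oai_dc:dc element from a dictionary of Dublin Core fields."""
        record = ET.Element("{%s}dc" % OAI_DC)
        for name, value in fields.items():
            ET.SubElement(record, "{%s}%s" % (DC, name)).text = value
        return record

    # Hypothetical deposit: a small geospatial dataset with its documentation.
    record = dublin_core_record({
        "title": "Example land-use survey (hypothetical dataset)",
        "creator": "A. Researcher",
        "date": "2004-07-05",
        "type": "Dataset",
        "format": "ESRI shapefile",
        "coverage": "United Kingdom",
        "rights": "Depositor retains copyright; academic use only",
    })
    print(ET.tostring(record, encoding="unicode"))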
The principal potential problem with self-archiving is actually getting the content for the repository, which requires a cultural change: researchers must be persuaded of the benefits of self-archiving and of data sharing.

To date, self-archiving has been about depositing a digital document, typically a full text document (frequently an eprint, i.e. the digital text of a peer-reviewed research article, before and after refereeing), in a publicly accessible web site. Consideration of the use of self-archiving for depositing copies of datasets seems to have been limited. Further issues, which must be considered, will arise if self-archiving is used for the distribution of datasets. One such issue is that, whilst electronic copies of printed material will have been peer reviewed, this is not the case with datasets, and therefore there is a lack of quality control. Another area of concern is the format of datasets, which may be more complex than simple text documents, especially geospatial datasets, which could be in formats such as SHP files and will require more support for depositors. IPR will also be of greater concern with datasets, particularly geospatial ones, as they may contain more than one source of material, e.g. OS data, government data and primary sources.

5.1.1 Copyright/IPR

There are clear guidelines for copyright and IPR in relation to self-archiving and eprints. It seems that the author retains the copyright for the pre-refereeing preprint (it can therefore be self-archived without seeking anyone else's permission), but not for postprints. For the refereed postprint, the author can try to modify the copyright transfer agreement to allow self-archiving. In those cases where the publisher does not agree to modify the copyright transfer agreement so as to allow the self-archiving of the refereed final draft (postprint), a corrigenda file can instead be self-archived, alongside the already archived preprint, listing the changes that need to be made to turn it into the postprint.

Self-archiving of one's own non-plagiarised texts is in general legal in all cases except the following:

- where exclusive copyright in a "work for hire" has been assigned by the author to a publisher (i.e. the author has been paid, or will be paid royalties, in exchange for the text), the author may not self-archive it. The text is still the author's intellectual property, in the sense that authorship is retained by the author, and the text may not be plagiarised by anyone, but the exclusive right to sell or give away copies of it has been transferred to the publisher;
- where exclusive copyright has been assigned by the author to a journal publisher for a peer-reviewed draft, refereed and accepted for publication by that journal, that draft may not be self-archived by the author (without the publisher's permission).

Questions which should be asked when considering self-archiving as an option include:

- are there any rights in an individual metadata record? If so, who owns them?
- do data providers wish to assert any rights over either individual metadata records or data collections? If so, what do they want to protect, and how might this be done?
- do data providers disclose any rights information relating to the documents themselves, i.e. within the metadata?
- how do service providers ascertain the rights status of the metadata they are harvesting?
- do service providers enhance harvested metadata records, creating new IPR?
- do service providers want to protect their enhanced records? If so, how?
- how do service providers make use of any rights information relating to the documents themselves?

Project RoMEO, based at Loughborough University, is investigating copyright issues. More detailed information, including survey results, can be found on their web site (http://www.lboro.ac.uk/departments/ls/disresearch/romeo/).

5.1.2 Preservation

There is a concern that archived eprints may not be accessible online in the future. This worry is not really about self-archiving, but about the online medium itself. However, this concern may be unnecessary, as it seems that many of the self-archives use OAI and Dublin Core, which ensures they are interoperable. For example, the repository arxiv.org was set up in 1991 and all of its contents are still accessible today. In any case, if Harnad's definition of eprints as duplicates of conventionally published material is used, then preservation could be seen as needless in the short term (Pinfield 2003).

5.1.3 Peer review

Refereeing (peer review) is the system of evaluation and feedback by which expert researchers quality control each other's research findings. The work of specialists is submitted to a qualified adjudicator, an editor, who in turn sends it to experts (referees) to seek their advice about whether the paper is potentially publishable and, if so, what further work is required to make it acceptable. The paper is not published until and unless the requisite revision can be and is done to the satisfaction of the editor and referees (Harnad 2000).

Neither the editor nor the referees are infallible. Editors can err in the choice of specialists, or can misinterpret or misapply referees' advice. The referees themselves can fail to be sufficiently expert, informed, conscientious or fair (Harnad 2000). Nor are authors always conscientious in accepting the dictates of peer review. It is known among editors that virtually every paper is eventually published somewhere: there is a quality hierarchy among journals, based on the rigour of their peer review, all the way down to an unrefereed vanity press at the bottom. Persistent authors can work their way down until their paper finds its own level, not without considerable wasting of time and resources along the way, including the editorial office budgets of the journals and the freely given time of the referees, who might find themselves called upon more than once to review the same paper, sometimes unchanged, for several different journals (Harnad 2000).

The system is not perfect, but no one has demonstrated any viable alternative to having experts judge the work of their peers, let alone one that is at least as effective in maintaining the quality of the literature as the present one is (Harnad 2000). Improving peer review first requires careful testing of alternative systems, and demonstrating empirically that these alternatives are at least as effective as classical peer review in maintaining the quality of the refereed literature. The self-archiving initiative is directed at freeing the current peer-reviewed literature; it is not directed at freeing the literature from peer review (Harnad 2001).

5.1.4 Cost

Referees' services are donated free to virtually all scientific journals, but there is a real cost to implementing the refereeing procedures, which include archiving submitted papers onto a website, selecting appropriate referees, tracking submissions through rounds of review and author revision, making editorial judgements, and so on (Harnad 2001).
The minimum cost of implementing refereeing has been estimated at $500 per accepted article, but even that figure almost certainly has inessential costs wrapped into it (for example, the creation of the publisher's PDF). The true figure for peer-review implementation alone, across all refereed journals, probably averages much closer to $200 per article or even lower. Hence, quality control costs account for only about 10% of the collective tolls actually being paid per article. The optimal solution would be free online data for everyone. The 10% or so quality-control cost could be paid in the form of quality-control service costs, per paper published, by authors' institutions, out of their savings on subscription costs (Harnad 2001).

5.1.5 Software

There are a number of self-archives currently in use; the software for some of these is described below.

ePrints (http://www.eprints.org)

The freely available ePrints software has been designed so that institutions, or even individuals, can create their own OAI-compliant ePrint archive (ePrints include both preprints and postprints, as well as any significant drafts in between, and any post-publication updates). Setting up an archive requires only some space on a web server, and the software is relatively easy to install. The ePrints and self-archiving initiative does not undertake the filtering function of existing libraries and archives, nor their indexing or preservation functions. The document archiving facilities of the ePrints software, developed by Southampton University, can now be extended to provide storage for raw scientific data as well as the capability of interoperable processing. There is already the potential for widespread adoption of the ePrints software by universities and research institutions worldwide for research report archiving.

The ePrints software is one implementation of an OAI protocol-conforming server on a UNIX operating system. It draws on a set of freely available public tools (MySQL, Apache etc.). All OAI-compliant ePrint archives share the same metadata, making their contents interoperable with one another. This means their contents are harvestable by cross-archive search engines such as ARC or Citebase. The limitation in OAI is that the interoperable metadata is available only in Dublin Core. The ePrints software uses a more complex nested metadata structure to represent archived items internally. This is accessible for search from a web page on the archive machine itself. However, the interoperable service provided to OAI data services supports only the Dublin Core subset of this.

The main alterations expected to the ePrints system in the future include:

- improvements to the documentation, which at present is minimal;
- improvements to the installation procedures, which at present require a skilled UNIX administrator to install the software. This should be done using one of the commercial installation packages, such as InstallAnywhere;
- changing the structure of the metadata representation used, by mapping the existing document metadata structure to the hierarchical structure used for data metadata; a new metadata format needs to be defined within the ePrints system to support this mapping;
- further interoperability for information that does not consist only of text objects or multimedia sound/images, but arbitrary raw data. The existing metadata structure supports ePrints that consist of a set of documents, which in turn consist of sets of e-print files (e.g. an HTML document can consist of a set of files).
To implement this new metadata structure it will be necessary to apply:

- new strings bundle files for the new metadata input and browse facilities on the server;
- a new mapping from the data metadata to Dublin Core for OAI access;
- new metadata fields for the representation.

DSpace (http://dspace.org/)

DSpace is an open-source, web-based system produced by the Massachusetts Institute of Technology Libraries. DSpace is a digital asset management software platform that enables institutions to:

- capture and describe digital works using a submission workflow module;
- distribute an institution's digital works over the web through a search and retrieval system;
- store and preserve digital works over the long term.

The DSpace institutional repository software is available for download and runs on a variety of hardware platforms. It functions as a repository for digital research and educational material and can be both modified and extended to meet specific needs. DSpace is Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) compliant and uses Dublin Core, with optional additional fields including abstract, keywords, technical metadata and rights metadata. The metadata is indexed for browsing and can be exported, along with digital data, using an XML schema. OAI support was implemented using OCLC's OAICat open-source software. The OAICat Open Source project is a Java Servlet web application providing an OAI-PMH v2.0 repository framework. This framework can be customised to work with arbitrary data repositories by implementing some Java interfaces. DSpace@MIT is registered as a data provider with the Open Archives Initiative. Other institutions running DSpace may choose whether or not to turn on OAI, and whether or not to register as a data provider.

The DSpace system is freely available as open-source software and, as such, users are allowed to modify DSpace to meet an organisation's specific needs. Open-source tools are freely available with the DSpace application under an open-source licence (not all under the same licence as the one for DSpace itself). The BSD distribution license (http://www.opensource.org/licenses/bsd-license.php) describes the specific terms of use.

DSpace accepts a variety of digital formats; some examples (information taken from http://dspace.org/faqs/index.html#standards) are:

- documents, such as articles, preprints, working papers, technical reports, conference papers and books;
- datasets;
- computer programs;
- visualisations, simulations and other models;
- multimedia publications;
- administrative records;
- bibliographic datasets;
- image, audio and video files;
- learning objects;
- web pages.

Currently, DSpace supports exporting digital content, along with its metadata, in a simple XML-encoded file format. The DSpace developers are working on migrating this export capability to use the Metadata Encoding and Transmission Standard (METS), but are waiting for some necessary extension schemas to emerge (i.e. one for qualified Dublin Core metadata, and one for minimal technical/preservation metadata for arbitrary digital objects). DSpace has documented Java Application Programming Interfaces (APIs) which can be customised to allow interoperation with other systems an institution might be running.
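Because DSpace exposes its metadata over OAI-PMH in Dublin Core, one concrete way another system could interoperate with it is simply to harvest that interface. The Python sketch below is illustrative only: the repository URL is invented and resumptionToken paging is omitted, but the ListRecords verb and oai_dc prefix follow the OAI-PMH v2.0 specification. The same kind of request could be issued by the Go-Geo! portal, or any other OAI service provider, against any of the repositories described in this section.

    # A sketch of harvesting Dublin Core metadata from an OAI-PMH compliant
    # repository such as DSpace. The base URL is hypothetical; the ListRecords
    # verb and oai_dc metadata prefix follow the OAI-PMH v2.0 specification.
    # (resumptionToken paging is omitted for brevity.)
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    def list_records(base_url, metadata_prefix="oai_dc"):
        """Yield (identifier, title) pairs from a single ListRecords response."""
        query = urllib.parse.urlencode(
            {"verb": "ListRecords", "metadataPrefix": metadata_prefix})
        with urllib.request.urlopen(base_url + "?" + query) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI + "record"):
            identifier = record.findtext(OAI + "header/" + OAI + "identifier")
            title = record.findtext(".//" + DC + "title")
            yield identifier, title

    # Hypothetical institutional repository endpoint.
    for identifier, title in list_records("http://repository.example.ac.uk/oai/request"):
        print(identifier, "-", title)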
DSpace identifies two levels of digital preservation: bit preservation and functional preservation:

- bit preservation ensures that a file remains exactly the same over time (not a single bit is changed) while the physical media evolve around it;
- functional preservation allows the file to change over time so that the material continues to be immediately usable in the same way it was originally, while the digital formats (and physical media) evolve over time.

Some file formats can be functionally preserved using straightforward format migration, such as TIFF images or XML documents. Other formats are proprietary, or for other reasons are much harder to preserve functionally. There are three levels of preservation defined for a given format: supported, known or unsupported:

- supported formats will be functionally preserved using either format migration or emulation techniques, for example TIFF, SGML, XML, AIFF and PDF;
- known formats are those whose preservation cannot be guaranteed, such as proprietary or binary formats, but which are so popular that third-party migration tools will probably emerge to help with format migration. Examples include Microsoft Word and PowerPoint, Lotus 1-2-3 and WordPerfect;
- unsupported formats are those about which not enough is known to do any sort of functional preservation. This would include some proprietary formats or a one-of-a-kind software program.

For all three levels, DSpace does bit-level preservation to provide raw material to work with if the material proves to be worth that effort. DSpace developers are working in conjunction with partner institutions (particularly Cambridge University) to develop new upload procedures for converting unsupported or known formats to supported ones where advisable, and to enhance DSpace's ability to capture preservation metadata and to perform periodic format migrations.

Kepler (http://kepler.cs.odu.edu:8080/kepler/index.html)

The original Kepler concept is based on the Open Archives Initiative. Kepler gives users the ability to self-archive publications by means of an "archivelet": a self-contained, self-installing software system that functions as an Open Archives Initiative data provider. An archivelet gives the user the tools to publish a report as it is, posting it to a web site, yet still have a fully OAI-compliant digital library that can be harvested by a service provider. Kepler archivelets are designed to be easy to install, use and maintain. Kepler can be tailored to provide publication and search services with broad and fast dissemination, and it is also interoperable with other communities.

The publication tools to create an archivelet form a downloadable, platform-independent software package that can be installed on individual workstations and PCs. This is different to, for example, the eprints.org OAI-compliant software package, which is intended for an institutional-level service. The archivelet needs to have an extremely easy to use interface for publishing and needs to be an OAI-compliant data provider. The archivelet is expected to store relatively few objects in order to retain independence; a native file system will therefore be used rather than, for example, a database system.

In supporting archivelets, the registration service takes on a bigger role than the registration server plays in regular OAI. The number of archivelets is expected to be on the order of tens of thousands, and their state, in terms of availability, will show great variation.
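The registration service's job of tracking which of these many archivelets are currently available can be illustrated with a small sketch. This is not the Kepler implementation, only an outline of the idea: each archivelet reports a periodic heartbeat, and anything not heard from recently is treated as inactive. The class, method names and timeout value are invented for illustration.

    # Sketch of a registration service tracking archivelet availability.
    # Not the actual Kepler code; names and the timeout value are illustrative.
    import time

    HEARTBEAT_TIMEOUT = 300  # seconds without a heartbeat before marked inactive

    class RegistrationService:
        def __init__(self):
            self.last_seen = {}  # archivelet base URL -> time of last heartbeat

        def register(self, base_url):
            """Called when an archivelet first registers or comes back online."""
            self.last_seen[base_url] = time.time()

        def heartbeat(self, base_url):
            """Called periodically by an active archivelet."""
            self.last_seen[base_url] = time.time()

        def active_archivelets(self):
            """Return the archivelets currently considered available to harvest."""
            now = time.time()
            return [url for url, seen in self.last_seen.items()
                    if now - seen <= HEARTBEAT_TIMEOUT]

    service = RegistrationService()
    service.register("http://archivelet-001.example.ac.uk/oai")  # hypothetical
    print(service.active_archivelets())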
Currently, the OAI registration service keeps track of OAI-compliant archives, and the current registration process is mostly manual. In contrast to data providers at an organisational level, archivelets will switch more frequently between active and non-active states. It will be necessary for the registration service to keep track of the state of the registered archivelets in support of higher-level services. For this, the concept is borrowed from Napster and the instant-messenger model, where the central server keeps track of active clients.

The OAI presents a technical and organisational metadata-harvesting framework designed to facilitate the discovery of content stored in distributed archives. The framework consists of two parts: a set of simple metadata elements (for which OAI uses Dublin Core), and a common protocol to enable extraction of document metadata and archive-specific metadata from participating archives. The OAI also defines two distinct participants: data provider and service provider. The current OAI framework is targeted at large data providers (at the organisation level). The Kepler framework, which is based on the OAI and supports archivelets, is meant for many "little" publishers. The Kepler framework promotes fast dissemination of technical articles by individual publishers. Moreover, it is based on interoperability standards that make it flexible enough to build higher-level services for communities sharing specific interests.

Figure 1 shows the four components of the Kepler framework: OAI-compliant repository, publishing tool, registration service and service provider. The OAI-compliant repository, along with the publishing tool, is also referred to as the archivelet and is targeted at individual publishers.

Figure 1. Kepler framework.

The registration service keeps track of registered archivelets, including their state of availability. The service provider offers high-level services, such as a discovery service that allows users to search for a published document among all registered archivelets. The Kepler framework supports two types of users: individual publishers using the archivelet publishing tool, and general users interested in retrieving published documents. The individual publishers interact with the publishing tool, and the general users interact with a service provider and an OAI-compliant repository using a browser. In a way, the Kepler framework looks very similar to a broker-based peer-to-peer (P2P) network model (Figure 2). Typically, a user is both a data provider and a customer who accesses a service provider, thus the primary mode of operation might be construed as one of exchanging documents.

Figure 2. Kepler framework and peer-to-peer network model.

The archivelet combines the OAI-compliant repository and the publication tool in a downloadable and self-installable component. Only OAI requests are supported, not any other HTTP actions. The basic service part of Kepler is the discovery service Arc.

Figure 3. Kepler architecture.

5.1.6 Case Studies

The following case studies illustrate how the systems discussed above have been used.

ArXiv (http://uk.arxiv.org/)

ArXiv is an e-print service in the fields of physics, mathematics, non-linear science, computer science and quantitative biology. The contents of arXiv conform to Cornell University academic standards. ArXiv is owned, operated and funded by Cornell University, a private not-for-profit educational institution, and is also partially funded by the National Science Foundation.
Started in August 1991, arXiv.org (formerly xxx.lanl.gov) is a fully automated electronic archive and distribution server for research papers. Areas covered include physics and related disciplines, mathematics, nonlinear sciences, computational linguistics and neuroscience. Users can retrieve papers from the archive either through an online world wide web interface or by sending commands to the system via email. Similarly, authors can submit their papers to the archive using the online world wide web interface, ftp or email. Authors can update their submissions if they choose, though previous versions remain available. Users can also register to automatically receive an email listing of newly submitted papers in areas of interest to them, when papers are received in those areas. In addition, the archive provides for distribution list maintenance and the archiving of TeX macro packages and related tools. Mechanisms for searching through the collection of papers are also provided.

CogPrints (http://cogprints.ecs.soton.ac.uk/)

CogPrints is an electronic archive for papers in any area of Psychology, Neuroscience, Linguistics, Computer Science, Biology, Medicine and Anthropology. CogPrints runs on the eprints.org open archive software.

DSpace@Cambridge (http://www.lib.cam.ac.uk/dspace/index.htm)

Cambridge University is undertaking a project to create DSpace@Cambridge. The project is a collaboration between Cambridge University Library, Cambridge University Computing Service and the MIT Libraries, and is funded by a grant from the Cambridge-MIT Institute. DSpace will be developed further as a means for digital preservation and will include the ability to support learning management systems. The project aims to:

- provide a home for digitised material from the University Library's printed and manuscript collections;
- capture, index, store, disseminate and preserve digital materials created in any part of the University;
- contribute to the development of the open source DSpace system, working with other members of the DSpace Federation of academic research institutions;
- act as an exemplar site for UK higher and further education institutions.

Theses Alive! (http://www.thesesalive.ac.uk/index.shtml)

Theses Alive! is based at the University of Edinburgh and is funded under the JISC Focus on Access to Institutional Resources (FAIR) Programme. The Theses Alive! project seeks to promote the adoption of a management system for electronic theses and dissertations (ETDs) in the UK, primarily by creating an online submission system for ETDs that mirrors the current submission process, together with an online repository of digital PhD theses. Including a thesis in the Edinburgh University repository allows access to research findings for a global audience, allowing for wide exposure and recognition. The University will provide metadata from theses held in the Edinburgh University repository to known service providers. By using interoperability standards (e.g. OAI-PMH), service providers can allow researchers from institutes anywhere in the world to easily search and find relevant material. An additional benefit is that, once in a repository, the work is protected from physical damage and loss. Theses Alive! uses DSpace for the archive, as all items have a kind of wrapper in which the parts of the relevant data are stored; this includes all the individual files and the copyright licence.
The metadata is maintained in Dublin Core format in the database for as long as the item remains in the repository. Security settings for the repository are handled via the authorisation policy tool, and the security of the archive depends upon the way that the DSpace administrator configures the policies for each community, collection and item. The DSpace archive is perhaps more geared toward digital preservation, although this issue is still very much under debate. It may be that digital preservation is an issue which is never 'solved' but which requires constant attention from those wishing to preserve, and which may not necessarily have anything to do with the software package in question. For this reason it is hard to be sure which package is going down the correct route, or even whether that route exists.

It was envisaged that the university library would have a role as the key university agent in the thesis publishing process. In this role, it would provide supporting documentation to postgraduate students, via their departments, at the commencement of their dissertations and theses, drawing on the support of the national pilot service. It would ensure that thesis authors are given training in the use of the thesis submission software several months before they are due to submit. It would receive submissions once they had been signed off by the relevant registrar, whether at departmental, faculty or central university level. The signing off is of course the last stage in the academic validation process, and follows on from the successful defence of the thesis by the student and the award of the postgraduate degree. The library therefore takes the role of trusted intermediary in what is essentially a triangular relationship, thus:

1. during the course of their research, the thesis author sends the library the basic metadata for their thesis;
2. the thesis author then submits the full thesis to the university;
3. the university submits the successfully defended thesis to the library;
4. the library matches thesis to metadata, and ensures that the metadata are complete and that the university validation has taken place;
5. the library finally releases the metadata and, if appropriate, the full text of the thesis, including supplementary digital material (to publishers, lenders or sellers).

Once the library is satisfied with the metadata, they are released by the system, and the same set of metadata is used by the various agencies providing publication, loan or sale services. It is not likely that an identical set of metadata will be required by each of these agencies, so the system operated by the library should accommodate a superset, from which appropriate subsets can be generated to meet the requirements of the agencies. It is recommended that the metadata employ an appropriate Dublin Core-based metadata set, using a Document Type Definition suitable for theses and dissertations, and marked up in XML to allow ease of repurposing.

SHERPA

SHERPA (Securing a Hybrid Environment for Research Access and Preservation) is a FAIR project. The main purpose of the SHERPA project, led by the University of Nottingham, is the creation, population and management of several e-print repositories based at partner institutions. These projects can step in to help the process by providing a more stable platform for effective collation and dissemination of research.
The SHERPA project aims to:

- set up thirteen institutional open access e-print repositories which comply with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), using eprints.org software;
- investigate key issues in creating, populating and maintaining e-print collections, including Intellectual Property Rights (IPR), quality control, collection development policies, business models, scholarly communication cultures and institutional strategies;
- work with OAI service providers to achieve acceptable (technical, metadata and collection management) standards for the effective dissemination of the content;
- investigate digital preservation of e-prints using the Open Archival Information System (OAIS) Reference Model;
- disseminate lessons learned and provide advice to others wishing to set up similar services.

SHERPA will work with ePrints UK, which will provide search interfaces and will allow searching of metadata harvested from hubs using web services. OCLC software is used, as well as the University of Southampton OpenURL citations, which results in citation analysis (Pinfield March 2003).

5.2 Peer to Peer Archiving (P2P)

Peer-to-peer (P2P) networking is a communications method in which all parties are equal and any node can operate as either a server or a client (Krishnan 2001). Each party has the same capabilities and either party can initiate a communication session. This differs from client/server architectures, in which some computers are dedicated to serving the others. In some cases, peer-to-peer communication is implemented by giving each communication node both server and client capabilities (such nodes are known as servents). Peer-to-peer is both flexible and scalable and could become an invaluable tool to aid collaboration and data management within a common organisational structure. On the web, P2P refers specifically to a network established by a group of users sharing a networking program, who connect with each other and exchange and access files directly or through a mediating server.

Currently, the most common distributed computing model is the client/server model. In the client/server architecture (Figure 4), clients request services and servers provide those services. A variety of servers exist on today's internet, for example web servers, mail servers, FTP servers and so on. The client/server architecture is an example of a centralised architecture, where the whole network depends on central points, namely servers, to provide services. Without the servers the network would make no sense and the web browsers would not work. Regardless of the number of browsers or clients, the network can exist only if a server exists (Krishnan 2001).

Figure 4. The typical client/server architecture.

Like the client/server architecture, P2P is also a distributed computing model, but there is an important difference. The P2P architecture is decentralised (see Figure 5): neither client nor server status exists in the network. Every entity in the network, referred to as a peer, has equal status, meaning that an entity can either request a service (a client trait) or provide a service (a server trait) (Krishnan 2001).

Figure 5. The peer-to-peer model.

A P2P network also differs from the client/server model in that the P2P network can be considered alive even if only one peer is active; the P2P network is unavailable only when no peers are active (Krishnan 2001).

5.2.1 How Does P2P Work?

The user must first download and execute a peer-to-peer networking program.
After launching the program, the user enters the IP address of another computer belonging to the network (typically, the web page where the user got the download will list several IP addresses as places to begin). Once the computer finds another network member online, it will connect to that user's connection and so on. Users can choose how many member connections to seek at one time and determine which files they wish to share or password protect. The extent of this peer-to-peer sharing is limited to the circle of computer users an individual knows and has agreed to share files with. Users who want to communicate with new or unknown users can transfer files using IRC (Internet Relay Chat) or other similar bulletin boards dedicated to specific subjects. Now that a number of advanced P2P file-sharing applications are available, the reach and scope of peer networks has increased dramatically. The two main models that have evolved are the centralised model and the decentralised model, used by Gnutella. 5.2.2 The Centralised Model of P2P File-Sharing One model of P2P file sharing is based around the use of a central server system (the server-client structure), which directs traffic between individual registered users. The central servers maintain directories of the shared files stored on the respective PCs of registered users of the network. These directories are updated every time a user logs on or off the server network. Each time a user of a centralised P2P file sharing system submits a request or search for a particular file, the central server creates a list of files matching the search request, by cross-checking the request with the server's database of files belonging to users who are currently connected to the network. The central server then displays that list to the requesting user, who can then select the desired file from the list and open a direct HTTP link with the individual computer which currently possesses that file. The download of the actual file takes place directly from one network user to the other. The actual file is never stored on the central server or on any intermediate point on the network. Advantages of the Server-Client Structure One of the server-client model's main advantages is its central index, which locates files quickly and efficiently. As the central directory constantly updates the index, files that users find through their searches are immediately available for download. Another advantage lies in the fact that all individual users, or clients, must be registered to be on the server's network. As a result, search requests reach all logged-on users, which ensures that all searches are as comprehensive as possible. Problems with the Server-Client Model While a centralised architecture allows the most efficient, comprehensive search possible, the central servers are also a single point of failure. As a result, the network could completely collapse if one or several of the servers were to be incapacitated. Furthermore, the server-client model may provide out-of-date information or broken links, as the central server's database is refreshed only periodically. 5.2.3 The Decentralised Model of P2P File-Sharing Unlike a centralised server network, the P2P network does not use a central server to keep track of all user files. To share files using this model, a user starts with a networked computer equipped with P2P software, which will connect to another P2P networked computer. The computer will then announce that it is alive and the message is passed on from one computer to the next.
Once the computer has announced that it is alive to the various members of the peer network, it can then search the contents of the shared directories of the peer network members. The search will send the request to all members of the network. If one of the computers in the peer network has a file which matches the request, it transmits the file information (name, size, etc.) back through all the computers in the pathway, where a list of files matching the search request will then appear on the computer. The file can then be downloaded directly. Advantages of the Decentralised Model The P2P network has a number of distinct advantages over other methods of file sharing. The network is more robust than a centralised model because it eliminates reliance on centralised servers that are potential critical points of failure. The P2P network is designed to search for any type of digital file (from recipes to pictures to java libraries). The network also has the potential to reach every computer on the internet, while even the most comprehensive search engines can only cover 20% of websites available. Messages are also transmitted over the P2P network in a decentralised manner: one user sends a search request to his "friends," who in turn pass that request along to their "friends," and so on. If one user, or even several users, in the network stop working, search requests would still get passed along. Problems with the Decentralised Model Although the reach of this network is potentially infinite, in reality it is limited by "time-to-live" (TTL) constraints; that is, the number of layers of computers that the request will reach. Most network messages which have TTLs that are excessively high will be rejected (a short illustrative sketch of this TTL-limited flooding search is given below). 5.2.4 Advantages and Limitations of using the Peer-to-Peer Network Peer-to-peer networks are generally simpler than server and client networks, but they usually do not offer the same performance under heavy loads. There is also concern about illegal sharing of copyrighted content, for example music files, by some P2P users. Though peers all have equal status in the network, they do not all necessarily have equal physical capabilities. A P2P network might consist of peers with varying capabilities, from mobile devices to mainframes. A mobile peer might not be able to act as a server due to its intrinsic limitations, even though the network does not restrict it in any way (Krishnan 2001). A P2P network delivers a quite different scenario to a client/server network. Since every entity (or peer) in the network is an active participant, each peer contributes certain resources to the network, such as storage space and CPU cycles. As more and more peers join the network, the network's capability increases; hence, as the network grows, it strengthens. This kind of scalability is not found in client/server architectures (Krishnan 2001). The advantages a P2P network offers do, however, create problems. Firstly, managing such a network can be difficult compared to managing a client/server network, where administration is only needed at the central points. The enforcement of security policies, backup policies, and so on, has proven to be complicated in a P2P network. Secondly, P2P protocols are much more "talkative" than typical client/server protocols, as peers join and exit the network at will. This transient nature can trigger performance concerns (Krishnan 2001). Both of the networking models discussed above feature advantages and disadvantages.
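To make the flooding search and its TTL limit concrete, the following is a minimal sketch in Python. It is an in-memory simulation rather than any particular product's protocol: the peer names, shared files, topology and TTL values are all invented for illustration.

```python
# Minimal sketch of a TTL-limited flooding search over an in-memory peer network.
# Peer names, shared files and the topology are illustrative only.
import uuid


class Peer:
    def __init__(self, name, files):
        self.name = name
        self.files = set(files)
        self.neighbours = []
        self.seen = set()          # message IDs already handled, to prevent loops

    def connect(self, other):
        self.neighbours.append(other)
        other.neighbours.append(self)

    def search(self, term, ttl, msg_id=None, hits=None):
        msg_id = msg_id or uuid.uuid4()
        hits = hits if hits is not None else []
        if msg_id in self.seen or ttl <= 0:
            return hits                      # drop: already seen, or TTL exhausted
        self.seen.add(msg_id)
        for f in self.files:
            if term in f:
                hits.append((self.name, f))  # a "hit" routed back to the requester
        for n in self.neighbours:
            n.search(term, ttl - 1, msg_id, hits)   # forward with decremented TTL
        return hits


a = Peer("A", [])
b = Peer("B", ["notes.txt"])
c = Peer("C", ["map.tif"])
d = Peer("D", ["map_2004.tif"])
a.connect(b); b.connect(c); c.connect(d)

print(a.search("map", ttl=3))   # reaches C ("map.tif") but D is one hop too far
print(a.search("map", ttl=4))   # one more hop also finds D's "map_2004.tif"
```

Running the sketch shows how the same query succeeds or fails purely as a function of the TTL, which is the limitation noted under "Problems with the Decentralised Model" above.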
One can visualise from Figure 4 that as a client/server network grows (that is, as more and more clients are added), the pressure on the central point, the server, increases. As each client is added, the central point weakens; its failure can destroy the whole network (Krishnan 2001). Although file sharing is widespread, it presents a number of challenges to organisations through degraded network availability, reduced bandwidth, lost productivity and the threat posed by having copyrighted material on the organisation's network. For example, P2P downloading can easily consume 30 percent of network bandwidth. P2P can also open the way for spyware and viruses to enter the system. P2P provides certain interesting capabilities not possible in traditional client/server networks, which have predefined client or server roles for their nodes (Krishnan 2001). Corporations are looking at the advantages of using P2P as a way for employees to share files without the expense involved in maintaining a centralised server and as a way for businesses to exchange information with each other directly. However, there can be problems with scalability, security and performance. It is also impossible to keep track of who is using the files and for what purpose. The peer-to-peer network may also not be suitable for files which are updated frequently, as updates make already downloaded copies obsolete. 5.2.5 Copyright/IPR A P2P network can be closed, so that files are only open to those that are connected to the network. Individuals can set passwords and only allow access to whom they wish. The network can also be open to all, which has resulted in many reported cases of copyright infringement in association with using P2P systems. The most infamous of these is the Napster case, where the service was eventually shut down. There can be serious copyright issues when using P2P. In the case of organisations such as universities, the data holder will probably rely on their institution's copyright agreements. In the US, there is now a bill against copyright infringement to limit the liability of copyright owners for protecting their works on peer-to-peer networks (Sinrod 2002). A more recent bill enables the imprisonment of people who illegally trade large amounts of copyrighted music online. A House Judiciary subcommittee unanimously approved the "Piracy Deterrence and Education Act of 2004," which, if signed into law, would be the first law to punish internet music pirates with prison. The bill targets people who trade more than 1,000 songs on peer-to-peer networks like Kazaa and Morpheus, as well as people who make and sell bootlegged copies of films still in cinematic release (McGuire 2004). 5.2.6 Preservation Responsibility for preservation must lie with the data holders, as data is stored on individual computers, or peers. Preservation may be possible through the individual's institutional archive or library, if available for deposition, or responsibility may lie with the data creators themselves to ensure that the data is available in an interoperable format. Problems occur if the data creator moves institution, as they may retain the IPR but not copyright of the data. The data may then become obsolete as responsibility for preserving the data is not handed on to someone else. This, coupled with the potential existence of multiple versions and out-of-date data across the network, means that preservation is a large concern with the use of P2P networking.
5.2.7 Cost Broadband ISPs all around the world, especially in Europe, are complaining that P2P traffic is costing them too much and claim that almost 60% of all bandwidth is used for file-swapping. According to the British company CacheLogic, the global cost of P2P networks for ISPs will top £828M (€1148M, $1356M) in 2003 and will triple in 2004. Various ISPs have considered taking measures for restricting users' download habits. One UK example is the cable company ntl, which imposed a 1GB/day limit for its cable modem connections and discovered that thousands of users left the service immediately, taking their digital cable TV accounts to competitors as well. Now some tech companies are trying to invent ways to prioritise the traffic: if the file-trading is done within the ISP's network, the cost for the ISP is minimal compared to intercontinental network connection costs (AfterDawn 2003). Peer-to-peer also raises maintenance costs and the need for high-spec computers. J.P. Morgan Chase found, when evaluating grid computing, that there was an up-front development cost of about $2,000 per desktop, along with an annual maintenance cost of about $600 per desktop (Breidenbach 2001). The costs will also involve the time and expertise needed to set up the network, as well as any legal fees involved with copyright infringement should the need arise. 5.2.8 Security Security has been a major barrier to peer-to-peer adoption. The different platforms being used within a company and across an extranet all have different security systems, and it is hard to get them to interoperate. People end up using the lowest-common-denominator features. The peer-to-peer community is trying to adapt existing security standards such as Kerberos and X.509 certificates (Breidenbach 2001). 5.2.9 Peer-to-Peer Systems In 1999 Napster went into service. Napster was a breakthrough in creating P2P networking. With Napster, files were not stored on a central server; instead, the client software for each user also acts as a server for files shared from that computer. Connection to the central Napster server is made only to find where the files are located at any particular point in time. The advantage of this is that it distributes vast numbers of files over a vast number of "mini servers". After a series of legal battles for copyright infringement, by mid 2001 Napster was all but shut down; however, several other P2P services rose up to fill the niche (Somers). Examples of other peer-to-peer systems are described in the sections below. Edutella (http://edutella.jxta.org/) Edutella is a P2P system which aims to connect heterogeneous educational peers with different types of repositories, query languages and different kinds of metadata schemata (Hatala 2003). The overall goal of Edutella is to facilitate the reuse of globally distributed learning resources by creating an open-source, peer-to-peer system for the exchange of RDF-based metadata (Wilson 2001). The project also aims to provide the metadata services needed to enable interoperability between heterogeneous JXTA applications. Initial aims are to create: a replication service – to provide data persistence/availability and workload balancing while maintaining data integrity and consistency; a mapping service – to translate between different metadata vocabularies to enable interoperability between different peers (a small sketch of this idea follows after this list); an annotation service – to annotate materials stored anywhere in the Edutella network.
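As flagged above, the mapping service idea, translating between metadata vocabularies so that peers using different schemas can still answer one another's queries, can be sketched in a few lines of Python. The property names, identifiers and the tuple-based store below are invented for illustration; a real Edutella peer exchanges RDF statements and RDF-QEL queries rather than Python tuples.

```python
# Illustrative sketch of a metadata vocabulary "mapping service".
# Property names, identifiers and the tuple-based store are invented; a real
# Edutella peer works on RDF statements and RDF-QEL queries.

# A peer's local store: (subject, predicate, object) triples in its own vocabulary.
LOCAL_STORE = [
    ("urn:res:42", "lom:general.title", "Soil survey of the Tweed valley"),
    ("urn:res:42", "lom:general.language", "en"),
]

# Mapping from the querying peer's vocabulary (Dublin Core style) to the local one.
DC_TO_LOM = {
    "dc:title": "lom:general.title",
    "dc:language": "lom:general.language",
}


def answer_query(predicate, value_contains):
    """Translate the incoming predicate, then match against the local store."""
    local_predicate = DC_TO_LOM.get(predicate, predicate)
    return [s for (s, p, o) in LOCAL_STORE
            if p == local_predicate and value_contains.lower() in o.lower()]


# A Dublin Core style query arriving from another peer:
print(answer_query("dc:title", "tweed"))   # -> ['urn:res:42']
```

The point of the sketch is only that the query never needs to know which vocabulary the answering peer uses locally; the mapping layer absorbs the difference.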
The providers of metadata for the system will be anyone with content they want to make available. This includes anything from individual teachers and students to universities and other educational institutions (Wilson 2001). Edutella is a metadata-based peer-to-peer system, able to integrate heterogeneous peers (using different repositories, query languages and functionalities), as well as different kinds of metadata schemas. Finding common ground is essential: the assumption is that all resources maintained in the Edutella network can be described in RDF, and that all functionality in the Edutella network is mediated through RDF statements and queries on them. For the local user, the Edutella network transparently provides access to distributed information resources, and different clients/peers can be used to access these resources. Each peer will be required to offer a number of basic services and may offer additional advanced services (Nejdl 2002). Edutella Architecture Edutella is based on JXTA, an Open Source project supported and managed by Sun Microsystems. JXTA is a set of XML-based protocols to cover typical P2P functionality and provides a Java binding offering a layered approach for creating P2P applications (core, services, applications; see Figure 6). In addition to remote service access (such as offered by SOAP), JXTA provides additional P2P protocols and services, including peer discovery, peer groups and peer monitors. Therefore JXTA is a very useful framework for prototyping and developing P2P applications (Nejdl 2002). Figure 6: JXTA Layers Edutella services complement the JXTA service layer, building upon the JXTA core layer, with Edutella peers on the application layer. Applications use the functionality provided by these Edutella services as well as, possibly, other JXTA services (Nejdl 2002). On the Edutella service layer, data exchange formats and protocols are defined (how to exchange queries, query results and other metadata between Edutella peers), as well as APIs for advanced functionality in a library-like manner. Applications such as repositories, annotation tools or user interfaces connected to and accessing the Edutella network are implemented on the application layer (Nejdl 2002). The Edutella query service is intended to be a standardised query exchange mechanism for RDF metadata stored in distributed RDF repositories. It is meant to serve both as the query interface for individual RDF repositories located at single Edutella peers and as the query interface for distributed queries spanning multiple RDF repositories. An RDF repository consists of RDF statements (or facts) and describes metadata according to arbitrary RDFS schemas (Nejdl 2002). One of the main purposes is to abstract from various possible RDF storage layer query languages (e.g., SQL) and from different user-level query languages (e.g., RQL, TRIPLE). The Edutella query exchange language and the Edutella common data model provide the syntax and semantics for an overall standard query interface across heterogeneous peer repositories for any kind of RDF metadata. The Edutella network uses the query exchange language family RDF-QEL-i (based on Datalog semantics and subsets) as its standardised query exchange format, which is transmitted in RDF/XML (Nejdl 2002). Edutella peers are highly heterogeneous in terms of the functionality they offer. A simple peer has RDF storage capability only, where the peer has some kind of local storage for RDF triples, e.g.
a relational database, as well as some kind of local query language, e.g. SQL. In addition, the peer might offer more complex services such as annotation, mediation or mapping (Nejdl 2002). This results in an exchange from the local format to the peer and vice versa, and to connection of the peer to the Edutella network by a JXTA-based P2P library. To handle queries, 23 the wrapper uses the common Edutella query exchange format and data model for query and result representation. For communication with the Edutella network, the wrapper translates the local data model into the Edutella common data model ECDM and vice versa, and connects to the Edutella network using the JXTA P2P primitives, transmitting the queries based on the common data model ECDM in RDF/XML form (Nejdl 2002). In order to handle different query capabilities, several RDF-QEL-i exchange language levels are defined, describing which kind of queries a peer can handle (conjunctive queries, relational algebra, transitive closure, etc.) The same internal data model is used for all levels. To enable the peer to participate in the Edutella network, Edutella wrappers are used to translate queries and results from the Edutella query (Nejdl 2002). Freenet Freenet is a completely distributed decentralised peer-to-peer system. Communication is handled entirely by peers operating at a global level (Watson 2003). Freenet is a free software which lets you publish and obtain information on the internet without fear of censorship. To achieve this freedom, the publishers and consumers of information are anonymous. The system operates as a location-independent distributed file system across many individual computers that allows files to be inserted, stored, and requested anonymously. A node is simply a computer that is running the Freenet software, and all nodes are treated as equals by the network (Watson 2003). Users contribute to the network by giving bandwidth and a portion of their hard drive (called the data store) for storing files. Each node maintains its own local data store which it makes available to the network for reading and writing, as well as dynamic routing table containing addresses of other nodes and the keys that they are thought to hold (Watson 2003). Unlike other peer-to-peer file sharing networks, Freenet does not let the user control what is stored in the data store. Instead, files are kept or deleted depending on how popular they are, with the least popular being discarded to make way for newer or more popular content. Communications by Freenet nodes are encrypted and are routed-through other nodes to make it extremely difficult to determine who is requesting the information and what its content is. This removes any single point of failure or control. By following the Freenet protocol, many such nodes spontaneously organise themselves into an efficient network (Watson 2003). Files in the data store are encrypted to reduce the likelihood of prosecution by persons wishing to censor Freenet content. The network can be used in a number of different ways and is not restricted to just sharing files like other peer-to-peer networks. It acts more like an internet within an internet. For example, Freenet can be used for: publishing websites or 'free sites'; communicating via message boards; content distribution. 
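The data store behaviour described above, where files are kept or discarded according to how recently they have been requested and the least popular are evicted first, can be sketched as a small fixed-size cache. The capacity, keys and values below are purely illustrative; a real Freenet node stores encrypted blocks addressed by keys and its retention behaviour is more involved.

```python
# Minimal sketch of a Freenet-style data store: a fixed-size cache that evicts
# the least recently requested item to make room for new or more popular content.
# Capacity, keys and values here are purely illustrative.
from collections import OrderedDict


class DataStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()          # least recently requested item first

    def insert(self, key, data):
        self.items[key] = data
        self.items.move_to_end(key)         # newly inserted content counts as popular
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)  # discard the least popular item

    def request(self, key):
        if key not in self.items:
            return None                     # in a real node the request is forwarded on
        self.items.move_to_end(key)         # each request renews the item's popularity
        return self.items[key]


store = DataStore(capacity=2)
store.insert("CHK@aaa", b"first file")
store.insert("CHK@bbb", b"second file")
store.request("CHK@aaa")                    # keep the first file popular
store.insert("CHK@ccc", b"third file")      # evicts CHK@bbb, the least requested
print(list(store.items))                    # ['CHK@aaa', 'CHK@ccc']
```

The key property is the one stressed in the text: the node operator does not decide what stays in the store; demand does.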
The system is designed to respond adaptively to usage patterns, transparently moving, replicating, and deleting files as necessary to provide efficient service without resorting to broadcast searches or centralised location indexes (Watson 2003). Freenet enables users to share unused disk space (Watson 2003). It is intended that most users of the system will run nodes to: provide security guarantees against inadvertently using a hostile node; provide resistance to attempts by third parties to deny access to information and prevent censorship of documents; provide efficient dynamic storage and routing of information; increase the storage capacity available to the network as a whole; provide anonymity for both producers and consumers of information; decentralise all network functions, removing any single point of failure or control. The system operates at the application layer and assumes the existence of a secure transport layer, although it is transport-independent. It does not seek to provide anonymity for general network usage, only for Freenet file transactions (Watson 2003). Unlike many cutting-edge projects, Freenet is well established and has been downloaded by over two million users since the project started. It is used for the distribution of censored information all over the world, including China and countries in the Middle East. Freenet Architecture Freenet is implemented as an adaptive peer-to-peer network of nodes that query one another to store and retrieve data files, which are named by location-independent keys: a Keyword-Signed Key (KSK) is based on a short descriptive string chosen by the user when inserting a file. This string is hashed to yield the KSK. To allow others to retrieve a document, publishers only have to publish this string; a Signed-Subspace Key (SSK) is used to identify a personal subspace. An SSK allows a user to build a reputation by publishing documents while remaining anonymous but still identifiable; a Content-Hash Key (CHK) allows a node to check that the document it just received is a genuine copy and has not been tampered with. CHKs also allow an author to update a document, if the author uses a private subspace (with SSKs). The basic model is: keys are passed along from node to node through a chain of requests in which each node makes a local decision about where to send the request next, in the style of internet protocol routing; depending on the key requested, the routes will vary. The routing algorithms adaptively adjust routes over time to provide efficient performance while using only local, rather than global, knowledge. Each node has knowledge only of its immediate upstream and downstream neighbours, to maintain privacy; each request is given a hops-to-live limit, which is decremented at each node to prevent infinite chains; each request is also assigned a pseudo-unique random identifier, so that nodes can prevent loops by rejecting requests they have seen before; this process continues until the request is either satisfied or has exceeded its hops-to-live limit. The success or failure is then passed back up the chain to the sending node (Watson 2003). Requests In order to make use of Freenet's distributed resources, a user must initiate a request. Requests are messages that can be forwarded through many different nodes. Initially the user forwards the request to a node that he or she knows about and trusts.
If a node does not have the document that the requester is looking for, it forwards the request to another node that, according to its information, is more likely to have the document (Watson 2003). The reply is passed back through each node that forwarded the request, back to the original node that started the chain. Each node in the chain may cache the reply locally, so that it can reply immediately to any further requests for that particular document. This means that commonly requested documents are cached on more nodes, and thus there is no overloading effect on one node (Watson 2003). Performance Analysis User benchmarks include: how long will it take to retrieve a file and how much bandwidth will a query consume. These both have direct impact on the usability and success of the system. Network connections such as ADSL and cable modems, favour client over server usage. This has resulted in the network infrastructure being optimised for computers that are only clients, not servers. The problem is that peer-to-peer applications are changing the assumption that end users only want to download from the internet, never upload to it. Peer-to-peer technology generally makes every host act as both as a client and a server; the asymmetric assumption is 25 incorrect. The network architecture is going to have to change to handle this new traffic pattern (Watson 2003). Problems which affect the performance of the decentralised peer-to-peer network of Freenet are: in network communication, connection speed dominates processor and I/O speed as the bottleneck. This problem is emphasised by the highly parallel nature of Freenet; as there is no central master index maintained, messages must be passed over many hops, in order to search through the system to find the data. Each hop not only adds to the total bandwidth load but also increases the time needed to perform a query. If a peer is unreachable it can take several minutes to time out the connection; peer-to-peer communities depend on the presence of a sufficient base of communal participation and cooperation in order to function successfully. (Watson 2003). Small World Effect The small world effect is fundamental to Freenet’s operation. It is important because it defines the file location problem in a decentralised, self-configuring P2P network like Freenet. In Freenet, queries are forwarded from one peer to the next according to local decisions about which potential recipient might make the most progress towards the target. Freenet messages are not targeted to a specific named peer but towards any peer having a desired file in its data store (Watson 2003). Two characteristics that distinguish small-world networks: a small average path length, typical of random graphs; a large clustering coefficient that is independent of network size. The clustering coefficient captures how many of a node’s neighbours are connected to each other. Despite the large network, short routes must exist. In a simulation of a Freenet network, with 1,000 identical nodes, which were initially empty, half of all requests in the mature network succeed within six hops. A quarter of requests succeed within just three hops or fewer. This compares with the internet, as a small world network, with a characteristic path length of 19 hops (Watson 2003). Freenet has good average performance but poor worst case performance, because a few bad routing choices can throw a request completely off the track. 
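A rough sketch of the request routing just described, where each node forwards a request towards whichever neighbour appears "closest" to the requested key, backtracks on failure, and gives up when the hops-to-live limit is exhausted, is given below. The numeric node identifiers, the closeness measure and the hops-to-live value are simplifications invented for illustration; real Freenet routes on key hashes and adapts its routing tables over time.

```python
# Sketch of key-closeness routing with a hops-to-live (HTL) limit.
# Node identifiers, the closeness metric and the HTL value are illustrative only.


class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.store = {}          # key -> data held in this node's data store
        self.neighbours = []

    def route(self, key, htl, visited=None):
        visited = visited if visited is not None else set()
        visited.add(self.node_id)
        if key in self.store:
            return self.store[key]           # success is passed back up the chain
        if htl <= 0:
            return None                      # hops-to-live exhausted
        # Forward to the unvisited neighbour whose id is "closest" to the key,
        # backtracking to the next-closest neighbour if that branch fails.
        candidates = [n for n in self.neighbours if n.node_id not in visited]
        for n in sorted(candidates, key=lambda n: abs(n.node_id - key)):
            result = n.route(key, htl - 1, visited)
            if result is not None:
                return result
        return None


nodes = {i: Node(i) for i in (10, 25, 40, 60, 80)}
nodes[10].neighbours = [nodes[25], nodes[80]]
nodes[25].neighbours = [nodes[10], nodes[40]]
nodes[40].neighbours = [nodes[25], nodes[60]]
nodes[60].store[62] = b"requested document"     # key 62 is held by node 60

print(nodes[10].route(62, htl=4))   # dead end at 80, then routed 25 -> 40 -> 60
```

The example also shows the "poor worst case" behaviour mentioned above: the first hop towards 80 looks closest to the key but is a dead end, and the request only succeeds because the algorithm backtracks while it still has hops to live.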
Aspects of robustness affect Freenet, as all peer-to-peer systems coping with the unreliability of peers. Since peers tend to be PC’s rather than dedicated servers, they are often turned off or disconnected from the network at random (Watson 2003). Trust and Accountability By signing software they make available for download, authors can provide some assurance that their code has not been tampered with and facilitate the building of a reputation associated with their name key (Watson 2003). The lesson for peer-to-peer designers is that without accountability in a network, it is difficult to enforce rules of social responsibility. Just like email, today’s peer-to-peer systems run the risk of being overrun by unsolicited advertisements (Watson 2003). Security Firewalls and dynamic internet protocol grew out of the clear need in internet architecture to make scalable, secure systems. Firewalls stand at the gateway between the internal network and the internet outside and are a very useful security tool, but they pose a serious obstacle to peer-to-peer communication models. New peer-to-peer applications challenge this architecture, demanding that participants serve resources as well as use them (Watson 2003). Port 80 is conventionally used by HTTP traffic when people browse the web. Firewalls typically filter traffic based on the direction and the destination port of the traffic. Most current peer-topeer applications have some way to use port 80 in order to circumvent network security policies. The problem lies in that there is no good way you can identify what applications are 26 running through it. Also, even if the application has a legitimate reason to go through the firewall, there is no simple way to request permission (Watson 2003). As no node can tell where a request came from beyond that node that forwarded the request to it, it is very difficult to find the person who started the request. Freenet does not provide perfect anonymity because it balances paranoia against efficiency and usability. If someone wants to find out exactly what you are doing, then given the resources, they will. Freenet does, however, seek to stop mass, indiscriminate surveillance of people (Watson 2003). Legal Issues As Freenet can potentially contain illegal information, it provides deniability that the owner of the computer/the node, knows nothing of what is stored on his/her computer, due to the encryption that Freenet provides (Watson 2003). Advantages and Disadvantages of Freenet Some of the advantages and disadvantages of Freenet are described in the sections below. Advantages: Freenet is solving many of the problems seen in centralised networks. Popular data, far from being less available as requests increase, become more available as nodes cache it. This is the correct reaction of a network storage system to popular data; Freenet also removes the single point of attack for censors, the single point of technical failure, and the ability for people to gather large amounts of personal information about a reader; Freenet’s niche is in the efficient and anonymous distribution of files. It is designed to find a file in the minimum number of node-to-node transactions. Additionally, it is designed to protect the privacy of the publisher of the information, and all intervening nodes though which the information passes. Disadvantages: it is designed for file distribution and not fixed storage. 
It is not intended to guarantee permanent file storage, although it is hoped that a sufficient number of nodes will join with enough storage capacity that most files will be able to remain indefinitely; Freenet does not yet have a search system, because designing a search system which is sufficiently efficient and anonymous can be difficult; node operators cannot be held responsible for what is being stored on their hard drives. Freenet is constantly criticised because you have to donate your personal hard drive space to a group of strangers who may very well use it to host content that you disapprove of; Freenet is designed so that if the file is in the network, the path to the file is usually short. Consequently, Freenet is not optimised for long paths, which can therefore be very slow; self-organising file-sharing systems like Freenet are affected by the popularity of files, and hence may be susceptible to the tyranny of the majority (Watson 2003). Gnutella Gnutella was originally designed by Nullsoft, a subsidiary of America Online (AOL), but development of the Gnutella protocol was halted by AOL management shortly after being made available to the public. During the few hours it was available to the public, several thousand downloads occurred. Using those downloads, programmers created their own Gnutella software packages. Gnutella is a networking protocol, which defines a manner in which computers can speak directly to one another in a completely decentralised fashion. The content that is available on the Gnutella network does not come from web sites or from the publishers of Gnutella-compatible software; it comes from other users running Gnutella-compatible software on their own computers. Software publishers such as Lime Wire LLC have written and distributed programs which are compatible with the Gnutella protocol, and which therefore allow users to participate in the Gnutella network. GnutellaNet is currently one of the most popular of these decentralised P2P networks because it allows users to exchange all types of files. The Gnutella Structure Unlike a centralised server network, the Gnutella network (gNet) does not use a central server to keep track of all user files. To share files using the Gnutella model (Figure 7), a user starts with a networked computer, equipped with a Gnutella servent. Computer "A" will connect to another Gnutella-networked computer, "B." A will then announce that it is "alive" to B, which will in turn announce to all the computers that it is connected to, "C," "D," "E," and "F," that A is alive. The computers C, D, E, and F will then announce to all computers to which they are connected that A is alive; those computers will continue the pattern and announce to the computers they are connected to that computer A is alive. Once "A" has announced that it is "alive" to the various members of the peer network, it can then search the contents of the shared directories of the peer network members. The search request is sent to all members of the network, starting with B, then C, D, E and F, which will in turn send the request to the computers to which they are connected, and so forth. If one of the computers in the peer network, say for example computer D, has a file which matches the request, it transmits the file information (name, size, etc.) back through all the computers in the pathway towards A, where a list of files matching the search request will then appear on computer A's Gnutella servent display.
A will then be able to open a direct connection with computer D and will be able to download that file directly from computer D. The Gnutella model thus enables file sharing even though the machines which relay queries do not themselves directly serve the content. Figure 7. The Gnutella Structure Technical Overview The Gnutella protocol is run over TCP/IP, a connection-oriented network protocol. A typical session comprises a client connecting to a server. The client then sends a Gnutella packet advertising its presence. This advertisement is propagated by the servers through the network by recursively forwarding it to other connected servers. All servers that receive the packet reply with a similar packet about themselves (Creedon 2003). Queries are propagated in the same manner, with positive responses being routed back the same path. When a resource is found and selected for downloading, a direct point-to-point connection is made between the client and the host of the resource, and the file downloaded directly using HTTP. The server in this case will act as a web server capable of responding to HTTP GET requests (Creedon 2003). Gnutella packets are of the form: Message ID (16 bytes); Function ID (1 byte); TTL (1 byte); Hops (1 byte); Payload length (4 bytes) (Table 4. The form of Gnutella packets; a short packing and parsing sketch is given below). Where: the message ID, in conjunction with a given TCP/IP connection, is used to uniquely identify a transaction; the function ID is one of Advertisement[response], Query[response] or Push-Request; TTL is the time-to-live of the packet, i.e. how many more times the packet will be forwarded; hops counts the number of times a given packet has been forwarded; payload length is the length in bytes of the body of the packet. Connecting A client finds a server by trying to connect to any of a local list of known servers that are likely to be available. This list can be downloaded from the internet, or be compiled by the end user, comprising, for example, servers run by friends, etc. The Advertisement packets comprise the number of files the client is sharing, and the size in kilobytes of the shared data. The server replies comprise the same information. Thus, once connected, a client knows how much data is available on the network. Queries As mentioned above, queries are propagated the same way as Advertisements. To save bandwidth, servers that cannot match the search parameters need not send a reply. The semantics of matching search parameters are not defined in the current published protocol. The details are server dependent. For example, a search for ".mp3" could be interpreted as all files with the same file extension, or any file with "mp3" in its name, etc. Downloading A client wishing to make a download opens an HTTP (hypertext transfer protocol) connection to the host and requests the resource by sending a "GET <URL>" type HTTP command, where the URL (Uniform Resource Locator) is returned by a query request. Hence, a client sharing resources has to implement a basic HTTP (i.e. web) server. Firewalls A client residing behind a firewall trying to connect to a Gnutella network will have to connect to a server running on a "firewall-friendly" port. Typically this will be port 80, as this is the reserved port number for HTTP, which is generally considered secure and non-malicious.
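As noted against Table 4, the 23-byte header can be made concrete with a short packing and parsing sketch. The byte order, the numeric function ID and the example payload are assumptions for illustration; deployed Gnutella clients differ in detail from the simplified description given here.

```python
# Sketch of packing and parsing the 23-byte packet header from Table 4
# (Message ID, Function ID, TTL, Hops, Payload length).
# Byte order, the Query function ID value and the payload are assumptions.
import struct
import uuid

HEADER_FORMAT = "<16sBBBI"     # 16-byte message ID, three 1-byte fields, 4-byte length
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)    # 23 bytes

QUERY = 0x80                   # hypothetical function ID for a Query packet


def pack_packet(message_id, function_id, ttl, hops, payload):
    header = struct.pack(HEADER_FORMAT, message_id, function_id, ttl, hops, len(payload))
    return header + payload


def parse_packet(packet):
    message_id, function_id, ttl, hops, length = struct.unpack(HEADER_FORMAT, packet[:HEADER_SIZE])
    return {"message_id": message_id.hex(), "function_id": function_id,
            "ttl": ttl, "hops": hops, "payload": packet[HEADER_SIZE:HEADER_SIZE + length]}


packet = pack_packet(uuid.uuid4().bytes, QUERY, ttl=5, hops=0, payload=b"map.tif")
print(parse_packet(packet))
# A forwarding peer would decrement ttl, increment hops, and drop the packet once ttl reaches 0.
```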
When a machine hosting a resource cannot accept HTTP connections because it is behind a firewall, it is possible for the client to send a "Push-Request" packet to the host, instructing it to make an outbound connection to the client on a firewall-friendly port, and "upload" the requested resource, as opposed to the more usual client "download" method. The other 29 permutation, where both client and server reside behind firewalls renders the protocol nonfunctional (Creedon 2003). Advantages and Limitations of Gnutella The Gnutella network has a number of distinct advantages over other methods of file sharing. Namely, the Gnutella network is decentralised and hence more robust than a centralised model because it eliminates reliance on centralised servers that are potential critical points of failure. Messages are also transmitted over Gnutella network in a decentralised manner: One user sends a search request to his "friends," who in turn pass that request along to their "friends," and so on. If one user, or even several users, in the network stop working, search requests would still get passed along. The Gnutella network is designed to search for any type of digital file (from recipes to pictures to java libraries). The Gnutella network also has the potential to reach every computer on the internet, while even the most comprehensive search engines can only cover 20% of websites available. Improvements have enabled the resulting GnutellaNet to overcome major obstacles and experience substantial growth. It is estimated that the network currently encompasses about 25,000 simultaneous peers with approximately a quarter million unique peers active on any given day (Truelove 2001). The crux of the Gnutella protocol is its focus on decentralised P2P file-sharing and distributed searching. Such peer-to-peer content searching is not core to other services such as JXTA. Instead, this is precisely the sort of functionality well-suited for the service level between the core and applications. In Gnutella, search and file transfers are cleanly separated, and an established protocol, HTTP, is used for the latter. Standard web browsers can access Gnutella peers (provided they are not blocked for social-engineering reasons), which are in essence transient web sites. As it is an open protocol, the end-user interface and functionality are separable from the underlying network. The result is a sizable number of interoperable applications that interact with the common GnutellaNet. Gnutella's query-and-response messages could be put to use as a bid-response mechanism for peer-to-peer auctioning (Truelove 2001). The principal shortcomings in the protocol are: scalability - the system was designed in a laboratory and set up to run with a few hundred users. When it became available on the internet, it quickly grew to having a user base of tens of thousands. Unfortunately at that stage the system became overloaded and was unable to handle the amount of traffic and nodes that were present in the system. Concepts that had looked good in a laboratory were showing signs of stress right from the start; packet life - to find other users, a packet has to be sent out into the network. It became apparent early on that the packet life on some packets had not been set right and a build up of these packets started circulating around the network indefinitely. 
This resulted in less bandwidth being available on the network for users; connection speeds of users - users on the system act as gateways to other users to find the data they need. However, not every user had the same connection speed. This has led to problems, as users on slower bandwidth machines were acting as connections to people on higher bandwidth. This resulted in connection speeds being dictated by the slowest connection on the link to the data, thereby leading to bottlenecks. Furthermore, the entire network is not visible to any one client. Using the standard time-to-live, during advertisement and search, only about 4,000 peers are reachable. This arises from the fact that each client only holds connections to four other clients and a search packet is only forwarded five times. In practical terms this means that even though a certain resource is available on the network, it may not be visible to the seeker because it is too many nodes away. To increase the number of reachable peers in the Gnutella network we would need to increase the time-to-live for packets and the number of connections kept open. Unfortunately this gives rise to other problems: if we were to increase both the number of connections and the number of hops to eight, a single query would fan out to several million peers, and 1.2 gigabytes of aggregate data could potentially be crossing the network just to perform an 18-byte search query (Creedon 2003). Another significant issue that has been identified is Gnutella's susceptibility to denial of service attacks. For example, a burst of search requests can easily saturate all the available bandwidth in the attacker's neighbourhood, as there is no easy way for peers to discriminate between malicious and genuine requests. Some workarounds to the problem have been presented, but in each case there are significant compromises to be made. All in all, the overall quality of service of the Gnutella network is very poor. This is due to a combination of factors, some of them deriving directly from the characteristics of the protocol, others induced by the users of the network themselves. For example, users who are reluctant to concede any outgoing bandwidth will go to great lengths to prevent others from downloading files that they are 'sharing'. Similarly, the ability to find a certain file will largely depend on the naming scheme of the user that makes certain files available. A conspiracy theorist will of course argue that the above tactics are being used by record companies to undermine the peer-to-peer revolution (Creedon 2003). JXTA JXTA is a set of protocols that can be implemented in any language and will allow distributed client interoperability. It provides a platform to perform the most basic functionality required by any P2P application: peer discovery and peer communication (Krikorian 2001). At the same time, these protocols are flexible enough to be easily adapted to application-specific requirements. While JXTA does not dictate any particular programming language or environment, Java could potentially become the language of choice for P2P application development due to its portability, ease of development, and a rich set of class libraries (Krishnan 2001). At its core, JXTA is simply a protocol for inter-peer communication. Each peer is assigned a unique identifier and belongs to one or more groups in which the peers cooperate and function similarly under a unified set of capabilities and restrictions.
JXTA provides protocols for basic functions, for example, creating groups, finding groups, joining and leaving groups, monitoring groups, talking to other groups and peers, sharing content and services. All of these functions are performed by publishing and exchanging XML advertisements and messages between peers. JXTA includes peer-monitoring hooks that will enable the management of a peer node. People can build visualization tools that show how much traffic a particular node is getting. With such information, a network manager can decide to increase or throttle bandwidth on various nodes, or implement a different level of security (Breidenbach 2001). The JXTA Structure Conceptually, each peer in JXTA abstracts three layers: the core layer, the services layer, and the applications layer. The core layer is responsible for managing the JXTA protocol and it should encapsulate the knowledge of all basic P2P operations. The core creates an addressing space separate from IP addresses by providing each peer its own unique peer ID, whilst also boot-strapping each peer into a peer group. Through protocols which the core knows how to implement, a JXTA peer can locate other peers and peer groups to join (in a secure, authenticated manner, if desired). The core layer can also open a pipe, a very simple one-way message queue, to another peer or group of peers. By using pipes, distinct parties can communicate with each other. The services layer serves as a place for common functionality which more than one, but not necessarily every, P2P program might use. For example, Sun has released a Content Management System (CMS) service that has been implemented as a JXTA service. CMS provides a generic way for different applications to share files on JXTA peers so that decentralised Gnutella-like searches may be performed against all of them. Once specific content has been located, the CMS provides ways for peers to download content so the services layer provides library-like functionality, which JXTA applications can control via logic located in application layer (Krikorian 2001). The third layer is where the P2P application truly lives. The upper layer might host a user interface, so that the user can control different services, or it might be where the logic of an autonomous application operates. For example, a simple chat program can be built on this layer, making use of both the service and the core layer to allow people to send messages back and forth to each other. P2P applications should be fairly easy to build once a developer 31 is familiar with JXTA, as the platform provides the basic peer-to-peer framework (Krikorian 2001). JXTA Developments As part of its open source release, Sun is distributing a preliminary Java binding for JXTA with the goal of having early-adopter engineers create simple P2P applications in Java. Sun's binding is not complete, however as interfaces, implementations, and protocols are likely to change (Krikorian 2001). Project JXTA is building core network computing technology to provide a set of simple, small, and flexible mechanisms that can support P2P computing on any platform, anywhere, at any time. The project is first generalising P2P functionality and then building core technology that addresses today's limitations on P2P computing. The focus is on creating basic mechanisms and leaving policy choices to application developers (Krishnan 2001). 
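The "pipe" concept mentioned above, a simple one-way message queue that a receiving peer makes available and other peers write into, can be illustrated with a small in-process sketch. This is a conceptual analogue only: it does not use the real JXTA protocols or Java binding, and the class and identifier names are invented.

```python
# Conceptual sketch of a one-way "pipe" between peers, loosely analogous to the
# JXTA pipe described above. This is an in-process illustration only; it does not
# use the real JXTA protocols or Java binding, and all names are invented.
import queue
import threading


class Pipe:
    """A one-way message queue: many peers may send, one peer receives."""

    def __init__(self, pipe_id):
        self.pipe_id = pipe_id
        self._queue = queue.Queue()

    def send(self, message):
        self._queue.put(message)

    def receive(self, timeout=5.0):
        return self._queue.get(timeout=timeout)


chat_pipe = Pipe("urn:pipe:demo-chat")       # the receiving peer "advertises" this pipe


def receiving_peer():
    for _ in range(2):
        print("received:", chat_pipe.receive())


listener = threading.Thread(target=receiving_peer)
listener.start()
chat_pipe.send("hello from peer A")          # sending peers write into the pipe
chat_pipe.send("hello from peer B")
listener.join()
```

A simple chat program of the kind mentioned above amounts to little more than two such pipes, one in each direction, with a user interface in the application layer.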
With the unveiling of Project JXTA, not only is Sun introducing new building blocks for P2P development, but it is launching an open and decentralised peer-to-peer network. While JXTA can be used for P2P applications that operate in a closed environment such as an intranet and its success may ultimately be measured by its utility in that domain, a wider public JXTA network, the JuxtaNet, will be formed. The core JXTA protocols are the foundation for Sun's initial reference implementation, which in turn is the basis for Sun's example applications, including the Shell and InstantP2P. These applications give life to the JuxtaNet as they are run and instantiate peers that intercommunicate (Truelove 2001). XML in JXTA Undoubtedly, the first step towards providing a universal base protocol layer is to adopt a suitable representation that a majority of the platforms currently available can understand. XML is the ideal candidate for such a representation. The JXTA developers recognize that XML is fast becoming the default standard for data exchange as it provides a universal, languageindependent, and platform-independent form of data representation. XML can also be easily transformed into other encoding, hence, the XML format defines all JXTA protocols (Krishnan 2001). Although JXTA messages are defined in XML, JXTA does not depend on XML encoding. In fact, a JXTA entity does not require an XML parser; it is an optional component. XML is a convenient form of data representation used by JXTA. Smaller entities like a mobile phone might use precompiled XML messages (Krishnan 2001). JXTA holds promise as a low-level platform for P2P application development. While the technology is in its early stages, it is expected to mature over time to provide a robust, reliable framework for P2P computing. As Java is the preferred language for applications designed for heterogeneous environments, it is the natural choice for P2P applications (Krishnan 2001). Advantages and Limitations The JuxtaNet is significant in that it is an open, general-purpose P2P network. JXTA is abstracted into multiple layers, core, service and application, with the intention that multiple services will be built on the core, and that the core and services will support multiple applications. There is no constraint against the simultaneous existence on the JuxtaNet of multiple services or applications designed for a similar purpose. As an example, just as a PC's operating system can simultaneously support multiple word processors, the JuxtaNet can simultaneously support multiple file-sharing systems (Truelove 2001). In contrast to Gnutella, JXTA was designed for a multiplicity of purposes. Gnutella is one protocol, while the JXTA core consists of several protocols. Based on the experience of the GnutellaNet, there are several reasons to expect developer enthusiasm for JuxtaNet from similar quarters: versatile core protocols for peer discovery, peer group membership, pipes and peer monitoring form a rich foundation on which a wide variety of higher-level services and applications can be built. Developers, eager to develop new decentralised applications, have found that the path to build them in Gnutella involves overloading existing constructs or carefully grafting on new ones without breaking the installed 32 base. The alternative, individually developing a proprietary vertically integrated application from the P2P networking layer up to the application layer, is unattractive in many cases. 
This high friction has arguably inhibited development, but JXTA lowers it. JXTA's groups, security, pipes, advertisements and other aspects should be welcomed building blocks. In terms of attracting developers, the open nature of the JXTA protocols is an advantage to the JuxtaNet, just as the open nature of the Gnutella protocol is an advantage to GnutellaNet. The higher complexity of JXTA relative to Gnutella gives it a steeper learning curve, but could be seen as a release of an open-source reference code that educates by example. Just as with the GnutellaNet, the JuxtaNet is an open network whose applications can be expected to be interoperable at lower levels. This can potentially give developers the "instant user base" phenomenon familiar from the GnutellaNet. JXTA strives to provide a base P2P infrastructure over which other P2P applications can be built. This base consists of a set of protocols that are language independent, platform independent, and network agnostic (that is, they do not assume anything about the underlying network). These protocols address the bare necessities for building generic P2P applications. Designed to be simple with low overheads, the protocols target, to quote the JXTA vision statement, "every device with a digital heartbeat." (Krishnan 2001). JXTA currently defines six protocols, but not all JXTA peers are required to implement all six of them. The number of protocols that a peer implements depends on that peer's capabilities; conceivably, a peer could use just one protocol. Peers can also extend or replace any protocol, depending on its particular requirements (Krishnan 2001). It is important to note that JXTA protocols by themselves do not promise interoperability. Here, you can draw parallels between JXTA and TCP/IP. Though both FTP and HTTP are built over TCP/IP, you cannot use an FTP client to access web pages. The same is the case with JXTA. Just because two applications are built on top of JXTA it does not mean that they can magically interoperate. Developers must design applications to be interoperable, however, developers can use JXTA, which provides an interoperable base layer, to further reduce interoperability concerns (Krishnan 2001). 5.2.10 Case Studies The following case studies illustrate the use of the software discussed in the previous section. EduSource (http://edusource.licef.teluq.uquebec.ca/ese/en/overview.htm ) The vision of the eduSource project is focused on the creation of a network of linked and interoperable learning object repositories across Canada. The initial part of this project will be an inventory of ongoing development of the tools, systems, protocols and practices. Consequent to this initial exercise the project will look at defining the components of interoperable framework, the web services that will tie them all together and the protocols necessary to allow other institutions to enter into that framework. The eduSource project goals are to: support a true network model; use standards and protocols that are royalty-free; implement and support emerging specifications, such as CanCore; enable anybody to become a member; provide royalty-free open source infrastructure layer and services; provide a distributed architecture; facilitate an open marketplace for quality learning resources; provide multiple metadata descriptions of a given learning resource; provide a semantic web within the eduSource community; encourage open rights management. 
LimeWire (http://www.limewire.com/english/content/home.shtml) LimeWire is a file-sharing program running on the Gnutella network. It is open standard software running on an open protocol, free for the public to use. LimeWire allows you to share any file such as mp3s, jpgs, tiffs, etc. LimeWire is written in Java, and will run on Windows, Macintosh, Linux, Sun, and other computing platforms. The features of LimeWire are: easy to use - just install, run, and search; ability to search by artist, title, genre, or other meta-information; elegant multiple search tabbed interface; "swarm" downloads from multiple hosts help you get files faster; iTunes integration for Mac users; unique "ultrapeer" technology reduces bandwidth requirements for most users; integrated chat; browse host feature which even works through firewalls; added Bitzi metadata lookup; international versions: now available in many new languages; connects to the network using GWebCache, a distributed connection system; automatic local network searches for lightning-fast downloads. If you are on a corporate or university network, you can download files from other users on the same network almost instantaneously; support for MAGNET links that allow you to click on web page links that access Gnutella. LimeWire is capable of multiple searches, available in several different languages, and easy to use with cross-platform compatibility. LimeWire offers two versions of the software: LimeWire Basic, which is ad-supported and free of charge, and LimeWire PRO, which costs a small fee and contains no ads or flashing banners, includes six months of free updates, and customer support via email. Several developments are being finalised, such as the ability for companies to offload bandwidth costs by serving files on LimeWire, as well as allowing personal users to send very large files over email. LimeWire is a very fast P2P file-sharing application which enables the sharing, searching, and downloading of MP3 files. LimeWire also lets users share and search for all types of computer files, including movies, pictures, games and text documents. Other features include dynamic querying, push-proxy support for connecting through firewalls, the ability to preview files while downloading, advanced techniques for locating rare files, and an extremely intuitive user interface (http://download.com.com/3000-2166-10051892.html). LimeWire supports Gnutella's open-protocol, prejudice-free development environment. Since nobody owns the Gnutella protocol, any company or person can use it to send or respond to queries, and no entity will have a hold over the network or over the information flowing through it. This free market environment promotes competition among entities choosing to respond to the same queries. The model for Gnutella's growth and development is the world wide web: nobody owned the hypertext transfer protocol (HTTP) on which the web was based, nor did anybody own the web itself, which has allowed its growth to be so explosive, and the spectrum of its applications so broad. All computers running a program utilising the Gnutella protocol are said to be on the Gnutella network (gNet). On the world wide web, each computer is connected to only one other computer at a time. When a user visits Amazon.com, she is not at Yahoo.com. The two sites are mutually exclusive. On the Gnutella network, a user is connected to several other computers at once. Information can be received from many sources simultaneously.
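The "swarm" download feature in the feature list above, and the point just made about receiving information from many sources simultaneously, can be sketched by splitting a file into byte ranges and assigning each range to a different host. The hosts, file size and fetch function below are simulated stand-ins; a real client would issue HTTP range requests to the peers returned by a query and verify the reassembled file.

```python
# Sketch of "swarm" downloading: split a file into byte ranges and fetch each
# range from a different host in parallel. Hosts and the fetch function are
# simulated; a real client would issue HTTP Range requests and verify the result.
from concurrent.futures import ThreadPoolExecutor

FILE_SIZE = 1000                            # illustrative file length in bytes
HOSTS = ["peer-a", "peer-b", "peer-c"]      # hypothetical peers all sharing the file


def plan_ranges(size, n):
    """Divide [0, size) into n contiguous (start, end) byte ranges."""
    step = -(-size // n)                    # ceiling division
    return [(start, min(start + step, size)) for start in range(0, size, step)]


def fetch_range(host, start, end):
    # Simulated transfer: a real client would send "Range: bytes=start-(end-1)" to host.
    return b"\x00" * (end - start)          # placeholder bytes of the right length


ranges = plan_ranges(FILE_SIZE, len(HOSTS))
with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
    parts = list(pool.map(lambda job: fetch_range(job[0], *job[1]), zip(HOSTS, ranges)))

assembled = b"".join(parts)
print(len(assembled) == FILE_SIZE)          # True: the ranges reassemble to the full file
```

The benefit is the one claimed for the feature: the slowest single peer no longer limits the whole transfer, because each peer only supplies part of the file.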
LimeWire Peer Server

This product allows Gnutella to service queries beyond file sharing. Utilising XML and the Java platform, the LimeWire peer server allows various data sources to be easily integrated into the Gnutella network, and allows these data sources to be conveniently queried to satisfy customer needs. The LimeWire peer server can integrate data into the Gnutella network in a few simple steps, whether the data resides in a database, is streamed as a feed, or is accessed through ODBC extensions. Firstly, mappings between the data entity and popularly distributed XML schemas are created. Using LimeWire 1.8, Gnutella users can then meta-search the network based on these XML schemas. Through this mechanism, Gnutella becomes a dynamic information platform that allows users to specify and retrieve the information they seek. For example, a company that maintains a web site for finding rentable apartments can integrate its data store into Gnutella using the LimeWire peer server. Gnutella users can then search for apartments as necessary; a matching "meta-search" would bring up a link to the company's website. At that point, the traditional business cycle would resume. It is clear that the LimeWire peer server can make life more convenient for users and more profitable for businesses.

The LimeWire peer server can leverage any existing technology platform. The foundation of Java and XML ensures that the peer server is completely cross-platform; reliance on JDBC ensures integration of many data source variants. Moreover, the peer server supports popular web interfaces, such as CGI, JSP, and ASP. The LimeWire peer server unites two revolutionary aspects of the world wide web: a powerful search mechanism and e-commerce. The Windows, OS X, and general Unix versions of the LimeWire peer server are now available for download.

LionShare (http://lionshare.its.psu.edu/main/)

The LionShare project began as an experimental software development project at Penn State University to assist faculty with digital file management. The project has now grown into a collaborative effort between Penn State University, the Massachusetts Institute of Technology Open Knowledge Initiative, researchers at Simon Fraser University, and the Internet2 P2P Working Group. Penn State researchers identified key deficiencies in the ability of existing technologies to provide Higher Education users with the necessary tools for digital resource sharing and group collaboration. The LionShare project was initiated to meet these needs.

The LionShare P2P project is an innovative effort to facilitate legitimate file sharing among individuals and educational institutions around the world. By using peer-to-peer technology and incorporating features such as authentication, directory servers, and owner-controlled sharing of files, LionShare provides secure file-sharing capabilities for the easy exchange of image collections, video archives, large data collections, and other types of academic information. In addition to authenticated file-sharing capabilities, the developing LionShare technology will also provide users with resources for organising, storing, and retrieving digital files. Currently, many academic digital collections remain "hidden" or difficult to access on the Internet. Through the use of LionShare technology, users will be able to find and access these information reservoirs in a more timely and direct manner, employing one rather than multiple searches.
LionShare will also provide users with tools to catalogue and organise personal files for easier retrieval and enhanced sharing capabilities. The LionShare development team anticipates a beta launch in autumn 2004, with the software's final release in 2005. The software will be made available to the general public under an open source licence agreement. The initial LionShare prototype would not have been possible without the source code and assistance of the LimeWire open source project.

LionShare Architecture

LionShare is based on LimeWire and combines decentralised and centralised topologies, with both P2P and client/server elements in its architecture.

Figure 8. Client/Server Architecture

LionShare consists of the LionShare peer, a peer server, a security layer and repositories. The peer server is based on the Gnutella protocol and the security model is similar to Shibboleth.

Figure 9. Peer Server Architecture

Shibboleth Development

The Joint Information Systems Committee (JISC) aims to adopt Shibboleth technology as the principal framework for authentication and authorisation within the JISC Information Environment. In support of this, EDINA is leading a group of partners within the University of Edinburgh to advance this work under the Shibboleth Development and Support Services project (SDSS). This project is funded by the Core Middleware Technology Development Programme, a JISC initiative which supports a total of 15 projects to develop and explore this technology. SDSS will act in an enabling role for these projects by providing prototype elements of the infrastructure necessary for this national activity. These elements include: a national certificate-issuing service for SSL-based secure communication between institutional and national servers; a "where are you from" (WAYF) service required to direct users to the appropriate authentication point; development of Shibboleth-enabled target services, by adding this capability to a number of existing live services operated by EDINA; and auditing, monitoring and support for use of the initial Shibboleth infrastructure. It is intended that these prototype services will eventually be replaced by industry-strength solutions before the end of the project in 2007. In the meantime, they will provide a live test bed enabling interworking between other Core Middleware projects within a service environment.

Main Features and Benefits of LionShare

LionShare enjoys the benefits of P2P network sharing but with expanded capabilities and the elimination of typical P2P drawbacks, all within a trusted, academically oriented network. The inclusion of access control, defined user groups, persistent file sharing and the ability to search both the P2P network and centralised repositories distinguishes this software from other P2P networks.

Example Uses of LionShare

Listed below are some examples of how LionShare can be used in the HE environment for teaching, learning, research and collaborative efforts: improved peer-to-peer networking to provide an information gathering tool - all the personal file sharing capabilities of Kazaa and Gnutella plus expanded search capabilities that include specialist academic databases with only one search query (a permanent storage space is included for selected personal files to be shared even when the user is not logged on); controlled access - a seamless trust fabric could be established between institutions, so that a user logged on at one institution will be able to access the resources of all partner institutions in a secure, safe manner.
Users can specify who to share each file with: access can be granted to institutions, disciplinary areas, departments, specific classes, individuals or even custom-defined groups, which is useful when users wish to limit access due to copyright considerations; tools to help organise and use the files - many easy-to-use options are available for users to organise their own collections, which in turn makes sharing with others easier (users can easily organise and refine the descriptions of files to be shared, so that keywords, headings and collection information will identify and organise shared files, and there is built-in support for different file types and slideshows); tools for collaboration - collaborative tools can facilitate the joint efforts of user-defined groups, for whom chat rooms and bulletin boards can be created.

POOL and SPLASH (http://www.edusplash.net/)

The Portal for Online Objects in Learning (POOL) project is a consortium of several educational, private and public sector organisations formed to develop an infrastructure for learning object repositories. The POOL project ran until 2002 and aimed to build an infrastructure for connecting heterogeneous repositories into one network. The infrastructure used P2P, in which nodes could be individual repositories (called SPLASH) or community repositories (PONDs). The POOL network used JXTA and followed the CanCore/IMS metadata profile (Hatala 2003).

SPLASH is a P2P repository developed by Griff Richards and Marek Hatala at Simon Fraser University in Canada, as part of the wider POOL consortium (Kraan 2004). SPLASH is a freely available desktop program that provides storage for learning objects used or collected by individuals. SPLASH enables users to create metadata on the individual's file system or on the web. The peer-to-peer protocol is used to search for learning objects on other peers and allows for file transfer between peers (Hatala 2003).

As SPLASH is built on peer-to-peer technology, SPLASH programs can talk directly to each other over the network without the need for a server. Each group of SPLASHes (i.e. a POND) has a 'head' SPLASH which acts as a gateway to the major internet backbones (POOLs). This means that objects can still be found on PONDs, even if the machine on which a SPLASH is installed is off the network, and searching is faster. Existing, conventional repositories can be elevated to POND status either by making them talk SPLASH through the addition of an interface on the repository, or by having SPLASHes talk to a gateway that speaks the eduSource Communication Layer (ECL, a national implementation of the IMS Digital Repository Interoperability specification) to other repositories at the other end (Kraan 2004).

5.3 Distribution from a Traditional (Centralised) Archive

Traditional archiving is considered to be a central site where data are deposited. The norm is for the organisation to provide quality control and support for both users and depositors. The level of support required differs according to the type of material being archived; for example, text documents will need less support than more complex materials such as geospatial datasets. The advantages of using an archive are that the data are quality checked, preserved and disseminated by the organisation.
The limitations are that both the documents and the data must be archived at a central site, where metadata or bibliographic indexes have to be created by skilled specialists, consuming human and financial resources at the central site.

5.3.1 Copyright/IPR

Usually copyright is retained by the author, variously defined as the individual, the organisation or the institution. For example, the UK Data Archive licence agreement states that it is: "a non-exclusive licence which ensures that copyright in the original data is not transferred by this Agreement and provides other safeguards for the depositor, such as, requesting acknowledgement in any publications arising from future research using the data." If a piece of work is completed as part of employment, the employer will retain copyright in the work, but the creator will retain IPR. If, however, a person is commissioned to create a piece of work on behalf of someone else, then that person will retain copyright in the work. Copyright in a letter or other written communication, such as a fax or email, is owned by its author. Centralised archives which use registration systems such as Athens also have the advantage of being able to track who has used data and for what declared purposes, thus safeguarding IP.

5.3.2 Preservation

Centralised archives preserve materials deposited with them to ensure continued access wherever possible. Each archive has its own preservation policies.

5.3.3 Cost

Costs to users from a non-commercial community such as academia are covered, and data are therefore free at the point of use. The cost to the data creator depends on grant applications and the time spent on issues such as copyright, changing data formats where necessary and anonymising the data if required. A centralised archive can be a cost-effective way for a data creator to store and preserve their data. Concentrating specialist support in a central organisation may also be cost effective when compared to a distributed system, which might see a duplication of effort in each host institution. On the internet, however, more traffic means more cost where bandwidth usage is centralised on one server and one network; in this particular respect, a centralised archive is more expensive than a decentralised network.

5.3.4 Case Studies

The following case studies illustrate the advantages and disadvantages of centralised archiving.

The National Archives

The UK National Digital Archive of Datasets (NDAD) is operated by the University of London Computer Centre (ULCC) on behalf of the National Archives (formerly the Public Record Office (PRO)). Its aim is to conserve and, where possible, provide access to many computer datasets from central government departments and agencies which have been selected for preservation by the PRO. Information about the data and their accompanying documentation can be browsed free of charge through the NDAD website. NDAD is one of very few digital archives in the UK which not only preserve but also provide access to electronic records. It stores and catalogues its holdings according to archival practice and makes them accessible to the public via its website. Paper documentation which accompanies the electronic records and provides a context for them is digitised and made available on the site too, both as scanned images and as plain text files. This national data repository service offers fast network access to extremely large amounts of data (up to 300 terabytes).
UK Data Archive

The UK Data Archive (UKDA), established in 1967, is an internationally renowned centre of expertise in data acquisition, preservation, dissemination and promotion, and is curator of the largest collection of digital data in the social sciences and humanities in the UK. It houses datasets of interest to researchers in all sectors and from many different disciplines. It also houses the AHDS History service and the Census Registration Service (CRS). Funding comes from the Economic and Social Research Council (ESRC), the JISC and the University of Essex.

The UKDA provides resource discovery and support for secondary use of quantitative and qualitative data in research, teaching and learning. It is the lead partner of the Economic and Social Data Service (ESDS) and hosts AHDS History, which provides preservation services for other data organisations and facilitates international data exchange. The UKDA acquires data from public sector (especially central government), academic, and commercial sources within the United Kingdom. The Archive does not own data but holds and distributes them under licences signed by data owners.

Data received by the Archive go through a variety of checks to ensure their integrity. The management of data is an essential part of long-term preservation and, fundamentally, the processing will strip out any software or system dependency so that the data can be read at any time in the future. To ensure longevity, clearly understood structures and comprehensive documentation are essential. The UKDA catalogue system is well known for its comprehensiveness and its employment of a controlled vocabulary based upon the use of a thesaurus (known as HASSET, the Humanities and Social Science Electronic Thesaurus). A multilingual version of the thesaurus is currently being developed under an EU-funded project. The search fields used within the catalogue are part of a standard study description agreed in the early 1980s by an international committee of data archivists. This has subsequently been developed and adopted as the international Data Documentation Initiative (DDI).

Once data are checked, stored and catalogued, they are made available to the user community. Data are available in a number of formats (typically the main statistical packages) and on a number of media, for example CD-ROM, although the main medium used is online dissemination. In addition to the delivery process, the UK Data Archive provides a one-stop-shop support function (by phone or email) to users of the ESDS, CRS and AHDS History services. This service also extends to data depositors. In addition, a network of organisational representatives has been established and regular data workshops are arranged. In order to obtain free access to a wide range of data sources, licences are negotiated, resulting in consistent access conditions being agreed and managed centrally for the academic community. Appropriate registration and authentication processes are implemented to ensure that data are not made available outside the constraints of the licence. For these reasons an authentication system is necessary to ensure that all users are aware of the conditions under which they can access the data.

The UK Data Archive is also a member of CESSDA, the consortium of European Social Science Data Archives, each of which serves a similar role to the UKDA in its own country.
Since the late 1990s the UKDA has collaborated with its CESSDA partners to develop software systems to permit cross-national searching for data and online data browsing. The result of this programme of work is the Nesstar software architecture on which the CESSDA Integrated Data Catalogue (IDC) is built. An enhanced version of the IDC will be released during the coming year. The UKDA, therefore, is an example of a traditional archive (it serves this function for the social sciences within the UK) which has established mechanisms for managing data exchange across similar organisations. Authentication is Athens-based.

5.4 Data Distribution Systems Summary

The two main models under which a repository could operate are a centralised approach and some form of distributed model. The main characteristics of the centralised archive are: storage and distribution of data from a single location; centralised access control over the supply and re-use of data; checking, cleaning and processing of data according to standard criteria; a centralised support service, describing the contents of the data, the principles and practices governing the collection of data and other relevant properties of the data; and cataloguing of the technical and substantive properties of data for information and retrieval, together with user support following the supply of data.

The main characteristics of the typical distributed model include: data holdings distributed over various sites; data disseminated to users from each of the different sites, according to where the data are held; the various suppliers of data ideally networked in such a way that common standards and administrative procedures can be maintained, including agreements on the supply and use of data; and a single point of entry into the network for users, together with some form of integrated cataloguing and ordering service.

Until now the centralised approach has dominated the European scene, although this is beginning to be challenged. The distributed model has been used successfully in the US for many years and has more recently been operated by the Arts and Humanities Data Service (AHDS) in the UK. There are advantages to both the centralised and distributed approaches. A single institution liaising with all depositors and users brings benefits, and some economies of scale are likely with a centralised service. Centralised, possibly national, services may also be better able to promote national standards for the documentation of research data, new developments and innovation (including cross-national linkage), to promote research and teaching which values the use of secondary data, and to police usage of the data supplied. These advantages might also, however, be reproduced to varying degrees through a hub and spoke model, with the hub generating much of the benefit provided by a centralised model. Technological developments make this increasingly possible. The distributed model can allow for the development and support of pools of knowledge and expertise, and for higher volume usage in clusters related to particular datasets. Under a hub and spoke model, a centralised facility might be responsible for the most heavily used datasets. Other datasets could then be held and provided by a network of distributed centres. This would allow for specialisation within particular satellite centres. Terms and conditions relating to deposit, access and standards could then be co-ordinated centrally.
Self Archiving
Advantages: rapid, wide and free dissemination of data across the web; documents are stored electronically on a publicly accessible website; documents can be deposited in either a centralised archive or a distributed system; easy access to software; work can be published by the author at any stage; subject or organisation based and therefore specific to a given field; cheaper than centralised archiving as quality checking is done by peer review; extendible; most self archives use Dublin Core or OAI and are therefore interoperable; some self archives are duplicates of printed material, so preservation is not an issue.
Disadvantages: wide dispersal can lead to problems finding the data; copyright - the author retains pre-print but not post-print copyright, cannot self-archive if paid royalties by the publisher or if the publisher holds exclusive copyright, and copyright can be difficult to enforce; ownership can be disputed if the data creator moves organisations; rights over metadata are not clear; preservation concerns over the interoperability of the online medium and the maintenance of the archive; lack of access control; no quality control of metadata; no or little reputation, which means a lack of confidence in the method; concern over the long-term prospects of the archive; lack of licensing; cost of data creators' time and resources; relies on peer review for quality control; user support reliant on the goodwill of the publisher or data creator; requires cultural change to ensure researchers submit content to repositories.
Software: EPrints; DSpace; Kepler; CERN.

Peer-to-Peer
Advantages: all parties are equal; data are stored locally, therefore no central repository/server is needed; simpler than client/server; architectures are more robust than centralised servers as reliance on centralised servers that are potential critical points of failure is eliminated; can search both the P2P network and centralised repositories; control over a closed network where individuals can set passwords; scalable.
Disadvantages: no overall control over copyright and licences; ownership can be disputed if the data creator moves organisations; no quality control; the technical infrastructure does not offer good performance under heavy loads; use of hard drives as nodes could lead to slow local performance; must have a sufficient number of nodes to work successfully; unreliability of peers, which may be switched off; no central support system; difficult to enforce rules, such as copyright; the network favours client rather than server usage, creating bandwidth problems; relies on metadata for data retrieval; difficult to manage as there is no central control; user support reliant on the goodwill of the publisher or data creator; cost of data creators' time and resources; concern over the long-term prospects of the archive.
Software: JXTA; Gnutella; FreeNet; LimeWire; LionShare.

Centralised Archiving
Advantages: both documents and metadata are held and managed by the archive, so preservation and copyright are not an issue as they are negotiated and managed centrally; quality control of metadata and data; user support available; resource discovery supported and provided centrally; the reputation of the organisation promotes trust and confidence in data creators; required by grant award providers; IPR remains with the data creator; existence of the data is published by the archive; administrative duties are carried out by the archive; use of data can be monitored; access can be managed centrally, e.g. using Athens; can support a large number of data users; can create its own search facilities.
Disadvantages: holding both documents and metadata at the archive requires specialised staff skills, and therefore imposes human and financial constraints; dependency on a single network may lead to bandwidth issues, especially with large volumes of data, although the archive could hold metadata only.
Software: archive specific.

Table 5. Comparison of the Three Data Distribution Systems

6.0 Conclusions

Whichever method for data distribution is adopted, there are several issues which need to be addressed, and these are discussed below.

User and Depositor Support

Centralised archives would provide the best user and depositor support as resources are focused in one location. These types of archive are often specialised in an academic discipline and can therefore provide the most efficient service. There are examples of traditional archives which work together as one organisation while still providing specialist support at the individual level. One example of this is the ESDS (Economic and Social Data Service), where specialist skills remain at the UK Data Archive, ISER, MIMAS and the Cathie Marsh Centre for Census and Survey Research. Self-archiving and P2P networks do not provide user and depositor support. Instead they rely on in-house help at the data holder's organisation, such as a library support system based at a university.

Metadata Standards

The Go-Geo! HE/FE metadata application profile, which was based on FGDC and updated to include ISO 19115 in 2004, should be considered as a template for the distribution or storage of metadata records for Go-Geo! If a simple standard is used, such as Dublin Core, which has only a few elements, then not all of the mandatory elements will be displayed on the Go-Geo! portal and information about the datasets will be lost. From the case studies discussed in this report it seems that, at present, self-archiving and centralised archiving mainly rely on Dublin Core as it promotes interoperability and preservation. However, Dublin Core is too simplistic for geospatial metadata records. In contrast, P2P can be more flexible, although difficulties lie in the lack of control and enforcement of standards and protocols. There is some evidence that centralised archives will incorporate geospatial elements into the standards they use; for example, the DDI has already begun on this path by adopting the FGDC bounding rectangle and polygon elements. However, it is unlikely that such archives would retrospectively create records using the Go-Geo! HE/FE metadata application profile.

Ease of Use

All of the data distribution systems described in this report have their own advantages and disadvantages when it comes to ease of use. The centralised archive has the support to assist data creators in the submission of metadata.
Self-archiving and P2P archiving are more flexible, as the data creator can deposit when they wish and update the datasets as they like. This point is particularly relevant to P2P which, in theory, should be available all of the time.

Quality Control

Distributed networks, such as self-archives and P2P networks, do not have any central control over data, which means that there is a lack of quality control protocols. Self-archiving has tried to solve this problem by using the peer review system. This system assumes that suitable peers can be found in the first place and that they are willing to undertake this task long-term to ensure continuity in quality standards.

Copyright and Licensing

For distributed archives, copyright and licensing must lie with the data creator's host institution, as this is where the data are stored. At a more traditional archive, licences are drawn up and signed by depositors and also by users, who must sign an agreement as to the intended use of the data. This point is very important when the usage of data is restricted to one community, such as academia. Licences are also a means of mediating IP where the data producer has used material for which they do not hold copyright. This issue probably affects the majority of geospatial datasets in the UK, as materials such as Ordnance Survey data are incorporated into the geospatial data. Without centralised control over licences, issues over copyright will occur.

Cost

The cost of self-archiving and P2P lies mainly with the data creator, as it is they who will spend the time setting up the system (in the case of P2P) and submitting data (in the case of self-archiving). The cost of centralised archiving lies with the archive, which must pay for overheads, staff expertise and time, resources, etc. The cost to the user will in most cases be nothing, unless the data are used commercially.

Security

A distributed system provides greater security against denial of service attacks and network crashes, as the system does not rely on one data source or network. A centralised archive could provide greater data security and integrity, as access is controlled and usage of the data is monitored.

Preservation

The traditional way to preserve material would be to deposit data in a traditional archive. Although this is still the most effective and reassuring way of preserving data, there are other issues to consider. Firstly, as self-archiving deals with copies of prints, preservation of electronic copies may not be an issue. If the data creator is based at a university, they may be able to rely on the university library for data preservation, depending on in-house policies. Another option would be to use another service, such as that demonstrated in the Thesis Alive project. The preservation of metadata records is an issue which needs further consideration.

Solutions

One solution could be to combine the advantages of using centralised archiving with those of a distributed system. The creation of a distributed archive could be subject based and would involve using several archives, such as the ADS and the UKDA, to hold the datasets and metadata which they would normally hold (archaeology and social science respectively). In this way, the advantages of using a well established, reputable archive which offers quality control and user support can be combined with the advantage of not holding data in one place, and therefore not depending on a single network. This ensures that, whilst only good quality metadata are published, the system will not be slowed down.
It also means that data creators can store their data in the place where they would normally do so, and there is therefore less likely to be duplication of data. Creating subject repositories of this kind would also allow for the creation of subject-based user support systems, where each repository would be able to provide specialist user support. One major problem with this solution is that the various archives already use their own metadata standards and are unlikely to want to hold two different standards or to convert their standard into a geospatial one. Some confusion may also arise, as data creators may not always know which subject repository their dataset is most suited to.

Another solution would be to set up a completely distributed system where repositories are set up at the individual organisations which have created the data. For example, AHDS Archaeology holds archaeological data and AHDS History holds historical data. This type of system would rely on the institutions providing quality control and user support, which could be achieved through current archive or library procedures. It may also lead to duplication of data and inconsistencies in style, for example in metadata standards, which again would probably not be compliant with the Go-Geo! application profile.

Another solution would be to hold the datasets and associated metadata at an established archive, such as the UKDA. This is likely to be a welcome scenario, as provision would be made for user and depositor support, preservation, licensing and data security. Again, the issue of differences between metadata standards creates difficulties with this option. More investigation is needed into this issue, though, as further changes may be made to standards such as the DDI in the future which could make them compliant enough with the Go-Geo! application profile for this to be a feasible option.

7.0 Requirements and Recommendations for Data Distribution for the Go-Geo! Portal

The following requirements were made by data creators and depositors for a data distribution system to provide: access to data online; copyright agreements and data integrity; user and depositor support; preservation and archiving policies; a system which is both easy and fast to use; acceptance of supporting material for storage; and a service which is free to both depositors and users. There is also a need for the system to provide: large bandwidth; a secure and stable network; and storage facilities.

From the findings of this report, and with the benefit of feedback received from the metadata creation initiative, it is clear that the portal technology has two potential functions: as a national service providing access to geospatially referenced data; and as a tool for institutional management of resources which are likely to be restricted to local use. For the portal to function most effectively, and in line with user expectations as a national resource, it seems that the only practical solution would be to consider the creation of a national, centralised geospatial repository which would provide geospatial data users and creators with support tailored to be compliant with the FGDC and ISO 19115 standards. Data access could be centrally controlled and usage of data monitored. Copyright, licensing and preservation could all be dealt with in a single repository to provide an efficient data distribution and storage system for the Go-Geo! portal.
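To make the metadata point concrete, a centrally supported repository of this kind could apply simple completeness checks before accepting records, for example rejecting submissions whose spatial extent is missing. The sketch below is a hypothetical illustration only: the element names follow the FGDC bounding-coordinate convention (westbc, eastbc, northbc, southbc), but the flat record structure, class name and checks are assumptions for illustration, not part of any Go-Geo! specification.

    import java.util.Map;

    // Hypothetical completeness check for a harvested metadata record, represented
    // here as a flat map of element names to values. Only the FGDC bounding
    // coordinate names (westbc, eastbc, southbc, northbc) come from the standard;
    // everything else is illustrative.
    public class BoundingBoxCheck {

        private static final String[] REQUIRED = {"westbc", "eastbc", "southbc", "northbc"};

        public static boolean hasValidBoundingBox(Map<String, String> record) {
            for (String element : REQUIRED) {
                String value = record.get(element);
                if (value == null || value.isEmpty()) {
                    return false;                 // mandatory geospatial element missing
                }
            }
            try {
                double west = Double.parseDouble(record.get("westbc"));
                double east = Double.parseDouble(record.get("eastbc"));
                double south = Double.parseDouble(record.get("southbc"));
                double north = Double.parseDouble(record.get("northbc"));
                // Longitudes must lie in [-180, 180], latitudes in [-90, 90],
                // and the southern limit must not exceed the northern limit.
                return west >= -180 && west <= 180
                    && east >= -180 && east <= 180
                    && south >= -90 && south <= 90
                    && north >= -90 && north <= 90
                    && south <= north;
            } catch (NumberFormatException e) {
                return false;                     // coordinates present but not numeric
            }
        }

        public static void main(String[] args) {
            Map<String, String> record = Map.of(
                "title", "Example land cover survey",
                "westbc", "-7.6", "eastbc", "1.8",
                "southbc", "49.9", "northbc", "60.9");
            System.out.println("Accept record: " + hasValidBoundingBox(record));
        }
    }

A record carrying only a free-text Dublin Core coverage element would fail such a check, which illustrates why this report treats Dublin Core alone as too simplistic for geospatial metadata records.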
If, on the other hand, the portal is to be used as an institutional tool, the need for a central place of deposit becomes, in theory, less pressing, as issues such as IP, authorisation and user support become less important. Consequently, for institutional use, self-archiving may be the solution. However, it needs to be clearly understood that the two functions do not naturally lead from one to the other. The attraction of institutional repositories seems to be that information is managed, made available and remains within the control of the institution. Whilst there is general agreement that data should be shared nationally, there is a clear lack of commitment to putting local resources into removing the barriers that would make this happen. It is difficult to see how an automated system can resolve issues of IP, user support and access control without human intervention in the form of a decision-making body.

Our only recommendation, therefore, is that the enormous potential of the portal should be recognised. The evaluation work has identified two important needs for the management of geospatially referenced data: one is the need for a system for institutional management; the other is the need for an enhanced national service for geospatial resources with an emphasis on access to data. The portal has the potential to serve both these needs and we would like to see further funding to assess the feasibility and costs of fulfilling both these needs using Go-Geo! technology.

Bibliography

AfterDawn.com. (May 2003). P2P Networks Cost too Much for ISP. Available at: http://www.afterdawn.com/news/archive/4106.cfm.

Andrew, T. (2003). Trends in Self-Posting of Research Material Online by Academic Staff. Ariadne, Issue 37. Available at: http://www.ariadne.ac.uk/issue37/andrew/.

Breidenbach, Susan. (July 2001). Peer-to-Peer Potential. Network World. Available at: http://www.nwfusion.com/research/2001/0730feat.html.

Creedon, Eoin et al. (2003). GNUtella. Available at: http://ntrg.cs.tcd.ie/undergrad/4ba2.0203/p5.html.

Fleishman, Glenn. (May 2003). It Doesn't Pay to be Popular. Available at: http://www.openp2p.com/pub/a/p2p/2003/05/30/file_sharing.html.

Harnad, Stevan. Distributed Interoperable Research Archives for Both Papers and Their Data: An Electronic Infrastructure for all Users of Scientific Research. Available at: http://www.ecs.soton.ac.uk/~harnad/Temp/data-archiving.htm.

Harnad, Stevan. (2001). The Self-Archiving Initiative. Nature 410: 1024-1025. Available at: http://www.ecs.soton.ac.uk/~harnad/Tp/naturenew.htm.

Harnad, Stevan. (April 2000). The Invisible Hand of Peer Review. Exploit Interactive, Issue 5. Available at: http://www.exploit-lib.org/issue5/peer-review/.

Hatala, Marek et al. (2003). The EduSource Communication Language: Implementing Open Network for Learning Repositories and Services. Available at: http://www.sfu.ca/~mhatala/pubs/sap04-edusource-submit.pdf.

Hatala, Marek and Richards, Griff. (May 2004). Networking Learning Object Repositories: Building the eduSource Communications Layer. Oxford.

ITsecurity.com. (5 March 2004). Workplace P2P File Sharing Makes Mockery Of Internet Usage Policies Says Blue Coat. Available at: http://www.itsecurity.com/tecsnews/mar2004/mar59.htm.

Kraan, Wilbert. (March 2004). Splashing in Ponds and Pools. Available at: http://www.cetis.ac.uk/content2/20040317152845.

Krikorian, Raffi. (April 2001). Hello JXTA! Available at: http://www.onjava.com/pub/a/onjava/2001/04/25/jxta.html.

Krishnan, Navaneeth. (October 19, 2001). The JXTA Solution to P2P: Sun's New Network Computing Platform Establishes a Base Infrastructure for Peer-to-Peer Application Development. Available at: http://www.javaworld.com/javaworld/jw-10-2001/jw-1019-jxta.html.
Maly, K. et al. (2003). Kepler Proposal and Design Document. Available at: http://kepler.cs.odu.edu:8080/kepler/publications/finaldes.doc.

Maly, K. et al. (2001). Kepler - An OAI Data/Service Provider for the Individual. D-Lib Magazine 7(4). Available at: http://www.dlib.org/dlib/april01/maly/04maly.html.

Maly, K. et al. (2002). Enhanced Kepler Framework for Self-Archiving. ICPP-02, pp. 455-461, Vancouver. Available at: http://kepler.cs.odu.edu:8080/kepler/publications/kepler.pdf.

McGuire, David. (March 31, 2004). Lawmakers Push Prison For Online Pirates. washingtonpost.com. Available at: http://www.washingtonpost.com/wp-dyn/articles/A401452004Mar31.html.

Nejdl, Wolfgang et al. (May 2002). EDUTELLA: A P2P Networking Infrastructure Based on RDF. Available at: http://www2002.org/CDROM/refereed/597/index.html.

Olivier, Bill and Liber, Oleg. (December 2001). Lifelong Learning: The Need for Portable Personal Learning Environments and Supporting Interoperability Standards. The JISC Centre for Educational Technology Interoperability Standards, Bolton Institute.

Pinfield, Stephen. (March 2003). Open Archives and UK Institutions. D-Lib Magazine, Volume 9, Number 3. ISSN 1082-9873. Available at: http://www.dlib.org/dlib/march03/pinfield/03pinfield.html.

Pinfield, Stephen and James, Hamish. (September 2003). The Digital Preservation of e-Prints. D-Lib Magazine, Volume 9, Number 9. ISSN 1082-9873. Available at: http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/september03/pinfield/09pinfield.html.

Pradhan, Anup and Medyckyj-Scott, David. (July 2001). EDINA Requirements Specification and Solution Strategy.

Seybold, P.A. (2002). Web Services Guide for Customer-Centric Executives. Patricia Seybold Group, Inc., Boston.

Sinrod, Eric J. (August 2002). E-Legal: A Bill to Combat P2P Copyright Infringement. Law.com. Available at: http://www.duanemorris.com/articles/article927.html.

Simon Fraser University. (May 4 2004). LionShare. JISC Meeting, Oxford.

Smith, M. et al. (2003). DSpace: An Open Dynamic Digital Repository. D-Lib Magazine, Volume 9, Number 1. ISSN 1082-9873. Available at: http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/january03/smith/01smith.html.

Somers, Andrew. Is Sharing Music Over A Network Any Different Than Playing It For A Friend? Available at: http://civilliberty.about.com/library/content/blP2Prights.htm.

Truelove, Kelly. (April 2001). The JuxtaNet. O'Reilly Media, Inc. Available at: http://www.openp2p.com/lpt/a/799.

Watson, Ronan. (2003). Freenet. Available at: http://ntrg.cs.tcd.ie/undergrad/4ba2.0203/p7.html.

Wilson, Scott. (September 2001). The Next Wave: CETIS Interviews Mikael Nilsson about the Edutella Project. Available at: http://www.cetis.ac.uk/content/20010927163232.

Wolf, Donna. (Feb 22 2004). Peer-to-Peer. Available from: http://searchnetworking.techtarget.com/sDefinition/0..sid7 gci21769.00.html.

Wood, Jo (). A Peer-to-Peer Architecture for Collaborative Spatial Data Exchange. SFU Surrey.
Glossary

Access Control: Technology that selectively permits or prohibits certain types of data access.

API: Application Programming Interface.

Application Profile: A schema consisting of data elements.

Architecture: The structure or structures of a computer system or software. This structure includes software components, the externally visible properties of those components, the relationships among them and the constraints on their use. It eventually encompasses protocols, either custom-made or standard, for a given purpose or set of purposes.

Authentication: Verifying a user's claimed identity.

Bandwidth: The information-carrying capacity of a communication channel.

BOAI: The Budapest Open Access Initiative - a worldwide coordinated movement to make full-text online access to all peer-reviewed research free for all. Important note: BOAI and OAI are not the same thing, but they do have similar goals.

Creator: The person or organisation primarily responsible for creating the intellectual content of the resource; for example, authors in the case of written documents, or artists, photographers and illustrators in the case of visual resources.

Document Delivery: The supply, for retention, of a document (journal article, book chapter, etc.) to a third party by means of copying, in compliance with all copyright regulations, and delivering it to the requester (by hand, post or electronically).

EDD: Electronic document delivery - the supply, for retention, of a document (journal article, book chapter, etc.) to a third party by means of scanning or, where permitted by publishers, from online versions, in compliance with all copyright regulations, and delivering it to the requester by electronic transfer.

Eprint: An electronically published research paper (or other literary item); also, free software for producing an archive of eprints.

Eprint Archive: An online archive of preprints and postprints, possibly, but not necessarily, running eprints software.

FTP: File Transfer Protocol - an internet application protocol.

Harvest: To retrieve metadata from a digital repository; conversely, to take delivery of metadata from a digital repository.

HTTP: Hyper Text Transfer Protocol.

Infrastructure: The underlying mechanism or framework of a system.

Interoperability: The ability of hardware or software components to work together effectively.

Learning Object (LO): Any resource or asset that can be used to support learning. A resource typically becomes thought of as a Learning Object when it is assigned Learning Object Metadata, is discoverable through a digital repository, and can be displayed using an eLearning application.

Node: See 'Peer'.

Open Access: Something anyone can read or view.

OAI: The Open Archives Initiative, which develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.

OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting - a way for an archive to share its metadata with harvesters, which can then offer searches across the data of many OAI-compliant archives.

OAI Compliant: An archive which has correctly implemented the OAI protocol.

Peer: Two computers are considered peers if they are communicating with each other and playing similar roles. For example, a desktop computer in an office might communicate with the office's mail server; however, they are not peers, since the server is playing the role of server and the desktop computer is playing the role of client. Gnutella's peer-to-peer model uses no servers, so the network is composed entirely of peers. Computers connected to the Gnutella network are also referred to as "nodes" or "hosts".

P2P Network: Internet users communicating with each other through P2P (peer-to-peer) file sharing software programs that allow a group of computer users to share text, audio and video files stored on each other's computers.

Postprint: The digital text of an article that has been peer-reviewed and accepted for publication by a journal. This includes the author's own final, revised, accepted digital draft, the publisher's edited, marked-up version (possibly in PDF) and any subsequent revised, corrected updates of the peer-reviewed final draft.

Preprint: The digital text of a paper that has not yet been peer-reviewed and accepted for publication by a journal.

Preprint Archive: An EPrint Archive which only contains preprints.

RDF: Resource Description Framework (http://www.w3.org/RDF/) - a foundation for processing metadata which provides interoperability between applications that exchange machine-understandable information on the web. RDF emphasises facilities to enable automated processing of web resources.

Reprint: A paper copy of a peer-reviewed, published article, usually printed by the publisher and given to or purchased by the author for distribution.

Servent: A node with both client and server capabilities.

XML: Extensible Markup Language - a uniform method for describing and exchanging structured data that is independent of applications or vendors.

Z39.50: Refers to the International Standard ISO/IEC 23950, "Information Retrieval (Z39.50): Application Service Definition and Protocol Specification", and to ANSI/NISO/IEC Z39.50. The standard specifies a client/server-based protocol for searching and retrieving information from remote databases.

Appendix A – Data Distribution Survey

The Go-Geo! project is a collaborative effort between EDINA (University of Edinburgh) and the UK Data Archive (University of Essex) to provide a geo-spatial data portal for the UK Higher and Further Education community. This web service will provide a tool for discovering, exchanging, and accessing spatial data and related resources. The Go-Geo! project is currently investigating different ways by which data could be shared with others through a distribution system. The aim is that in the future, a data distribution mechanism will be set up for the Go-Geo! portal. Feedback gained from this questionnaire will go towards the development of such a data distribution system. All answers given during this survey will be confidential.

1.0 General

1.1 Which of the following would you choose as a mechanism for data distribution? Explanations of the mechanisms are given in the note below.
centralised archive and distribution service
centralised repository based on a single self archiving service
distributed service based on a number of self archives located around the country
distributed service with each participating institution having its own self archive
distributed service based upon peer-2-peer (P2P) technology
none of the above
other (please describe)

1.1.1 What factors would affect your choice?
ease of use
cost
need to ensure long-term availability (preservation)
copyright issues
data security
user support
depositor support
other (please specify)

Note: an archival dissemination model follows the system employed by the UK Data Archive, whereby academic researchers submit material to a dedicated repository. The materials are quality checked, preserved and disseminated by the organisation.
The archive implements acquisitions, IP and licensing and quality assurance policies, and provides user and depositor support facilities. Self-archiving allows for the deposition of a digital document in a publicly accessible website and the free distribution of data across the web; a model for GI data would be analogous to the systems currently employed for scientific literature, but customised for spatial data. Peer-to-peer is a communications method where all parties are equal; users within a distributed network share GI resources through freely available client software that allows the search and transfer of remote datasets.

1.2 How would you like to see data provided?
CD-ROM
DVD
on-line
other (please specify)

1.3 Should users be able to contact depositors for support or should the data and related materials be distributed 'as is' with no further assistance?
contact depositors directly
'as is': no further assistance

1.3.1 Please provide comments below.

2.0 Depositing Data

2.1 How big a concern is each of the following to you when it comes to deciding whether to either deposit data for use by others or use data provided by others? Please rank your answers from 1-7, with 1 being the most important, separately for depositing data and for using data:
IPR
Confidentiality
Liability
Data quality/provenance
Need for specialist support
Data security
Protecting the integrity of the data creator

2.1.1 Please provide reasons for the concerns.

2.2 Assuming there are no licensing problems, should available data be:
original datasets
limited to value-added datasets
new datasets unavailable elsewhere within the HE community
new datasets unavailable elsewhere within the GI community
limited to a single version of a dataset
limited to a single format of a dataset
limited to the most recent edition of a dataset

2.3 Should the data be stored in standard and commonly used formats such as ESRI shape files, CSV or GML?
yes
no

2.3.1 If yes, please specify which formats you would like.

2.4 In terms of submission of data, what factors are likely to be the most important, e.g. speed, ease of use?

2.5 Should anyone be permitted to put data into the system?
yes
no

2.5.1 If no, then to whom should it be restricted?
individuals within academia
only to registered users within academia
only to gatekeepers, e.g. data librarians or service providers
your organisation, e.g. university, department, institution
anyone
other (please specify)

2.6 Would you wish to be able to track who has taken the data?
yes
no

2.7 Given that guidance notes, quality statements, metadata etc. take time and resources to produce, should depositing of a dataset only be allowed where:
a complete set of supporting information is provided
minimal supporting information is provided
no supporting information is provided

2.7.1 If you selected answer 2.7b, which of the following would you consider to be minimal? Please tick all that apply:
discovery metadata
guidance notes/help files
code books
data dictionary/schema/data model
information on methodology/processing undertaken on data
reference to research papers based on the data
data and metadata quality statements
statement of IPR
statement of copyright & terms of use
other (please specify)

2.7.2 If you selected answer 2.7c, who do you expect would provide support and how would you expect to monitor the quality of the dataset?

2.8 How important are data quality and accurate metadata?
very important
fairly important
neither important nor unimportant
not very important
not important at all

3.0 Long-term Storage and Preservation

3.1 Once deposited, should the data persist indefinitely?
yes
no

3.1.1 If no, should the data have a lifespan that is dependent on:
usage
perceived utility to other users
other (please specify)

3.2 How important is it that standards for spatial data, data transfer and data documentation are implemented?
very important
fairly important
not very important
not important at all

3.3 Who should be responsible for any conversion or other work required to meet data transfer and preservation standards?

3.4 Who should be responsible for the active curation of the dataset, e.g. data creator/service provider?

3.5 Who should take responsibility for maintaining a dataset generated by a research team when that team disbands?

4.0 Licensing and Copyright

4.1 Is copyright of your data:
your own
held jointly (e.g. if you use Ordnance Survey data made available under a separate agreement)
held by others (please specify who)

4.1.1 Assuming you are the sole copyright owner in any dataset, would you wish to make them freely available for:
academics only
any user (e.g. academic, commercial or other)
use within your organisation (e.g. department or university)
none of these

4.1.2 Assuming you do not have sole copyright (e.g. you have used material from Ordnance Survey), would you:
want to personally manage the additional licensing requirements
ask your department to manage licences
ask your central university administration to manage the licences
prefer to have it managed by a central specialist organisation
other (please specify)

4.3 Do you think that each use of a dataset should incur a royalty charge for the data creator/s?
yes
no

4.3.1 If yes, please suggest a reasonable charge or a formula for charging (for example, a tiered charge for academic, general public and commercial use).

4.4 Do you think there should be a charge by the service providers to cover service operation costs? (This is separate from royalty charges.)
yes, for users of data
yes, for depositors of data
no

5.0 Sustainability

5.1 In your view, who or which organisation should be responsible for promoting, facilitating and funding data sharing? Please indicate, for each of the following, whether they should be responsible for promoting data sharing, facilitating data sharing, and/or funding a service which facilitates data sharing:
Funders of data creation awards
The creators themselves
The institution within which the creator works
National repositories (where they exist)
The JISC
Other (please state who)

5.2 Should funders cover researchers' costs of preparing the data for sharing?
yes
no
maybe (please elaborate)

5.2.1 If not, why not and who should fund it?

6.0 Additional Comments

Please complete the section below if you wish to be entered into a prize draw to win a £30 Amazon voucher.
Name
Organisation
Position
Email

Thank you for taking part. Your views are important to us.

Please return this questionnaire to:
Julie Missen
Projects Assistant
UK Data Archive, University of Essex
Colchester, CO4 3SQ
Email: jmissen@essex.ac.uk
Tel: 01206 872269