Digital Challenges – Bridging the gap between publication and data
Adam Farquhar
Head of Digital Library Technology
The British Library
IASSIST, Tampere, 27 May 2009

The British Library: ‘This is the life blood of research and innovation’
Science and Innovation Investment Framework 2004-2014, H.M. Treasury (2004)
Information infrastructure
2.23 The growing UK research base must have ready and efficient access to information of all kinds – such as experimental data sets, journals, theses, conference proceedings and patents. This is the life blood of research and innovation.
• The largest document supply service in the world. Secure e-delivery and ‘just in time’ digitisation enable desktop delivery within 2 hours
• National library of the UK. Serves researchers, business, libraries, education & the general public
• Collection includes over 2m sound recordings, 5m reports, theses and conference papers, the world’s largest patents collection (c.50m)
• Generates value to the UK economy each year of 4.4 times public funding
• Collection fills over 600km of shelving and grows at 11km per year
• 30 TB of digital material growing rapidly
• GIA Funding 08/09: £94.8m operational, £12m capital
• Other funding secured 07/08: c.£33m
• Business and IP Centre: Providing inspiration, and enabling protection of creative capital and business development
• Helping people advance knowledge to enrich lives
• 2 main sites in London and Yorkshire.
• Circa 2,000 staff

Supporting research
Science, Technology & Medicine
• Document Supply service provides 1.4m articles/year primarily to scientists
• Renewed engagement with researchers using digital content and online services
• In-depth focus on biomedicine and energy/environment
• Collection includes journals, patents, theses and more, and is updated by some 9,000 articles every day
Social Sciences
• A significant international collection of books, journals, reports, theses, official publications and other materials
• A unique collection of grey literature, of special interest to practitioners and theoreticians
• Research collaboration with ESRC
Arts & Humanities
• Greatest research collection of its kind in the world
• World-class curatorial expertise by subject, medium and geographical area
• BL has been developing world-leading e-innovations for the past decade (e.g. International Dunhuang Project) and building a significant corpus of digitised texts
• Research collaboration with AHRC, British Academy and HEIs

Building the Digital Research Infrastructure
BL Digital library system
• Large scale, highly resilient digital store
• Continuous validation & correction
• Long term digital storage for BL content & eLegal deposit/distribution
Long term access (digital preservation)
• Leading EU-funded digital preservation project ‘Planets’ (16 partners)
• Developing cost models and case studies with UCL (‘Life’ projects)
• Addressing root causes of digital obsolescence
[Map labels: Cambridge Univ., Edinburgh, Boston Spa, Aberystwyth, Oxford Univ., St. Pancras]

Digital Library Live
Content Streams
• Sound Archives
• Voluntary Digital Donations
• Nineteenth Century Digitised Books
• Born Digital Newspapers
Storage
• >440,000 Digital Items
• >30 Terabytes of Content
Coming soon
• eJournals
• Digitised Newspapers

Role of the British Library in Science, Technology and Medicine
• Long history of collecting scientific and technical literature
• Serves business & industry, researchers, academics and students
• Dedicated reading rooms in London
• The Library operates the world’s largest document delivery service: millions of items each year to customers all over the world, predominantly in the STM disciplines
• Indexing the UK input into Medline/PubMed
• Creation of AMED (Allied and Complementary Medicine A&I Database): research articles on complementary medicine and allied health
• Lead Partner in UK PubMed Central

WorldWideScience.org
• Global science gateway based on US Department of Energy’s Science.Gov service
• Multilateral partnership to enable federated searching of national and international scientific databases and portals.
• Launched in 2008
• Large number of countries already providing access to publicly funded research outputs; the latest addition is China
• Chaired by British Library

UK PubMed Central
• Launched in January 2007
• Number of articles: 1.4 million
• Over 2,500 manuscripts submitted by grant holders
• Information held on 20,000 research grants awarded to 9,000 PIs by UKPMC Funders
• Downloads have grown strongly with over 300,000 in March 2009
• UKPMC users are predominantly UK based (70%) but service is accessed across the world
• Working with the Bioscience community and Funders to develop the service based on UK research community needs

Research Information Centre – the research lifecycle
• Supports full research life-cycle
• Accessible by web browser
• Configured for biosciences but flexible
• Designed for collaboration
• Based on Microsoft’s SharePoint product
• Developed with Microsoft External Research Team
• DOI:10.1109/ADVCOMP.2007.14
• Beta tested by 25 bioscience research teams (academia & commercial) in UK & US

Social Science Collection and Research
©Clive Sherlock
• New team established in 2006
• Priorities: define and develop the collection, improve accessibility, raise awareness, build networks, build capacity
• Strong focus on researcher needs
• Develop strategies for grey literature and data access
• Build the collection of government publications
  • Recent and historic print collections with LSE and Oxford Soc Science Library, …
  • Digital and web collections with TNA and UK e-OP ‘digital continuity’
  • Managing Access to Government Information Collaboratively (MAGIC) with LSE

Social Science Collection and Research
• Research collaborations
  • Voices of the UK; Children’s play in the media age
• Knowledge exchange, awareness and capacity building
  • Corporate and Social Responsibility seminars
  • Multi-modal PhD seminars
  • ESRC Festival of Social Science
  • ESRC Interns
  • Postgraduate training days, thematic study days, ESDS seminars
  • Public events - Census 2011 to explain the role of quantitative and qualitative social surveys

Books and data – a parable
• A scientist measured environmental conditions to determine their impact on leather bindings
• When the project was complete, he printed the data, bound it, and submitted it to UK copyright libraries
• Thirty years later, a scientist took it off the shelf and started to reuse the data, and collect anew
• When his project was complete, he had 30,000 images and megabytes of data
  • Too big for any shelf
  • Not interesting for a data centre
• Is the project web site enough?

Journals and data – a problem
• In 2003, Legal Deposit legislation in the UK was extended to cover digital material
  • Building on the 1911 Legal Deposit Act
• Electronic journal articles are covered – they will be collected and archived for the long term
• … But supplementary material is not covered
• For now, it remains on the publisher web sites

Long-term access is critical
According to a PARSE.Insight survey
• 50% needed research data gathered by other researchers that was not available
Within High Energy Physics
• More than 90% think that data preservation is important - crucial
• Benefits include
  • Verify scientific results independently (60%)
  • Combine past and future data (60%)
  • Re-analyze in the light of new theories and future results (75%)
• 45% - old data could have improved their scientific results
• 40% - important HEP data have been lost in the past.
Many are willing to share
• 80% would provide data behind tables and figures
• 45% would provide “raw” data
• But 50% believe costs to repackage for sharing are high

Widening gap
• A widening gap in the scientific record between published research and the data that underlies it
  • Published work held by libraries
  • Datasets held by data centres
• No effective way to link between datasets and articles
• No widely used method to identify datasets
• No widely used method to cite datasets
• As a result, datasets are
  • Difficult to discover
  • Difficult to access
  • Second-class citizens in the scientific record

Datasets in the scholarly record (OECD White Paper)
• 45% of journal publishers provide access to datasets associated with journal articles they publish (ALPSP)
• But there are no rules about how to publish, present, cite, or otherwise catalogue datasets
Example data descriptions shown on the slide:
• Tertiary school enrollment: School enrollment, tertiary (% of gross). Source: Barro and Lee (2000)
• Main mortality estimate: Estimated settler mortality. Settler mortality is calculated from the mortality rates of European-born soldiers, sailors, and bishops when stationed in colonies. It measures the effects of local diseases on people without inherited or acquired immunities. Source: Acemoglu et al. (2001), based on Curtin (1989) and other sources.
[Slide callouts: “Citation”; “Citation and their databases”]

Datasets – first class citizens?
Datasets | Published articles
Data is difficult to manage after project funding ceases | Libraries ensure long-term storage and management
Informal networks provide the primary means of sharing | Established funded services provide the primary means of access
Only 21% use a national or international facility | Nearly all published articles are held in multiple national libraries
Datasets are not included in impact analysis | Articles and citations form the backbone of impact analysis
Good luck finding it (your discipline may vary)! | Catalogues and full-text search support discovery
Source: UKRDS Study

Global responses to the challenge
• Research council mandates
  • Data management plans
  • Data retention plans
• Funded initiatives
  • Australian National Data Service
  • UK Research Data Service
  • UK Digital Curation Centre
  • US DataNet programme
  • JISC Data programme
  • EU Science Data Infrastructure, …
• STM publishers
  • Brussels Declaration: Raw research data should be made freely available to all researchers

A key component for many goals
[Diagram: “Persistent Identification ?” at the centre, linked to the goals Cite, Make Visible, Find, Reuse, Access, Verify, Track Impact]

Dataset citation using Digital Object Identifiers (DOIs)
• The DOI system offers an easy way to connect the article with the underlying data
• Several organisations have started to assign DOIs to datasets
  • IUCR, ICPSR, OECD through CrossRef
  • PANGAEA, Mare, and others through TIB (German Science Library)
Dataset: G. Yancheva, N. R. Nowaczyk et al (2007) Rock magnetism and X-ray flourescence spectrometry analyses on sediment cores of the Lake Huguang Maar, Southeast China, PANGAEA doi:10.1594/PANGAEA.587840
Article: G. Yancheva, N. R. Nowaczyk et al (2007) Influence of the intertropical convergence zone on the East Asian monsoon, Nature 445, 74-77 doi:10.1038/nature05431

It looks so easy
Organisational challenges
• Data centres, funders have regional or disciplinary scope
• Universities have teaching and research mission and competitive relationships
• Publishers do not cover unpublished material
• Consortia of the above require large and fragile coalitions
• We need a consortium of national institutions with a long-term stewardship role
Social challenges
• Acceptance by key stakeholders including funders, data centres, universities, researchers, publishers
• Use by data creators and authors
Technical challenges
• Robust infrastructure
• Identifying the right thing
• Ensuring longevity

DataCite
Organisations with the national science library role are working together to establish a European and global infrastructure to support researchers by providing methods for them to locate, identify, and cite research datasets with confidence
Publishing agents (data centres, research institutes) are responsible for:
• Quality assurance
• Content storage and access
• Creating the identifier
• Creating and updating metadata
The DataCite registration agency
• Maintains the resolution infrastructure
• Maintains a searchable database of metadata
• Manages the identifiers over the long term
• Establishes and shares best practice

Memorandum of Understanding
Paris, March 2, 2009
Recognizing the importance of research datasets as the foundation of knowledge and sharing a common commitment to promote and establish persistent access to such datasets, we, the signed parties, hereby express our interest to work together to promote global access to research data. Our long term vision is to support researchers by providing methods for them to locate, identify, and cite research datasets with confidence.
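
Illustrative sketch (not part of the original slides): as a rough picture of what citing a dataset "just like an article" involves, the Python below formats the PANGAEA dataset reference from the citation slide and derives a resolvable link from its DOI. The class and field names are hypothetical, and the use of the public proxy at https://doi.org/ is an assumption about which resolver a reader would use.

```python
from dataclasses import dataclass
from urllib.parse import quote

# Assumption: links are built against the public DOI proxy.
DOI_RESOLVER = "https://doi.org/"


@dataclass(frozen=True)
class DatasetCitation:
    """Hypothetical container for the parts of a dataset citation."""
    authors: str
    year: int
    title: str
    publisher: str
    doi: str

    def resolvable_link(self) -> str:
        # Percent-encode anything unsafe in a URL path, but keep the "/"
        # that separates the DOI prefix from the suffix.
        return DOI_RESOLVER + quote(self.doi, safe="/")

    def as_text(self) -> str:
        # Same layout as the dataset reference on the citation slide.
        return (f"{self.authors} ({self.year}) {self.title}, "
                f"{self.publisher} doi:{self.doi}")


dataset = DatasetCitation(
    authors="G. Yancheva, N. R. Nowaczyk et al",
    year=2007,
    title=("Rock magnetism and X-ray flourescence spectrometry analyses on "
           "sediment cores of the Lake Huguang Maar, Southeast China"),
    publisher="PANGAEA",
    doi="10.1594/PANGAEA.587840",
)

print(dataset.as_text())
print(dataset.resolvable_link())
```

The DOI stays the same in both the printed citation and the link, so the reference keeps working even if the data centre later moves the dataset's landing page.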
Initial Signatories
• Technische Informationsbibliothek (TIB), Germany
• Library of the ETH Zürich, Switzerland
• L’Institut de l’Information Scientifique et Technique (INIST), France
• Library of TU Delft, The Netherlands
• Technical Information Center of Denmark
• The British Library

Key facts about DOI
Usage
• >35m DOIs have been assigned
• >2m resolutions each month
Organizational
• Not-for-profit International DOI Foundation (IDF)
• Provides social infrastructure
• Includes registration agencies
• Registration done in cooperation with a publication agent
• Publication agents are responsible for the content
Technical
• A DOI Name is a persistent identifier used to cite and link resources
• Linked to an object – not to a location
• The location may change, but the DOI remains the same
• The DOI System holds metadata about objects including their URL
• Resolution redirects the user from a DOI name to the URL

Strengths and weaknesses of DOI
DOIs have some strong advantages
• Accepted by researchers and scientists
• Mature infrastructure
• Put datasets on the same playing field as articles
But perceived as
• Expensive
  • The current IDF business model favours larger registration agencies
• Publisher oriented
  • The largest registration agency is the publisher-oriented CrossRef

DataCite Structure
[Diagram: the International DOI Foundation and the global Handle System sit above DataCite; DataCite works with national institutions, each of which works with … several data centres]

Typical workflow (Data Centre)
• Data Centre registers with DataCite
• Data Centre ingests a dataset and assigns an identifier
• Data Centre registers the dataset by submitting an XML file containing relevant bibliographic metadata and the URL for the dataset’s access page
• Metadata drawn from ISO 690-2 for referencing electronic information
  • author
  • title
  • size
  • edition
  • language
  • publisher
  • publishing date
  • publishing place

Typical workflow (2)
Author
• Includes citation using the DOI, just like an article
Reader
• Follows the resolvable link that includes the DOI (or searches for it), just like an article
• Reaches a unique landing page at the Data Centre for the dataset
  • Open to every reader
  • Includes the DOI and metadata to help the reader decide if the dataset will help
• May need to take additional steps to access the dataset

Research Data in Articles

Thanks!
• The British Library has a duty of care for the scientific record
  • Renewed engagement in STM and Social Sciences
  • Actively partnering to achieve goals
• There is a widening gap between published research and the data that underlies it
• DataCite will support researchers by enabling them to locate, identify, and cite research datasets with confidence
• This is the start of a long and open dialogue
  • There are many open issues to address
  • We welcome your comments, questions, and ideas!
Email: adam.farquhar {@} bl.uk
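
Illustrative sketch (not part of the original slides): the data centre workflow described earlier registers a dataset by submitting an XML file with bibliographic metadata and the URL of the dataset's access page. The Python below shows one way such a record might be assembled. The element names, the `build_record` helper, the DOI prefix 10.1234 and the example.org URL are all hypothetical; this is not the DataCite schema, only the eight ISO 690-2 fields listed on the workflow slide wrapped in a minimal XML document.

```python
import xml.etree.ElementTree as ET

# The eight bibliographic fields named on the workflow slide.
# The underscore spellings are an assumption made for valid XML tag names.
FIELDS = (
    "author", "title", "size", "edition", "language",
    "publisher", "publishing_date", "publishing_place",
)


def build_record(doi: str, landing_page_url: str, metadata: dict) -> str:
    """Return a hypothetical registration record as an XML string."""
    missing = [name for name in FIELDS if name not in metadata]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))

    root = ET.Element("dataset")
    ET.SubElement(root, "doi").text = doi
    # The URL points at the landing page, not at the data files themselves.
    ET.SubElement(root, "url").text = landing_page_url
    for name in FIELDS:
        ET.SubElement(root, name).text = str(metadata[name])
    return ET.tostring(root, encoding="unicode")


# Placeholder values only; none of these describe a real dataset.
record = build_record(
    doi="10.1234/example.dataset",
    landing_page_url="https://datacentre.example.org/datasets/1",
    metadata={
        "author": "Example Author",
        "title": "Example dataset title",
        "size": "1 file",
        "edition": "1",
        "language": "en",
        "publisher": "Example Data Centre",
        "publishing_date": "2009",
        "publishing_place": "Example City",
    },
)
print(record)
```

Rejecting incomplete metadata before building the record mirrors the slide's division of labour, in which the data centre is responsible for creating and updating metadata and the registration agency only stores and resolves it.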