Digital Challenges – Bridging the gap between
publication and data
Adam Farquhar
Head of Digital Library Technology
The British Library
IASSIST, Tampere, 27 May 2009
The British Library:
‘This is the life blood of research and innovation’
Science and Innovation Investment Framework 2004-2014, H.M. Treasury (2004)
Information infrastructure
2.23 The growing UK research base must have ready and efficient access to information of all kinds – such as
experimental data sets, journals, theses, conference proceedings and patents. This is the life blood of research
and innovation.
The largest document supply
service in the world. Secure
e-delivery and ‘just in time’
digitisation enable desktop
delivery within 2 hours
National library of the UK.
Serves researchers, business,
libraries, education & the general
public
Collection includes over 2m
sound recordings, 5m reports, theses
and conference papers, the world’s
largest patents collection (c.50m)
Generates value to the UK
economy each year of 4.4 times
public funding
Collection fills over 600km of
shelving and grows at 11km per year
30 TB of digital material, growing
rapidly
GIA Funding 08/09:
£94.8m operational,
£12m capital
Other funding secured 07/08:
c.£33m
Business and IP Centre:
Providing inspiration, and enabling
protection of creative capital and
business development
Helping people
advance knowledge to
enrich lives
2 main sites in London and
Yorkshire. Circa 2,000 staff
Supporting research

Science, Technology & Medicine
• Document Supply service provides 1.4m articles/year, primarily to scientists
• Renewed engagement with researchers using digital content and online services
• In-depth focus on biomedicine and energy/environment
• Collection includes journals, patents, theses and more, and is updated by some 9,000 articles every day

Social Sciences
• A significant international collection of books, journals, reports, theses, official publications and other materials
• A unique collection of grey literature, of special interest to practitioners and theoreticians
• Research collaboration with ESRC

Arts & Humanities
• Greatest research collection of its kind in the world
• World-class curatorial expertise by subject, medium and geographical area
• BL has been developing world-leading e-innovations for the past decade (e.g. International Dunhuang Project) and building a significant corpus of digitised texts
• Research collaboration with AHRC, British Academy and HEIs
Building the Digital Research Infrastructure
BL Digital library system
• Large scale, highly resilient digital store
• Continuous validation & correction
• Long term digital storage for BL content & eLegal deposit/distribution
• Long term access (digital preservation)
• Leading EU-funded digital preservation project ‘Planets’ (16 partners)
• Developing cost models and case studies with UCL (‘Life’ projects)
• Addressing root causes of digital obsolescence
[Map labels: Cambridge Univ., Edinburgh -2009, Boston Spa, Aberystwyth, Oxford Univ., St. Pancras]
Digital Library



Live Content Streams
• Sound Archives
• Voluntary Digital Donations
• Nineteenth Century Digitised Books
• Born Digital Newspapers
Storage
• >440,000 Digital Items
• >30 Terabytes of Content
Coming soon
• eJournals
• Digitised Newspapers
Role of the British Library in Science, Technology
and Medicine

Long history of collecting scientific and technical
literature

Serves business & industry, researchers, academics
and students

Dedicated reading rooms in London

The Library operates the world’s largest document
delivery service - millions of items each year to
customers all over the world predominantly in the
STM disciplines

Indexing the UK input into Medline/PubMed

Creation of AMED (Allied and Complementary
Medicine A&I Database), which indexes research articles on
complementary medicine and allied health

Lead Partner in UK PubMed Central
WorldWideScience.org





Global science gateway based on US
Department of Energy’s Science.Gov
service
Multilateral partnership to enable
federated searching of national and
international scientific databases and
portals.
Launched in 2008
Large number of countries already
providing access to publicly funded
research outputs - latest addition is
China
Chaired by British Library
UK PubMed Central
Launched in January 2007






Number of articles: 1.4 million
Over 2,500 manuscripts submitted by grant holders
Information held on 20,000 research grants awarded to 9,000
PIs by UKPMC Funders
Downloads have grown strongly with over 300,000 in March
2009
UKPMC users are predominantly UK based (70%) but service
is accessed across the world
Working with the Bioscience community and Funders to
develop the service based on UK research community needs
Research Information Centre – the research lifecycle

Supports full research life-cycle
• Accessible by web browser
• Configured for biosciences but flexible
• Designed for collaboration
Based on Microsoft’s SharePoint product
Developed with Microsoft External Research Team
DOI:10.1109/ADVCOMP.2007.14
• Beta tested by 25 bioscience research teams (academia & commercial) in UK & US


Social Science Collection
and Research
©Clive Sherlock
New team established in 2006
• Priorities: define and develop the collection, improve accessibility, raise awareness, build networks, build capacity
• Strong focus on researcher needs
Develop strategies for grey literature and data access
Build the collection of government publications
• Recent and historic print collections with LSE and Oxford Soc Science Library, …
• Digital and web collections with TNA and UK e-OP ‘digital continuity’
• Managing Access to Government Information Collaboratively (MAGIC) with LSE
Social Science Collection
and Research
©Clive Sherlock
Research collaborations
• Voices of the UK; Children’s play in the media age
Knowledge exchange, awareness and capacity building
• Corporate and Social Responsibility seminars
• Multi-modal PhD seminars
• ESRC Festival of Social Science
• ESRC Interns
• Postgraduate training days, thematic study days, ESDS seminars
• Public events - Census 2011 to explain the role of quantitative and qualitative social surveys
Books and data – a parable
A scientist measured
environmental conditions to
determine their impact on
leather bindings
When the project was complete,
he printed the data, bound it,
and submitted it to UK copyright
libraries
Thirty years later, a scientist took
it off the shelf and started to
reuse the data, and collect anew
When his project was complete,
he had 30,000 images and
megabytes of data
Too big for any shelf
Not interesting for a data centre
Is the project web site enough?
Journals and data – a problem
In 2003, Legal Deposit legislation in the UK was extended to
cover digital material
• Building on the legal deposit provisions of the Copyright Act 1911
Electronic journal articles are covered – they will be collected
and archived for the long term
… But supplementary material is not covered
• For now, it remains on the publishers’ web sites
Long-term access is critical
According to a PARSE.Insight survey
• 50% needed research data gathered by other researchers that was not available
Within High Energy Physics
• More than 90% think that data preservation is important - crucial
• Benefits include
  • Verify scientific results independently (60%)
  • Combine past and future data (60%)
  • Re-analyze in the light of new theories and future results (75%)
• 45% - old data could have improved their scientific results
• 40% - important HEP data have been lost in the past
• Many are willing to share
  • 80% would provide data behind tables and figures
  • 45% would provide “raw” data
• But 50% believe costs to repackage for sharing are high
Widening gap
A widening gap in the scientific record between published research and the data that underlies it
• Published work held by libraries
• Datasets held by data centres
• No effective way to link between datasets and articles
• No widely used method to identify datasets
• No widely used method to cite datasets
As a result, datasets are
• Difficult to discover
• Difficult to access
• Second-class citizens in the scientific record
Datasets in the scholarly record (OECD White Paper)

45% of journal publishers provide access to datasets associated with journal articles they publish (ALPSP)
• But there are no rules about how to publish, present, cite, or otherwise catalogue datasets
Citation
Tertiary school enrollment: School enrollment, tertiary (% of gross). Source: Barro and Lee (2000) and their databases
Citation
Main mortality estimate: Estimated settler mortality. Settler mortality is calculated from the mortality rates of European-born soldiers, sailors, and bishops when stationed in colonies. It measures the effects of local diseases on people without inherited or acquired immunities. Source: Acemoglu et al. (2001), based on Curtin (1989) and other sources.
Datasets – first class citizens?
Datasets
• Data is difficult to manage after project funding ceases
• Informal networks provide the primary means of sharing
• Only 21% use a national or international facility
• Datasets are not included in impact analysis
• Good luck finding it (your discipline may vary)!
Published articles
• Libraries ensure long-term storage and management
• Established funded services provide the primary means of access
• Nearly all published articles are held in multiple national libraries
• Articles and citations form the backbone of impact analysis
• Catalogues and full-text search support discovery
Source: UKRDS Study
Global responses to the challenge
Research council mandates
• Data management plans
• Data retention plans
Funded initiatives
• Australian National Data Service
• UK Research Data Service
• UK Digital Curation Centre
• US DataNet programme
• JISC Data programme
• EU Science Data Infrastructure, …
STM publishers
• Brussels Declaration: Raw research data should be made freely available to all researchers
A key component for many goals
[Diagram: “Persistent Identification” (marked with a “?”) at the centre, surrounded by the goals it supports: Cite, Make Visible, Find, Reuse, Access, Verify, Track Impact]
Dataset citation using Digital Object Identifiers (DOIs)
The DOI system offers an easy way to connect the article with the underlying data
Several organisations have started to assign DOIs to datasets
• IUCr, ICPSR, OECD through CrossRef
• PANGAEA, Mare, and others through TIB (German Science Library)
Dataset
G. Yancheva, N. R. Nowaczyk et al (2007)
Rock magnetism and X-ray fluorescence spectrometry analyses on sediment cores of the Lake Huguang Maar, Southeast China, PANGAEA
doi:10.1594/PANGAEA.587840
Article
G. Yancheva, N. R. Nowaczyk et al (2007)
Influence of the intertropical convergence zone on the East Asian monsoon
Nature 445, 74-77
doi:10.1038/nature05431
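The pairing of article and dataset above can be sketched in a few lines of Python. This is only an illustration: the `Citation` class and its `link` and `text` methods are hypothetical names, and the citation layout is one possible choice rather than a prescribed style. The only external fact assumed is that a DOI becomes a resolvable link when appended to the public DOI proxy at `https://doi.org/`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """A citable object (hypothetical structure, for illustration only)."""
    authors: str
    year: int
    title: str
    source: str
    doi: str

    def link(self, resolver: str = "https://doi.org/") -> str:
        # A DOI turns into a resolvable link by prefixing the proxy address.
        return resolver + self.doi

    def text(self) -> str:
        # One possible citation layout; not a mandated citation style.
        return (f"{self.authors} ({self.year}) {self.title}. "
                f"{self.source}. doi:{self.doi}")


# The dataset and the article from the slide are cited in exactly the same way.
dataset = Citation(
    authors="G. Yancheva, N. R. Nowaczyk et al",
    year=2007,
    title=("Rock magnetism and X-ray fluorescence spectrometry analyses on "
           "sediment cores of the Lake Huguang Maar, Southeast China"),
    source="PANGAEA",
    doi="10.1594/PANGAEA.587840",
)
article = Citation(
    authors="G. Yancheva, N. R. Nowaczyk et al",
    year=2007,
    title=("Influence of the intertropical convergence zone on the "
           "East Asian monsoon"),
    source="Nature 445, 74-77",
    doi="10.1038/nature05431",
)

for item in (dataset, article):
    print(item.text())
    print(item.link())
```

Because both objects go through the same code path, the dataset is cited and linked on the same footing as the article.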
It looks so easy
Organisational challenges
• Data centres, funders have regional or disciplinary scope
• Universities have teaching and research mission and competitive relationships
• Publishers do not cover unpublished material
• Consortia of the above require large and fragile coalitions
We need a consortium of national institutions with a long-term stewardship role
Social challenges
• Acceptance by key stakeholders including funders, data centres, universities, researchers, publishers
• Use by data creators and authors
Technical challenges
• Robust infrastructure
• Identifying the right thing
• Ensuring longevity
DataCite
Organisations with the national science library role are working together to
establish a European and global infrastructure to support researchers by
providing methods for them to locate, identify, and cite research datasets
with confidence
Publishing agents (data centres, research institutes) are responsible for:
• Quality assurance
• Content storage and access
• Creating the identifier
• Creating and updating metadata
The DataCite registration agency
• Maintains the resolution infrastructure
• Maintains a searchable database of metadata
• Manages the identifiers over the long term
• Establishes and shares best practice
Memorandum of Understanding
Paris, March 2, 2009
Recognizing the importance of research datasets as the
foundation of knowledge and sharing a common commitment
to promote and establish persistent access to such datasets,
we, the signed parties, hereby express our interest to work
together to promote global access to research data.
Our long term vision is to support researchers by
providing methods for them to locate, identify, and cite
research datasets with confidence.
Initial Signatories






Technische Informationsbibliothek (TIB), Germany
Library of the ETH Zürich, Switzerland
L’Institut de l’Information Scientifique et Technique (INIST), France
Library of TU Delft, The Netherlands
Technical Information Center of Denmark
The British Library
Key facts about DOI
Usage
• >35m DOIs have been assigned
• >2m resolutions each month
Organizational
• Not-for-profit International DOI Foundation (IDF)
• Provides social infrastructure
• Includes registration agencies
• Registration done in cooperation with a publication agent
• Publication agents are responsible for the content
Technical
• A DOI Name is a persistent identifier used to cite and link resources
• Linked to an object – not to a location
• The location may change, but the DOI remains the same
• The DOI System holds metadata about objects including their URL
• Resolution redirects the user from a DOI name to the URL
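The technical points above (an identifier tied to the object, a location that may change, resolution as a redirect) can be illustrated with a toy in-memory model. This is a sketch only and not how the DOI System or the Handle System is implemented; `ToyResolver`, its methods, the DOI `10.5555/toy.42` and the `example.org` addresses are all made up for the illustration.

```python
class ToyResolver:
    """Toy stand-in for a resolution service: maps a DOI name to metadata
    that includes the object's current URL (illustration only)."""

    def __init__(self):
        self._records = {}

    def register(self, doi, url, **metadata):
        # A DOI name is assigned once and never reused for another object.
        if doi in self._records:
            raise ValueError(f"{doi} is already registered")
        self._records[doi] = {"url": url, **metadata}

    def update_url(self, doi, new_url):
        # The object moved: only the stored location changes, not the DOI.
        self._records[doi]["url"] = new_url

    def resolve(self, doi):
        # Resolution takes the user from the DOI name to the current URL.
        return self._records[doi]["url"]


resolver = ToyResolver()
doi = "10.5555/toy.42"  # made-up DOI name
resolver.register(doi, "https://archive.example.org/datasets/42",
                  title="Example dataset")

before = resolver.resolve(doi)
resolver.update_url(doi, "https://data.example.org/new/datasets/42")
after = resolver.resolve(doi)

print(before)
print(after)
```

A citation that uses the DOI keeps working after the move, because the citing text never contained the location in the first place.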
Strengths and weaknesses of DOI
DOIs have some strong advantages
• Accepted by researchers and scientists
• Mature infrastructure
• Put datasets on the same playing field as articles
But perceived as
• Expensive
  • The current IDF business model favours larger registration agencies
• Publisher oriented
  • The largest registration agency is the publisher-oriented CrossRef
DataCite Structure
[Diagram: boxes for the International DOI Foundation, the Global Handle System, DataCite, several National Institutions and several Data Centres, with connections labelled “Works with”]
Typical workflow (Data Centre)

• Data Centre registers with DataCite
• Data Centre ingests a dataset and assigns an identifier
• Data Centre registers the dataset by submitting an XML file containing relevant bibliographic metadata and the URL for the dataset’s access page
• Metadata drawn from ISO 690-2 for referencing electronic information
  • author
  • title
  • size
  • edition
  • language
  • publisher
  • publishing date
  • publishing place
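The registration step can be sketched as building a small XML record from the fields listed above. The slides do not give the actual schema, so the element names (`dataset`, `identifier`, `url`, `publishing_date` and so on), the `build_record` helper, the DOI `10.5555/example-dataset` and all metadata values are placeholders invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Metadata fields named on the slide; element names are assumptions.
FIELDS = ("author", "title", "size", "edition", "language",
          "publisher", "publishing_date", "publishing_place")


def build_record(identifier, url, metadata):
    """Return an XML string holding the identifier, the URL of the
    dataset's access page, and whichever listed fields are supplied."""
    root = ET.Element("dataset")
    ET.SubElement(root, "identifier").text = identifier
    ET.SubElement(root, "url").text = url
    for field in FIELDS:
        if field in metadata:
            ET.SubElement(root, field).text = str(metadata[field])
    return ET.tostring(root, encoding="unicode")


# Placeholder values only; none of this describes a real dataset.
record_xml = build_record(
    identifier="10.5555/example-dataset",
    url="https://data.example.org/datasets/example",
    metadata={
        "author": "A. Researcher",
        "title": "Example dataset",
        "size": "30,000 images",
        "edition": "1",
        "language": "en",
        "publisher": "Example Data Centre",
        "publishing_date": "2009",
        "publishing_place": "London",
    },
)
print(record_xml)
```

The record carries the URL of the access page rather than the data itself, which matches the division of labour described earlier: the data centre stores the content, and the registration agency holds the identifier and metadata.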
Typical workflow (2)
Author
• Includes citation using the DOI, just like an article
Reader
• Follows the resolvable link that includes the DOI (or searches for it), just like an article
• Reaches a unique landing page at the Data Centre for the dataset
  • Open to every reader
  • Includes the DOI and metadata to help the reader decide if the dataset will help
• May need to take additional steps to access the dataset
Research Data in Articles
Thanks!
The British Library has a duty of care for the scientific record
• Renewed engagement in STM and Social Sciences
• Actively partnering to achieve goals
There is a widening gap between published research and the data that underlies it
DataCite will support researchers by enabling them to locate, identify, and cite research datasets with confidence
• This is the start of a long and open dialogue
• There are many open issues to address
We welcome your comments, questions, and ideas!
Email: adam.farquhar {@} bl.uk