Digital Curation for the Big Data Sciences
Zhang Zhixiong (张智雄)
National Science Library, Chinese Academy of Sciences
Outline

The rise of Digital Curation
What is Digital Curation?
How do Digital Curation and Preservation differ?
Challenges, problems, and responses for Digital Curation in big data science
Concluding remarks
1. The Rise of Digital Curation

Data Deluge
1. The Rise of Digital Curation

From Data Deluge to Data Curation

Philip Lord, Alison Macdonald, Liz Lyon, David Giaretta
The Digital Archiving Consultancy Limited and the Digital Curation Centre
1. The Rise of Digital Curation

Founding of the Digital Curation Centre (DCC)

With support from the e-Science Core programme, the DCC was established on 1 March 2004
Headquartered at the National e-Science Centre in Edinburgh
  University of Edinburgh (lead; Informatics, Law, Information Services and research institutes)
  University of Glasgow (HATII and Information Services)
  UKOLN, University of Bath
  Council for the Central Laboratory of the Research Councils (CCLRC)
1. The Rise of Digital Curation

Conferences and journals

International Digital Curation Conference, Bath, Sep. 29-30, 2005
  8th International Digital Curation Conference, Amsterdam, 14-17 January 2013
DigCCurr 2007, DigCCurr 2009, DigCCurr 2013
  An International Symposium on Digital Curation (April 18-20, 2007)
  Digital Curation Practice, Promise and Prospects (April 1-3, 2009)
  Chapel Hill, North Carolina, United States
  Public Symposium, 2010-2013
International Journal of Digital Curation
  Published since 2006
  http://www.ijdc.net/
1. The Rise of Digital Curation

Institutions named after Curation

The Greek Digital Curation Unit (DCU) at the Athena Research Centre (2007)
UC3, University of California Curation Center (2010)
The Digital Research and Curation Center at The Johns Hopkins University’s Sheridan Libraries
The University of Toronto’s iSchool established The Digital Curation Institute (2010)
Purdue University Library’s Distributed Data Curation Center (D2C2) (2009)
......
1. The Rise of Digital Curation

Education and training related to Curation

DigCCurr I (2006-09), DigCCurr II (2008-13)
  School of Information and Library Science (SILS), University of North Carolina at Chapel Hill; NARA
  Preserving Access to Our Digital Future: Building an International Digital Curation Curriculum
  Extending an International Digital Curation Curriculum to Doctoral Students and Practitioners
International Data curation Education Action (IDEA) Working Group
  Developing an International Curation and Preservation Training and Education Roadmap
"Education for Digital Stewardship: Librarians, Archivists or Curators?"
Masters Programme in Digital Curation, Luleå University of Technology
IFLA, 2011, “Education for Digital Curation”
Board on Research Data and Information
1. The Rise of Digital Curation

Related tools and technologies

Data Asset Framework (DAF)
  enumerating and auditing data holdings
DRAMBORA
  self-assessment of possible risk
Digital Preservation Suite
  preservation plans
DROID
  identifies file formats
TRAC
  Trustworthy Repositories Audit & Certification, Criteria and Checklist
......
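
To make "identifies file formats" concrete, here is a minimal sketch of signature-based format identification. It is not DROID's implementation or API: the signature table is a tiny hand-picked sample of well-known leading byte sequences, and the names `SIGNATURES`, `identify_format` and `identify_file` are hypothetical.

```python
# Hypothetical sketch: identify a file format from its leading bytes.
# The table below is an illustrative sample, not a real signature registry.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_format(header: bytes) -> str:
    """Return the label of the first signature that matches, else 'unknown'."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first 16 bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))
```

Matching on content rather than on the file extension is the point of the technique: a renamed or extension-less file is still recognised.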
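2. What Is Digital Curation?

First, a word on Digital Preservation

Digital is a double-edged sword

Advantages
  Convenient and easy to use, replicable, easy to transmit, portable in large quantities...
Problems
  Fragility
    deletion, theft, modification, distortion...
  Dependency
    technology, systems, standards, software, context (metadata), organization, economics...
  Rapid obsolescence
    media, hardware, software, formats...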
2. What Is Digital Curation?

Digital Preservation

1 May 1996: digital preservation became a major concern
  Preserving Digital Information: Report of the Task Force on Archiving of Digital Information
    Commission on Preservation and Access
    Research Libraries Group, Inc. (RLG)
Goal:
  “continued access indefinitely into the future of records stored in digital electronic form.”
http://www.clir.org/pubs/reports/pub63/reports/pub63watersgarrett.pdf
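2. What Is Digital Curation?

By the early 21st century, digital preservation (DP) had become an important field within digital libraries

Main research topics
  preservation strategies and methods, preservation metadata, storage architectures, preservation repositories, preservation workflows, Web archiving, preservation information models
Main standards and specifications:
  Open Archival Information System (OAIS, 2002)
Main digital preservation systems and services
  e-Depot DIAS, NDIIPP, LOCKSS, Portico, CDL DPR, FCLA DAITSS......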
2. What Is Digital Curation?

Why did Digital Curation emerge at all?

Digital preservation already had two accepted terms
  Digital Preservation
  Digital Archiving
Why propose Digital Curation as well?
What is Digital Curation? How do its ideas and methods differ from those of Digital Preservation?
2. What Is Digital Curation?

Digital Curation: a newly coined term

Digital Data Curation Task Force
  Report of the Task Force Strategy Discussion Day
    Tuesday, 26th November 2002, Centre Point, London WC1; January 2003
e-Science Curation Report
  Data curation for e-Science in the UK: an audit to establish requirements for future curation and provision, 2003
  JCSR (the Joint Information Systems Committee’s
2. What Is Digital Curation?

Digital Data Curation Task Force

Convened by Tony Hey, then chair of JCSR
Goal: to clarify and build a curation strategy for UK primary research data
Meeting date: 26 November 2002
The application of the term “curation” is new, and in several ways the meeting found itself grappling with questions of scope, with frequent overlap with questions relating to digital preservation.
It did not reach a definition of the term.
2. What Is Digital Curation?

Digital Data Curation Task Force

What is curation?

Dr John Taylor, Director General of the Research Councils
Tony Hey: distinguish the actions involved in caring for digital data beyond its original use from digital preservation.
Seamus Ross: “curation in the museum sense” covers three core concepts: conservation, preservation and access
Alison Allden: “curation” implies an active management of information, involving planning. Reuse of data is a core issue. If data is to be reused, then it needs special treatment
Rolf Apweiler: curation is when people add value to
2. What Is Digital Curation?

e-Science Curation Report

“curation” derives from “curator”
  somebody who keeps something for the public good, whose value often needs to be brought out by the curator.
Two important features
  more support for explicit policies with regard to data sharing
  digital curator is store-keeper, but he should take an active role in promoting and adding value to his holdings
2. What Is Digital Curation?

e-Science Curation Report

Previously
  “curation” is commonly used to refer to the work done on genomic and proteomic databases, annotating and managing annotations
Now
  It covers a wider context than just archiving; it embraces the care of the record within scientific context and environment
2. What Is Digital Curation?

e-Science Curation Report

Working definitions

Curation: The activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Higher levels of curation will also involve maintaining links with annotation and with other published materials
Archiving: A curation activity which ensures that data is properly selected, stored, can be accessed and that its logical and physical integrity is maintained over time, including security and authenticity
Preservation: An activity within archiving in which specific items of data are maintained over time so that
2. What Is Digital Curation?

e-Science Curation Report

That the objective of digital curation of primary research data is
  to keep data which is valuable, potentially valuable or which is required to be kept;
  and in such a way that it is accessible and usable by others (while observing relevant restrictions), that its value is maintained and, where possible, enhanced;
  and that this activity and service should
2. What Is Digital Curation?

Definition in the JISC circular

JISC circular 6/03 (Revised), July 2003
The term ‘digital curation’ is increasingly being used for the actions needed to maintain and utilise digital data and research results over their entire life-cycle for current and future generations of users.
2. What Is Digital Curation?

DCC definition 1

DCC Approach to Digital Curation, 15 Aug 2004
  curation: general term - taking care of things
  data curation: looking after and adding value to data
  digital curation: looking after and somehow "adding value" to digital data. This probably implies creating some new data from the existing, in order to make the latter more useful and "fit for purpose".

2. What Is Digital Curation?

DCC definition 2

DCC Charter and Statement of Principles
  What is digital curation?
  Digital curation is maintaining and adding value to a trusted body of digital research data for current and future use; it encompasses the active management of data throughout the research lifecycle.
http://www.dcc.ac.uk/about-us/dcc-charter/dcc-charter-and-statement-principles
2. What Is Digital Curation?

DCC definition 3

Digital curation involves maintaining, preserving and adding value to digital research data throughout its lifecycle.
The active management of research data reduces threats to their long-term research value and mitigates the risk of digital obsolescence.
Meanwhile, curated data in trusted digital repositories may be shared among the wider UK research community.
As well as reducing duplication of effort in research data creation, curation enhances the long-term value of existing data by making it available for further high quality research
http://www.dcc.ac.uk/digital-curation/what-digital-curation
2. What Is Digital Curation?

DCC definition 4

DCC Briefing Papers
  Digital curation is the management and preservation of digital data over the long-term.
  All activities involved in managing data from planning its creation, best practice in digitisation and documentation, and ensuring its availability and suitability for discovery and re-use in the future are part of digital curation.
  Digital curation can also include managing vast data sets for daily use, for example ensuring that they can be searched and continue to be readable.
  Digital curation is therefore applicable to a large range of professional situations from the beginning of the information life-cycle to the end; digitisers, metadata creators, funders, policy-makers, and repository managers to name a few examples
http://www.dcc.ac.uk/resources/briefing-papers/introduction-curation
3. How Do Curation and Preservation Differ?

JISC comparison of Preservation and Curation

JISC Digital Preservation briefing paper

Digital preservation
  actions and interventions ensure continued and reliable access to authentic digital objects for as long as they are deemed to be of value.
Digital curation
  maintaining and adding value to a trusted body of digital information for future and current use;
  active management and appraisal of data over the entire life cycle.
  builds upon the underlying concepts of digital preservation
  emphasising opportunities for added value and knowledge through annotation and continuing resource management.
http://sitecore.jisc.ac.uk/publications/briefingpapers/2006/pub_digipreservationbp.aspx

3. How Do Curation and Preservation Differ?

ARL comparison of the two

New Roles for New Times: Digital Curation for Preservation, March 2011
  Digital curation refers to the actions people take to maintain and add value to digital information over its lifecycle, including the processes used when creating digital content.
  Digital preservation focuses on the “series of managed activities necessary to ensure continued access to digital materials for as long as necessary.”
  intersection of these actions, digital
3. How Do Curation and Preservation Differ?

Comparison in "Digital Curation: The Emergence of a New Discipline"

digital preservation efforts originally focussed on ensuring that material survived technical obsolescence and organisational mismanagement. Preservation implied a passive state, where material would be mothballed in an inaccessible “dark archive”, with only a few authorised users, to ensure that it retained its integrity and authenticity

ensuring that digital material is managed throughout its lifecycle so that it remains accessible to those who need to use it. Metadata is used to both improve accessibility and discoverability; and to control authentication procedures, creating audit trails to ensure that material cannot be accessed or altered by those not authorised to do so. Digital material is actively preserved, used and reused for new
3. How Do Curation and Preservation Differ?

The problems they address

Preservation
  Addresses technological obsolescence and organizational failure
Curation
  From Data Deluge to Data Curation: data volumes and the complexity of the data itself
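3. How Do Curation and Preservation Differ?

The purpose of the activity

Preservation
  Aims at the survival of the data
  Ensures the integrity, trustworthiness, and authenticity of the data
Curation
  Aims at making the data usable for research
  Manages the data and adds value to it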
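3. How Do Curation and Preservation Differ?

The goals achieved

Preservation
  Makes data accessible, understandable, and usable
Curation
  Manages the entire data lifecycle, including the creation of data and new data generated from existing data, so that data can be used and regenerated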
3. How Do Curation and Preservation Differ?

Whom do they serve?

Preservation
  Use by future generations
Curation
  Use both now and in the future
3. How Do Curation and Preservation Differ?

Activity model

Preservation
  OAIS reference model
Curation
  DCC Curation Lifecycle Model
3. How Do Curation and Preservation Differ?

OAIS reference model

6 functional activities, 3 types of information package, 3 roles
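
The slide gives only the counts. As a reading aid, the sketch below spells them out using the names commonly cited from the OAIS reference model (ISO 14721). The names are supplied from general knowledge of the standard rather than from the slide, and the Python identifiers, including the `ROLE_PACKAGE` mapping, are hypothetical.

```python
from enum import Enum


class FunctionalEntity(Enum):
    """The six OAIS functional entities (names as commonly cited)."""
    INGEST = "Ingest"
    ARCHIVAL_STORAGE = "Archival Storage"
    DATA_MANAGEMENT = "Data Management"
    ADMINISTRATION = "Administration"
    PRESERVATION_PLANNING = "Preservation Planning"
    ACCESS = "Access"


class InformationPackage(Enum):
    """The three OAIS information package types."""
    SIP = "Submission Information Package"
    AIP = "Archival Information Package"
    DIP = "Dissemination Information Package"


class Role(Enum):
    """The three roles in the OAIS environment."""
    PRODUCER = "Producer"
    MANAGEMENT = "Management"
    CONSUMER = "Consumer"


# Illustrative mapping: which package each role exchanges with the archive.
# Management sets policy and exchanges no package in this simplified view.
ROLE_PACKAGE = {
    Role.PRODUCER: InformationPackage.SIP,   # submitted to the archive
    Role.CONSUMER: InformationPackage.DIP,   # delivered by the archive
    Role.MANAGEMENT: None,
}
```

The AIP is the form held inside the archive itself, which is why it does not appear in the mapping.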
3. How Do Curation and Preservation Differ?

DCC Curation Lifecycle Model

Full Lifecycle Actions
  Description and Representation Information
  Preservation Planning
  Community Watch and Participation
  Curate and Preserve
Sequential Actions
  Conceptualise
  Create or Receive
  Appraise and Select
  Ingest
  Preservation Action
  Store
  Access, Use and Reuse
  Transform
Occasional Actions
  Dispose
  Reappraise
  Migrate
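
The three groups of actions listed on this slide can be written down as a small data structure. This is only a sketch: it assumes the order in which the sequential actions appear on the slide is the intended order, and the helper names `kind_of` and `next_sequential` are hypothetical.

```python
from typing import Optional

FULL_LIFECYCLE_ACTIONS = (
    "Description and Representation Information",
    "Preservation Planning",
    "Community Watch and Participation",
    "Curate and Preserve",
)
SEQUENTIAL_ACTIONS = (
    "Conceptualise",
    "Create or Receive",
    "Appraise and Select",
    "Ingest",
    "Preservation Action",
    "Store",
    "Access, Use and Reuse",
    "Transform",
)
OCCASIONAL_ACTIONS = ("Dispose", "Reappraise", "Migrate")


def kind_of(action: str) -> str:
    """Return which of the three groups an action belongs to."""
    if action in FULL_LIFECYCLE_ACTIONS:
        return "full lifecycle"
    if action in SEQUENTIAL_ACTIONS:
        return "sequential"
    if action in OCCASIONAL_ACTIONS:
        return "occasional"
    raise ValueError(f"unknown action: {action}")


def next_sequential(action: str) -> Optional[str]:
    """Return the sequential action listed after `action`, or None at the end.

    Raises ValueError if `action` is not a sequential action.
    """
    index = SEQUENTIAL_ACTIONS.index(action)
    if index + 1 < len(SEQUENTIAL_ACTIONS):
        return SEQUENTIAL_ACTIONS[index + 1]
    return None
```

For example, `next_sequential("Ingest")` returns `"Preservation Action"`, while the full lifecycle actions apply at every stage and so have no successor.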
3. How Do Curation and Preservation Differ?

Participants

Preservation
  Data providers, data preservers, authorized users
Curation
  Data creators, data providers, data archivists, data consumers
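3. How Do Curation and Preservation Differ?

Time span covered

Preservation
  From the point the data is submitted until the required future date, ensuring that the data survives
Curation
  From the creation of the data through its entire lifecycle, with disposal along the way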
3. How Do Curation and Preservation Differ?

Scope of data use

Preservation
  Authorized access
Curation
  Data sharing, data reuse
3. How Do Curation and Preservation Differ?

Approaches and methods

Preservation
  Migration, emulation
Curation
  creation and management
  add value to generate new sources of information and knowledge
3. How Do Curation and Preservation Differ?

How active the process is

Preservation
  Preservation implied a passive state
Curation
  Digital material is actively preserved
  active management of data throughout the research lifecycle.
  active management and appraisal of data over the entire life cycle.
3. How Do Curation and Preservation Differ?

Where the data is kept

Preservation
  inaccessible “dark archive”
Curation
  Open Trusted Repositories
4. Digital Curation Challenges

e-Science Curation Report
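4. Digital Curation Challenges

Data Tsunami, Data Deluge, ultra-large-scale data

CERN (European Organization for Nuclear Research)
ESA (European Space Agency)
Data volumes will be larger still in the future, and data will grow faster

Astronomical observation data
  Sloan Digital Sky Survey: 25 terabytes of data produced in the 10 years up to 2008
  2014: Large Synoptic Survey Telescope, 20 terabytes per night
  2019: the Square Kilometre Array radio telescope will produce 50 TB of processed data; counted as raw data, 7000 TB per second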
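
A short back-of-the-envelope calculation puts the figures on this slide on a common footing. It is a sketch under stated assumptions: one observing night on every day of the year for the Large Synoptic Survey Telescope, and the raw rate sustained around the clock for the Square Kilometre Array; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Sloan Digital Sky Survey: 25 TB over 10 years (figure from the slide).
sdss_tb_per_year = 25 / 10

# Large Synoptic Survey Telescope: 20 TB per night (figure from the slide),
# assuming an observing night on every day of the year.
lsst_tb_per_year = 20 * 365

# Square Kilometre Array: 7000 TB of raw data per second (figure from the
# slide), assuming that rate is sustained for a full day.
ska_raw_tb_per_day = 7000 * SECONDS_PER_DAY

# How many times more data per year LSST yields than SDSS did.
lsst_vs_sdss = lsst_tb_per_year / sdss_tb_per_year

print(f"SDSS: {sdss_tb_per_year} TB/year")
print(f"LSST: {lsst_tb_per_year} TB/year ({lsst_vs_sdss:.0f}x SDSS)")
print(f"SKA raw: {ska_raw_tb_per_day:,} TB/day")
```

Under these assumptions one year of LSST observing is about 2,920 times the annual output of the Sloan survey, and one day of raw SKA data is 604,800,000 TB.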
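4. Digital Curation Challenges

Big Data -> big data science

The era of "big data science" has arrived
  It is not limited to large facilities or to a few fields of science
  Big data science is a new paradigm of scientific discovery
    Data-intensive Science, Data-intensive Discovery
    It is present in every field of research
      The value of data produced by observation, experiment, and computation keeps growing
      This holds for the physical sciences, the humanities, and the social sciences alike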
4. Digital Curation Challenges

Data as the Infrastructure

European Union
  “In a sense, the physical and technical infrastructure becomes invisible and the data themselves become the infrastructure - a valuable asset, on which science, technology, the economy and society can advance”
  The GRDI2020 project will build a Research Data Infrastructure to promote the integration of data management systems, digital libraries, research libraries, digital repositories, tools, and research teams
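4. Digital Curation Challenges

Big data science brings a series of new problems and challenges

Data policy
Curation planning
Reliability and trustworthiness of curation
Characterization of curated content
Technical frameworks for curation
Repository systems for curation
......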
4. Digital Curation Challenges

Data policy

How should publicly funded research data be preserved and used?
The overall trend is toward open access and use
OECD
  Principles and Guidelines for Access to Research Data from Public Funding (2007)
Research Councils UK (RCUK)
  Common Principles on Data Policy, 2011
United States
  Open Data Policy-Managing Information as an Asset, 9 May 2013
4. Digital Curation Challenges

Curation planning

Questions
  Does Open Data mean publishing all data?
  What exactly is the goal of preservation?
  Should everything be preserved?
Related projects
  MaRDI-Gross project
    Managing Research Data Infrastructures – Big Science
    DMP planning within big science
    developing a toolkit to provide guidelines on the application
  Plato is a well-established tool for systematic preservation planning
4. Digital Curation Challenges

Reliability and trustworthiness of curation

Data Seal of Approval (DSA)
Audit and certification of trustworthy digital repositories (ISO 16363)
Criteria for Trustworthy Digital Archives (DIN 31644)
4. Digital Curation Challenges

Characterization of curated content

Large-scale content profiling for preservation analysis
  C3PO (Clever, Crafty Content Profiling of Objects)
  SCAPE
Data Curation Profile
  Purdue University Libraries
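4. Digital Curation Challenges

Technical frameworks for curation

In a big data science environment, conventional technical frameworks cannot solve the problems that arise
  How can preservation analysis be carried out quickly and effectively? Applying the MapReduce approach
  The EU FP7 project SCAPE (Scalable Preservation Environments)
  The UK Research Data Management Infrastructure (RDMI) Projects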
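
To illustrate how the MapReduce approach applies to preservation analysis, the sketch below profiles a collection by file extension. It is not code from SCAPE or C3PO: it runs in a single process with plain Python, uses the extension as a stand-in for real format identification, and the function names are hypothetical. In a real deployment the map and reduce steps would be distributed across a cluster.

```python
from collections import Counter
from functools import reduce
from pathlib import PurePosixPath
from typing import Iterable


def map_extension(path: str) -> Counter:
    """Map step: emit a count of 1 for the file's lower-cased extension."""
    extension = PurePosixPath(path).suffix.lower() or "(none)"
    return Counter({extension: 1})


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial profiles by summing their counts."""
    return left + right


def profile(paths: Iterable[str]) -> Counter:
    """Build a format profile for a collection of file paths."""
    return reduce(reduce_counts, map(map_extension, paths), Counter())
```

Because the reduce step is associative, partial profiles computed on separate machines can be merged in any order, which is what lets this kind of analysis scale to very large collections.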
4. Digital Curation Challenges

Repository systems for curation

Supporting the preservation lifecycle in repositories, or2013.net
4. Digital Curation Challenges

Repository systems for curation

A lifecycle framework for large-scale preservation repositories, built on current tools
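5. Concluding Remarks

Good research depends on good data.
To keep scientific data that could lead to Nobel Prizes from being lost, data preservation must be carried out effectively.
Digital Curation is an important means of actively preserving and managing data.
In big data science, Digital Curation still faces many key challenges.
We hope more people will pay attention to and study this important issue.

Comments Welcome

Thanks to the authors of all the resources cited in this talk.
Comments and corrections from all experts are welcome!
Zhang Zhixiong (张智雄)
zhangzhx@mail.las.ac.cn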