Digital Curation for the Big Data Sciences
  Zhang Zhixiong
  National Science Library, Chinese Academy of Sciences

Outline
  The rise of Digital Curation
  What is Digital Curation?
  How do Digital Curation and Preservation differ?
  Digital Curation challenges and problems raised by big data science, and responses
  Concluding remarks

1. The Rise of Digital Curation
  Data Deluge

1. The Rise of Digital Curation
  From Data Deluge to Data Curation
    Philip Lord, Alison Macdonald, Liz Lyon, David Giaretta
    The Digital Archiving Consultancy Limited and the Digital Curation Centre

1. The Rise of Digital Curation
  Founding of the Digital Curation Centre (DCC)
    Supported by the e-Science Core project, the DCC was established on 1 March 2004
    Headquartered at the National e-Science Centre in Edinburgh
      University of Edinburgh (lead; Informatics, Law, Information Services and research institutes)
      University of Glasgow (HATII and Information Services)
      UKOLN, University of Bath
      Council for the Central Laboratory of the Research Councils (CCLRC)

1. The Rise of Digital Curation
  Conferences and journals
    International Digital Curation Conference, Bath, Sep. 29-30, 2005
    DigCCurr 2007, DigCCurr 2009, DigCCurr 2013
    8th International Digital Curation Conference, Amsterdam, 14-17 January 2013
    An International Symposium on Digital Curation (April 18-20, 2007)
    Digital Curation Practice, Promise and Prospects (April 1-3, 2009)
    Chapel Hill, North Carolina, United States
    Public Symposium, 2010-2013
    International Journal of Digital Curation, published since 2006, http://www.ijdc.net/

1. The Rise of Digital Curation
  Institutions named after curation
    The Greek Digital Curation Unit (DCU) at the Athena Research Centre (2007)
    UC3, University of California Curation Center (2010)
    The Digital Research and Curation Center at The Johns Hopkins University’s Sheridan Libraries
    The Digital Curation Institute, established by the University of Toronto’s iSchool (2010)
    Purdue University Library’s Distributed Data Curation Center (D2C2) (2009)
    ......
1. The Rise of Digital Curation
  Curation-related education and training
    DigCCurr I (2006-09), DigCCurr II (2008-13)
    International Data curation Education Action (IDEA) Working Group
    School of Information and Library Science (SILS), University of North Carolina at Chapel Hill; NARA
    Preserving Access to Our Digital Future: Building an International Digital Curation Curriculum
    Extending an International Digital Curation Curriculum to Doctoral Students and Practitioners
    Developing an International Curation and Preservation Training and Education Roadmap
    “Education for Digital Stewardship: Librarians, Archivists or Curators?”
    Masters Programme in Digital Curation, Luleå University of Technology
    IFLA, 2011, “Education for Digital Curation”
    Board on Research Data and Information

1. The Rise of Digital Curation
  Related tools
    Data Asset Framework (DAF): enumerating and auditing data holdings
    DRAMBORA: self-assessment of possible risk
    Digital Preservation Suite: preservation plans
    DROID: identifies file formats
    TRAC: Trustworthy Repositories Audit & Certification, Criteria and Checklist
    ......

2. What Is Digital Curation?
  First, a word on Digital Preservation
  Digital is a double-edged sword
    Advantages
      Convenient and easy to use, copyable, easy to transmit, portable in bulk ...
    Problems
      Fragility: deletion, theft, modification, distortion ...
      Dependency: technology, systems, standards, software, context (metadata), organisation, economics ...
      Rapid obsolescence: media, hardware, software, formats ...

2. What Is Digital Curation?
  Digital Preservation
    Became a major concern on 1 May 1996
      Preserving Digital Information: Report of the Task Force on Archiving of Digital Information
      Commission on Preservation and Access; Research Libraries Group, Inc. (RLG)
    Goal: “continued access indefinitely into the future of records stored in digital electronic form.”
    http://www.clir.org/pubs/reports/pub63/reports/pub63watersgarrett.pdf

2. What Is Digital Curation?
  By the early 21st century, digital preservation (DP) had become an important field within digital libraries
    Main research topics: preservation strategies and methods, preservation metadata, storage architectures, preservation repositories, preservation workflows, Web archiving, preservation information models
    Main standards and specifications: Open Archival Information System (OAIS, 2002)
    Main digital preservation systems and service frameworks: e-Depot DIAS, NDIIPP, LOCKSS, Portico, CDL DPR, FCLA DAITSS ......

2. What Is Digital Curation?
  Why did Digital Curation emerge at all?
    Digital preservation already had two accepted terms
      Digital Preservation
      Digital Archiving
    So why propose Digital Curation as well?
    What is Digital Curation? How do its thinking and methods differ from those of Digital Preservation?

2. What Is Digital Curation?
  Digital Curation: a newly coined term
    Digital Data Curation Task Force: Report of the Task Force Strategy Discussion Day, Tuesday, 26th November 2002, Centre Point, London WC1; January 2003
    e-Science Curation Report: Data curation for e-Science in the UK: an audit to establish requirements for future curation and provision, 2003; JCSR (the Joint Information Systems Committee’s

2. What Is Digital Curation?
  Digital Data Curation Task Force
    Convened by Tony Hey, then chair of the JCSR
    Aim: to define and build a curation strategy for primary research data in the UK
    Meeting date: 26 November 2002
    The application of the term “curation” is new, and in several ways the meeting found itself grappling with questions of scope, with frequent overlap with questions relating to digital preservation. It did not reach a definition of the term.

2. What Is Digital Curation?
  Digital Data Curation Task Force
  What is curation?
    Dr John Taylor, Director General of the Research Councils
    Tony Hey: distinguish the actions involved in caring for digital data beyond its original use, from digital preservation.
    Seamus Ross: “curation in the museum sense” covers three core concepts: conservation, preservation and access
    Alison Allden: “curation” implied an active management of information, involving planning. Reuse of data is a core issue. If data is to be reused, then it needs special treatment
    Rolf Apweiler: curation is when people add value to

2. What Is Digital Curation?
  e-Science Curation Report
    “curation” derives from “curator”: somebody who keeps something for the public good, whose value often needs to be brought out by the curator.
    Two important features
      more support for explicit policies with regard to data sharing
      the digital curator is a store-keeper, but he should take an active role in promoting and adding value to his holdings

2. What Is Digital Curation?
  e-Science Curation Report
    Previously: “curation” is commonly used to refer to the work done on genomic and proteomic databases, annotating and managing annotations
    Now: It covers a wider context than just archiving; it embraces the care of the record within scientific context and environment

2. What Is Digital Curation?
  e-Science Curation Report
  Working definitions
    Curation: The activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Higher levels of curation will also involve maintaining links with annotation and with other published materials
    Archiving: A curation activity which ensures that data is properly selected, stored, can be accessed and that its logical and physical integrity is maintained over time, including security and authenticity
    Preservation: An activity within archiving in which specific items of data are maintained over time so that

2. What Is Digital Curation?
  e-Science Curation Report
    That the objective of digital curation of primary research data is to keep data which is valuable, potentially valuable or which is required to be kept; and in such a way that it is accessible and usable by others (while observing relevant restrictions), that its value is maintained and, where possible, enhanced; and that this activity and service should

2. What Is Digital Curation?
  Definition in the JISC circular
    JISC circular 6/03 (Revised), July 2003
    The term ‘digital curation’ is increasingly being used for the actions needed to maintain and utilise digital data and research results over their entire life-cycle for current and future generations of users.
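The working definition of archiving above asks that the "logical and physical integrity" of data be "maintained over time". One common way to check this in practice is a fixity check: record a checksum for each file at ingest and recompute it later. The sketch below is only an illustration under stated assumptions, not a method taken from the report; the function names and the manifest layout (relative path mapped to SHA-256 digest) are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root to its checksum (hypothetical manifest layout)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose content has changed."""
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel_path)
    return damaged


if __name__ == "__main__":
    # Demonstration on a throwaway directory with made-up content.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "run1.dat").write_bytes(b"raw observations")
        manifest = build_manifest(root)          # recorded at ingest
        (root / "run1.dat").write_bytes(b"raw observations, silently altered")
        print("damaged files:", verify_manifest(root, manifest))
```

A check like this only detects change or loss; repairing a damaged file still depends on keeping independent copies.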
2. What Is Digital Curation?
  DCC definition 1
    DCC Approach to Digital Curation, 15 Aug 2004
      curation: general term - taking care of things
      data curation: looking after and adding value to data
      digital curation: looking after and somehow "adding value" to digital data. This probably implies creating some new data from the existing, in order to make the latter more useful and "fit for purpose".

2. What Is Digital Curation?
  DCC definition 2
    DCC Charter and Statement of Principles
    What is digital curation?
      Digital curation is maintaining and adding value to a trusted body of digital research data for current and future use; it encompasses the active management of data throughout the research lifecycle.
    http://www.dcc.ac.uk/about-us/dcc-charter/dcc-charter-and-statement-principles

2. What Is Digital Curation?
  DCC definition 3
    Digital curation involves maintaining, preserving and adding value to digital research data throughout its lifecycle.
    The active management of research data reduces threats to their long-term research value and mitigates the risk of digital obsolescence. Meanwhile, curated data in trusted digital repositories may be shared among the wider UK research community.
    As well as reducing duplication of effort in research data creation, curation enhances the long-term value of existing data by making it available for further high quality research
    http://www.dcc.ac.uk/digital-curation/what-digital-curation

2. What Is Digital Curation?
  DCC definition 4
    DCC Briefing Papers
    Digital curation is the management and preservation of digital data over the long-term.
    All activities involved in managing data from planning its creation, best practice in digitisation and documentation, and ensuring its availability and suitability for discovery and re-use in the future are part of digital curation.
    Digital curation can also include managing vast data sets for daily use, for example ensuring that they can be searched and continue to be readable.
    Digital curation is therefore applicable to a large range of professional situations from the beginning of the information life-cycle to the end; digitisers, metadata creators, funders, policy-makers, and repository managers to name a few examples
    http://www.dcc.ac.uk/resources/briefing-papers/introduction-curation

3. How Do Curation and Preservation Differ?
  JISC comparison of preservation and curation
    JISC Digital Preservation briefing paper
    Digital preservation: actions and interventions ensure continued and reliable access to authentic digital objects for as long as they are deemed to be of value.
    Digital curation: maintaining and adding value to a trusted body of digital information for future and current use; active management and appraisal of data over the entire life cycle. Builds upon the underlying concepts of digital preservation, emphasising opportunities for added value and knowledge through annotation and continuing resource management.
    http://sitecore.jisc.ac.uk/publications/briefingpapers/2006/pub_digipreservationbp.aspx

3. How Do Curation and Preservation Differ?
  ARL comparison of the two
    New Roles for New Times: Digital Curation for Preservation, March 2011
    Digital curation refers to the actions people take to maintain and add value to digital information over its lifecycle, including the processes used when creating digital content.
    Digital preservation focuses on the “series of managed activities necessary to ensure continued access to digital materials for as long as necessary.”
    intersection of these actions, digital

3. How Do Curation and Preservation Differ?
  Comparison in “Digital Curation: The Emergence of a New Discipline”
    Digital preservation efforts originally focussed on ensuring that material survived technical obsolescence and organisational mismanagement.
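    Preservation implied a passive state, where material would be mothballed in an inaccessible “dark archive”, with only a few authorised users, to ensure that it retained its integrity and authenticity
    Digital curation: ensuring that digital material is managed throughout its lifecycle so that it remains accessible to those who need to use it. Metadata is used to both improve accessibility and discoverability; and to control authentication procedures, creating audit trails to ensure that material cannot be accessed or altered by those not authorised to do so. Digital material is actively preserved, used and reused for new

3. How Do Curation and Preservation Differ?
  Problems addressed
    Preservation: technological obsolescence and organisational failure
    Curation: From Data Deluge to Data Curation; data volumes, complexity of the data itself

3. How Do Curation and Preservation Differ?
  Purpose of the activity
    Preservation: the survival of the data; ensuring data integrity, trustworthiness and authenticity
    Curation: making the data usable for research; managing the data and adding value to it

3. How Do Curation and Preservation Differ?
  Goals achieved
    Preservation: data that is accessible, understandable and usable
    Curation: management of the entire data lifecycle, including the creation of data and new data generated from existing data, so that data is used and regenerated

3. How Do Curation and Preservation Differ?
  Whom does it serve?
    Preservation: future generations
    Curation: current and future users

3. How Do Curation and Preservation Differ?
  Reference models
    Preservation: OAIS reference model
    Curation: DCC Curation Lifecycle Model

3. How Do Curation and Preservation Differ?
  OAIS reference model
    6 functional activities, 3 types of information package, 3 roles

3. How Do Curation and Preservation Differ?
  DCC Curation Lifecycle Model
    Full Lifecycle Actions: Description and Representation Information; Preservation Planning; Community Watch and Participation; Curate and Preserve
    Sequential Actions: Conceptualise; Create or Receive; Appraise and Select; Ingest; Preservation Action; Store; Access, Use and Reuse; Transform
    Occasional Actions: Dispose; Reappraise; Migrate

3. How Do Curation and Preservation Differ?
  Participants
    Preservation: data providers, data preservers, authorised users
    Curation: data creators, data providers, data archivists, data consumers

3. How Do Curation and Preservation Differ?
  Period covered
    Preservation: from the point the data is submitted through the required future period, ensuring that the data survives
    Curation: from the creation of the data through its entire lifecycle, with disposal along the way

3. How Do Curation and Preservation Differ?
  Scope of data use
    Preservation: authorised access
    Curation: data sharing, data reuse

3. How Do Curation and Preservation Differ?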
  Approaches and methods
    Preservation: migration, emulation
    Curation: creation and management; add value to generate new sources of information and knowledge

3. How Do Curation and Preservation Differ?
  Degree of active involvement
    Preservation: Preservation implied a passive state
    Curation: Digital material is actively preserved; active management of data throughout the research lifecycle; active management and appraisal of data over the entire life cycle.

3. How Do Curation and Preservation Differ?
  Where the data is kept
    Preservation: inaccessible “dark archive”
    Curation: Open Trusted Repositories

4. Digital Curation Challenges
  e-Science Curation Report (three figure slides)

4. Digital Curation Challenges
  Data Tsunami, Data Deluge, ultra-large-scale data
    CERN (European Organization for Nuclear Research)
    ESA (European Space Agency)
    Future data volumes will be larger and will grow faster
    Astronomical observation data
      Sloan Digital Sky Survey: 25 terabytes of data in the ten years up to 2008
      2014: Large Synoptic Survey Telescope, 20 terabytes per night
      2019: the Square Kilometre Array radio telescope will produce 50 TB of processed data; counted as raw data, 7,000 TB per second

4. Digital Curation Challenges
  Big Data -> big data science
    The era of "big data science" has arrived
    It is not limited to big facilities or to a few fields of science
    Big data science is a new paradigm of scientific discovery: Data-intensive Science, Data-intensive Discovery
    It is present in every field of research
    Data produced by observation, experiment and computation is of ever-growing value
    This holds for the physical sciences, the humanities and the social sciences alike

4. Digital Curation Challenges
  Data as the Infrastructure
    European Union: “In a sense, the physical and technical infrastructure becomes invisible and the data themselves become the infrastructure a valuable asset, on which science, technology, the economy and society can advance”
    The GRDI2020 project will build a Research Data Infrastructure, promoting the integration of data management systems, digital libraries, research libraries, digital repositories, tools and research teams

4. Digital Curation Challenges
  Big data science brings a series of new problems and challenges
    Data policy
    Curation planning
    Reliability and trustworthiness of curation
    Profiling of curated content
    Technical frameworks for curation
    Repository systems for curation
    ......
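To put the survey figures quoted in this section in perspective, the short calculation below compares the Sloan Digital Sky Survey rate (25 TB over ten years) with the Large Synoptic Survey Telescope rate (20 TB per night). It is a back-of-the-envelope sketch: the two input figures come from the slides, while the assumption that the telescope observes 365 nights a year and all function names are illustrative only.

```python
# Figures quoted in the slides.
SDSS_TOTAL_TB = 25.0       # Sloan Digital Sky Survey, ten years up to 2008
SDSS_YEARS = 10
LSST_TB_PER_NIGHT = 20.0   # Large Synoptic Survey Telescope

# Assumption for illustration only: observing every night of the year.
NIGHTS_PER_YEAR = 365


def sdss_tb_per_year() -> float:
    """Average yearly volume of the Sloan survey over its first decade."""
    return SDSS_TOTAL_TB / SDSS_YEARS


def lsst_tb_per_year(nights: int = NIGHTS_PER_YEAR) -> float:
    """Yearly volume of LSST for a given number of observing nights."""
    return LSST_TB_PER_NIGHT * nights


def nights_to_match_sdss_total() -> float:
    """Number of LSST nights needed to equal the whole Sloan decade."""
    return SDSS_TOTAL_TB / LSST_TB_PER_NIGHT


if __name__ == "__main__":
    print(f"SDSS: {sdss_tb_per_year():.1f} TB per year")
    print(f"LSST: {lsst_tb_per_year():.0f} TB per year")
    print(f"LSST nights to match the SDSS decade: {nights_to_match_sdss_total():.2f}")
```

Under these assumptions, a little more than one night of LSST observation matches a decade of Sloan data, which is the scale shift behind the curation challenges listed above.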
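4. Digital Curation Challenges
  Data policy
    How should publicly funded research data be preserved and used?
    The overall trend is toward open access and open use
      OECD: Principles and Guidelines for Access to Research Data from Public Funding (2007)
      Research Councils UK (RCUK): Common Principles on Data Policy, 2011
      United States: Open Data Policy-Managing Information as an Asset, 9 May 2013

4. Digital Curation Challenges
  Curation planning
    Questions
      Does Open Data mean publishing all data?
      What exactly is the goal of preservation? Everything should be preserved?
    Related projects
      MaRDI-Gross project: Managing Research Data Infrastructures – Big Science
        DMP planning within big science
        developing a toolkit to provide guidelines on the application
      Plato is a well-established tool for systematic preservation planning

4. Digital Curation Challenges
  Reliability and trustworthiness of curation
    Data Seal of Approval (DSA)
    Audit and certification of trustworthy digital repositories (ISO 16363)
    Criteria for Trustworthy Digital Archives (DIN 31644)

4. Digital Curation Challenges
  Profiling of curated content
    Large-scale content profiling for preservation analysis
      C3PO (Clever, Crafty Content Profiling of Objects), SCAPE
    Data Curation Profile, Purdue University Libraries

4. Digital Curation Challenges
  Technical frameworks for curation
    In a big data science environment, conventional technical frameworks cannot solve the problems that arise
    How can preservation analysis be carried out quickly and effectively? Applying the MapReduce approach
    The EU FP7 project SCAPE (Scalable Preservation Environments)
    The UK Research Data Management Infrastructure (RDMI) Projects

4. Digital Curation Challenges
  Repository systems for curation
    Supporting the preservation lifecycle in repositories, or2013.net

4. Digital Curation Challenges
  Repository systems for curation
    A lifecycle framework for large-scale preservation repositories built on current tools

5. Concluding Remarks
  Good research needs good data behind it
  To keep scientific data that could lead to Nobel Prizes from being lost, data preservation has to be carried out effectively
  Digital Curation is an important means of actively preserving and managing data
  In a big data science environment, Digital Curation still faces many key challenges
  We hope that more people will pay attention to this important problem and study it

Comments Welcome
  Thanks to the authors of all the resources cited in this talk
  Comments and corrections from all experts are welcome!
  Zhang Zhixiong
  zhangzhx@mail.las.ac.cn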