International Workshop on Introduction to the DDI and the IHSN Microdata Management Toolkit
United Nations Department of Economic and Social Affairs, Statistics Division
National Bureau of Statistics of China
Beijing, 17-19 June 2013

Workshop objectives - Context
Generic Statistical Business Process Model (GSBPM)
Phases: Specify the needs, Design, Build, Collect, Process, Analyze, Disseminate, Archive, Evaluate
Overarching processes: Metadata Management, Quality Management
The GSBPM describes statistical processes (e.g., implementation of a survey) in 9 phases, each divided into subprocesses. It is a convenient tool for the assessment and planning of statistical processes.

Workshop objectives
The workshop will introduce standards and tools for:
• Metadata management
  – The DDI standard
  – IHSN Metadata Editor
• Dissemination
  – Policy, technical and ethical issues
  – NADA software
• Archiving
  – Preservation of digital information

Metadata management
Part 1
Documenting your surveys and censuses using the DDI Metadata Standard and the IHSN Metadata Editor (Nesstar Publisher)

Why do data producers need metadata?
• To increase the credibility and transparency of their statistical outputs
• To preserve institutional memory
• To allow replication of data collection and analysis
• To allow re-use or re-purposing of the metadata

Why do data users need metadata?
• To fully understand the (micro)data and make good use of them
  – To minimize the risk of misuse or misinterpretation, users need to fully understand the data. Why, by whom, when, and how the data were collected and processed are important information.
• To make data discoverable in on-line catalogs
  – Users will learn about the availability of your data by searching or browsing detailed metadata catalogs.

Standards and tools
• The Data Documentation Initiative (DDI) metadata standard helps structure, preserve and share survey or census metadata
• The IHSN Microdata Management Toolkit, a.k.a. Nesstar Publisher, provides a free and user-friendly solution to document and catalog surveys/censuses in compliance with the DDI standard and international best practices

What is the DDI?
• A checklist of what you need to know about a study and its dataset
  – A structured and comprehensive list of hundreds of elements that may be used to document a survey dataset
• An XML metadata standard
• Developed by academic data centers / the DDI Alliance
• Designed to encompass the kinds of data generated by surveys, censuses, and administrative records
• For microdata, not indicators
• Two versions:
  – Version 2.n (DDI Codebook), used by the IHSN Toolkit
  – Version 3.n (DDI Lifecycle)

What is XML?
• XML stands for eXtensible Markup Language. It is used to structure information to be shared on the Web or exchanged between software systems.
• XML is a plain-text file format, readable in any text editor (e.g., Notepad).
• XML tags text for meaning; HTML tags text for appearance. XML “tags” are conceptually the same as “fields” in a database.
• In an XML file, each piece of information is wrapped between an opening tag and a closing tag. The tag name indicates the content.

DDI and XML - An example
“The National Statistics Office (NSO) of Popstan conducted the Multiple Indicator Cluster Survey (MICS) with the financial support of UNICEF. 5,000 households, representing the overall population of the country, were randomly selected to participate in the survey, following a two-stage stratified sampling methodology. 4,900 of these households provided information.”
In XML/DDI, this would look like:
<titl>Multiple Indicator Cluster Survey 2005</titl>
<altTitl>MICS 2005</altTitl>
<AuthEnty>National Statistics Office (NSO)</AuthEnty>
<fundAg abbr="UNICEF">United Nations Children's Fund</fundAg>
<nation>Popstan</nation>
<geogCover>National</geogCover>
<sampProc>5,000 households, stratified two stages</sampProc>
<respRate>98 percent</respRate>

Advantage of XML
• Can be transformed into many kinds of outputs:
  – Databases, HTML, PDF, on-line catalogs, others
• Plain text files.
Not specific to any operating system or application
• Easy to generate using specialized tools such as the IHSN Metadata Editor

Structure of the DDI 2.0 standard
The DDI elements are organized in five sections:
1. Document Description. Used to document the documentation process (“metadata on metadata”).
2. Study Description. Information about the survey such as title, dates/method of data collection, sampling, funding, etc.
3. Data File Description. Content, producer, version, etc.
4. Variable Description. Literal question, universe, labels, derivation and imputation methods, etc.
5. Other Material. Description of materials related to the study such as questionnaires, coding information, reports, interviewer's manuals, data processing and analysis programs, etc.

Exercises
Workshop participants will install the IHSN Metadata Editor (a.k.a. Nesstar Publisher) and document a small census dataset.
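
As a minimal sketch of why tagged XML metadata is easy to reuse, the Python snippet below reads the elements of the MICS slide example with the standard library and renders them as a simple HTML list. The `<example>` wrapper element and the function names `read_fields` and `to_html_list` are assumptions made for illustration only; a real DDI file nests these elements more deeply and is normally produced by the Metadata Editor rather than by hand.

```python
import html
import xml.etree.ElementTree as ET

# Elements adapted from the slide example. The <example> wrapper is
# hypothetical: it only gives the fragment a single root so it can be parsed.
DDI_FRAGMENT = """
<example>
  <titl>Multiple Indicator Cluster Survey 2005</titl>
  <altTitl>MICS 2005</altTitl>
  <AuthEnty>National Statistics Office (NSO)</AuthEnty>
  <fundAg abbr="UNICEF">United Nations Children's Fund</fundAg>
  <nation>Popstan</nation>
  <geogCover>National</geogCover>
  <sampProc>5,000 households, stratified two stages</sampProc>
  <respRate>98 percent</respRate>
</example>
"""


def read_fields(xml_text):
    """Return a dict mapping each tag name to its text content."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


def to_html_list(fields):
    """Render the fields as an HTML definition list (one possible output)."""
    rows = [
        f"<dt>{tag}</dt><dd>{html.escape(value)}</dd>"
        for tag, value in fields.items()
    ]
    return "<dl>" + "".join(rows) + "</dl>"


fields = read_fields(DDI_FRAGMENT)
# Attributes such as abbr="UNICEF" are read from the element itself.
funder_abbr = ET.fromstring(DDI_FRAGMENT.strip()).find("fundAg").get("abbr")

print(fields["titl"])
print(funder_abbr)
print(to_html_list(fields))
```

Because the tag names carry the meaning, the same parsed fields could equally be loaded into a database or an on-line catalog; the HTML list is just one of the outputs mentioned on the slide.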

Exercise data files
Content of the USB drive provided to participants
Chinese version of:
• Popstan census data files (2) in Stata format
• Census questionnaire
• Enumerator manual
Same content in English
Selected technical and policy guidelines
IHSN Metadata Editor software and templates

Exercise 1 – Installation
• Run NesstarPublisherInstaller_v4.0.9.exe to install the software
• The next step is to install the IHSN templates: open the Template Manager
• Click on “Import” and select the English (EN) or Chinese (CN) template found in the folder “Software”
• Then select the added template and click “Use” to activate it. This will now be the default study template.
• Repeat the exact same process for the Resource Description Template

Exercise 2 - Documentation
The next steps will be to document the Census:
- Import the data files (Stata)
- Add metadata in the Document Description, Study Description, Data Files Description, and Variables Description sections
- Attach and document the questionnaire and manual as external resources
- Export the metadata to DDI (and RDF) formats

When should data be documented?
Document “as you go”, not after completion of the operation. When documentation is done as a “last step”, much information is lost.
Much information is lost, or never generated.

Software and guidelines
Available at www.ihsn.org
http://www.ihsn.org/home/node/117
http://www.ihsn.org/home/software/ddi-metadata-editor

Metadata and microdata dissemination
Part 2
Formulating a microdata dissemination policy, disseminating data and metadata, and the IHSN National Data Archive (NADA) software

Benefits of dissemination
• Diversity of research work. Data producers usually publish tabular and analytical outputs, but they will never identify all the research questions that can be addressed using the data. Microdata dissemination encourages diversity (and quality) of analysis.
• Credibility/acceptability of data. Broader access to metadata and microdata demonstrates the producer’s confidence in the data, by making replication (or correction) possible by independent parties.
• Reduced duplication. Lack of access to microdata forces users to conduct their own surveys. Microdata dissemination reduces the risk of duplicated activities. It also reduces the burden on respondents and minimizes the risk of inconsistent studies on the same topic.
• Funding. Better use of data means a better return for survey sponsors, who will thus be more inclined to support data collection activities.
• Quality of data. It is often through the use of data that improvements to survey design can be identified.

Costs and risks of dissemination
• Exposure to criticism. Data quality itself often puts a brake on microdata dissemination. Some data producers may fear being exposed to criticism when data are not fully reliable, and being obliged to defend their results when challenged by secondary users.
• Loss of exclusivity. When disseminating microdata, data owners lose their exclusive right to discoveries. This is more of an issue for academic researchers than for official producers.
• Official vs. non-official results, and exposure to contradiction. Dissemination of microdata may lead to a proliferation of differing, and possibly contradictory, results and statistics. It may become increasingly difficult to distinguish between official figures and other sources of statistics.
• Financial cost. Properly documenting and disseminating microdata has a cost. This includes not only the costs of creating and documenting microdata files, but also the costs of creating access tools and safeguards, and of supporting enquiries made by the research community.
• Confidentiality. One of the biggest challenges of microdata dissemination is to minimize the risk of disclosing any data that would reveal the identity of respondents.
• Legality. All countries have specific national statistical and data protection legislation.

Principles - UNECE
• It is appropriate for microdata collected for official statistical purposes to be used for statistical analysis to support research, as long as confidentiality is protected.
• Provision of microdata should be consistent with legal and other necessary arrangements that ensure that the confidentiality of the released microdata is protected.
Source: Managing Statistical Confidentiality and Microdata Access - Principles and Guidelines of Good Practice, by the Conference of European Statisticians (CES) and the United Nations Economic Commission for Europe (UNECE)

Anonymization
• Statistical agencies are charged with protecting the confidentiality of survey respondents.
• Protecting confidentiality requires some form of data anonymization so that individual respondents cannot be identified.

Anonymization concepts
• Identifying variables include:
  – Direct identifiers, which are variables such as names, addresses, or identity card numbers. They should be removed from the published dataset.
  – Indirect identifiers, which are characteristics whose combination could lead to the re-identification of respondents (e.g., region, age, sex, occupation). Such variables are needed for statistical purposes, and should not be removed from the published data files.
• Anonymizing the data involves determining which variables are potential identifiers and modifying the specificity of these variables to reduce the risk of re-identification to an acceptable level. The challenge is to maximize security while minimizing the resulting information loss.
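
The concepts above can be illustrated with a small Python sketch. It removes a direct identifier, counts how many records are unique on their combination of indirect identifiers, and then reduces the specificity of age (ten-year bands with a single “65+” category) to see how the count changes. All records, variable names, and function names here are invented for illustration, and counting unique combinations is only one simple way to look at re-identification risk; it is not the procedure of sdcMicro or of any official guideline.

```python
from collections import Counter

# Hypothetical records: every name and value is invented for illustration.
records = [
    {"name": "A. Li", "region": "North", "age": 34, "sex": "F", "occupation": "teacher"},
    {"name": "B. Wang", "region": "North", "age": 36, "sex": "F", "occupation": "teacher"},
    {"name": "C. Zhao", "region": "South", "age": 71, "sex": "M", "occupation": "farmer"},
    {"name": "D. Chen", "region": "South", "age": 78, "sex": "M", "occupation": "farmer"},
    {"name": "E. Liu", "region": "South", "age": 52, "sex": "M", "occupation": "farmer"},
]

DIRECT_IDENTIFIERS = ["name"]
INDIRECT_IDENTIFIERS = ["region", "age", "sex", "occupation"]


def drop_direct_identifiers(rows):
    """Remove direct identifiers from every record."""
    return [
        {key: value for key, value in row.items() if key not in DIRECT_IDENTIFIERS}
        for row in rows
    ]


def count_unique_combinations(rows):
    """Count records whose combination of indirect identifiers occurs only once."""
    keys = [tuple(row[var] for var in INDIRECT_IDENTIFIERS) for row in rows]
    frequency = Counter(keys)
    return sum(1 for key in keys if frequency[key] == 1)


def recode_age(age):
    """Recode age into ten-year bands, with a single top category for 65+."""
    if age >= 65:
        return "65+"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def recode_ages(rows):
    """Return copies of the records with age replaced by its age group."""
    return [{**row, "age": recode_age(row["age"])} for row in rows]


published = drop_direct_identifiers(records)
risky_before = count_unique_combinations(published)
risky_after = count_unique_combinations(recode_ages(published))

print(risky_before, risky_after)  # prints "5 1"
```

With exact ages every record is unique; after recoding only one record remains unique, at the price of losing the exact age. This is the trade-off between security and information loss described above.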

Anonymization techniques
• Removing variables (e.g., detailed geographic identification)
• Removing records (outliers)
• Global recoding (e.g., from age to age groups)
• Top- or bottom-coding (e.g., create “65+” age category)
• Local suppression (replace with missing)
• Micro-aggregation (e.g., for income variable)
• Data swapping
• Post-randomization
• Noise addition
• Resampling

Anonymization tools and guidelines
• Software: sdcMicro, an open source (R-based) package
• Technical guidelines: http://www.ihsn.org/home/node/118
• More practical guidelines are being produced by the IHSN.
NOTE: Anonymization is a complex process. It requires analytical skills and involves some arbitrary decisions.

Policy guidelines on dissemination
Formulating a microdata access policy
http://www.ihsn.org/home/node/120

Cataloguing
• Data and metadata need to be made visible.
• Users will benefit from advanced data discovery tools, in particular on-line searchable catalogs.
• The IHSN developed an open source application, compliant with the DDI standard, to help disseminate metadata and (optional) microdata. This application (NADA) complements the Metadata Editor.

Dissemination Exercise
Workshop participants will upload their DDI metadata (generated during the documentation exercise) to an on-line, searchable survey catalog.

Survey catalogs
100+ agencies in 65+ countries have started establishing a microdata archive using IHSN tools.

Archiving
Part 3
Preserving data and metadata

Issues
Common issues include:
– Loss of data and metadata, because of human error, technical problems, or disasters such as fire or flood
– Data available, but in unreadable formats or on unreadable media (hardware and software obsolescence)
– Data available, but undocumented
– Documentation only available in hard copy
– Multiple versions of datasets available, with no “versioning” information

Physical threats
Physical damage can occur to hardware and media due to:
• Material instability
• Improper storage environment (temperature, humidity, light, dust)
• Overuse (mainly for physical contact media)
• Natural disaster (fire, flood, earthquake)
• Infrastructure failure (plumbing, electrical, climate control)
• Inadequate hardware maintenance
• Human error (including improper handling)
• Sabotage (theft, vandalism)

Software obsolescence
A file format may be superseded by newer versions and no longer be supported.
[Figure: timeline, 1976 to 2008; labels include XML and HTML 2]

Hardware obsolescence
Storage media are rapidly superseded by smaller, denser, faster media. The device needed to read an “old” medium may no longer be manufactured.
[Figure: timeline, 1975 to 2004]

Preservation policies
Microdata preservation refers to the management of digital data and related metadata over time to guarantee their long-term usability. It requires the establishment and implementation of a preservation policy and procedures.
– Back up your data
– Ensure suitable data storage
  • Refreshing media: copy digital information from one medium to another.
  • Technology preservation: preserve old operating systems, software, and media drives as a disaster recovery strategy.
  • Migrating data: copy or convert data from one technology to another, whether hardware or software.

Guidelines
• Unlike the preservation of information on paper, the preservation of digital information demands constant attention.
• Guidelines: complex, but useful as a “technical audit manual”
  http://www.ihsn.org/home/node/121
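
As a rough illustration of the “back up your data” and “refreshing media” steps from the preservation policies, the Python sketch below copies a file to a second location and checks that the copy is identical by comparing SHA-256 checksums. The checksum comparison, the function names `sha256_of` and `refresh_copy`, and the file and folder names are assumptions added for this example; they are not taken from the IHSN guidelines.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source, destination_dir):
    """Copy a file to another location and verify that the copy is identical."""
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    target = destination_dir / Path(source).name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Checksum mismatch after copying {source}")
    return target


# Demonstration with a temporary file standing in for a data file.
with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "census.dta"  # hypothetical file name
    original.write_bytes(b"placeholder bytes, not real census data")
    copy = refresh_copy(original, Path(workdir) / "backup_medium")
    copy_matches = sha256_of(original) == sha256_of(copy)

print(copy_matches)  # prints "True"
```

Storing the checksum alongside each archived file would also allow later checks that a file has not been silently corrupted on its storage medium.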