We’re here to explain what data management is and why we think it should matter to you, to give you some things to think about in managing your data, and to point you to resources for managing it. This is a first, introductory workshop. Invite questions and comments throughout. [Note: Unless otherwise noted, all images in this presentation are from Stock Exchange (http://www.sxc.hu) and copyright is retained by their owners.]

As we talk with you today, be thinking about how what we are telling you works with your own research process.

Image credit: Science Magazine

The definition does not include “any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This "recorded" material excludes physical objects (e.g., laboratory samples).” This narrow definition mostly takes a retrospective view of your dataset, in that it does not account for raw and intermediate data that may be critical to the research process but that don’t become part of the ‘final’ dataset. Data could be:
• Observational
• Experimental
• Simulated
• Derived

• Data management is a verb – it involves intentional effort and activity.
• The main goals of DM are preservation and reuse, for you and for others.
• Covers all aspects of the data lifecycle, from planning digital data capture methods, whittling down, and ingestion to databases, through providing for access and reuse, to transformation.

1. Requirements: funder, journal, etc.
2. Further your field: speed the pace of discovery by enhancing discovery, access and reuse
   • Research & knowledge base develops more rapidly
3. Increase your visibility & impact: share your data, get credit with data citation
4. Increase dataset value: data are more likely to be re-visited and re-used in ways that may be different from their original intention
5. Integrity: data (documentation will preserve integrity of data over time) & you (achieve ethics compliance [HIPAA] and enhance your reputation via transparency)
   • Allows for validation of your research
6. Efficiency: avoid digital archeology, avoid duplication of effort
   • Allows easier exchange with other scientists
7. Preserve your lifelong body of work!

If we agree that data that are better managed are more easily, rapidly, and effectively shared, then we can also agree that better data management can enable and speed up discoveries.

The study examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression.
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
A [very] short bibliography on citation advantage:
1. Henneken, E.A. & Accomazzi, A. Linking to Data - Effect on Citation Rates in Astronomy. (2011).
2. Dorch, B. On the Citation Advantage of linking to data. (2012).
3. Piwowar, H.A. & Vision, T.J. Data reuse and the open data citation advantage. PeerJ 1, e175 (2013).
4. Sears, J.R. Data Sharing Effect on Article Citation Rate in Paleoceanography. AGU Fall Meet. Abstr. -1, 1628 (2011).

22 February 2013: The Office of Science and Technology Policy in the White House released a memorandum about expanding public access to the results of federally funded research. In addition to scholarly publications, federal agencies are making serious efforts to increase the sharing of research data. All federal agencies with more than $100M in R&D expenditures are subject to this memo.
http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf

All of these federal agencies sponsor work being conducted at OSU. Red $$ are federal agencies funding the most work at OSU; amounts shown are funding to OSU in FY2011-2012. [source: OSU Research Office]

So, we’ve agreed on what “data” means, and what data management is. Let’s talk about the large variety of data that you may obtain or create. The types of data that you use and/or create have a large impact on most of your data management activities.

Most research uses multiple data types & formats. My own dissertation included the generation or use of all of the data types that we’ve discussed (which might help to explain why it took me 7 years to earn a Ph.D.). http://hdl.handle.net/1957/9088
Image credit: Document by Piotrek Chuchla from The Noun Project

http://en.wikibooks.org/wiki/Statistics/Different_Types_of_Data/Quantitative_and_Qualitative_Data
More on qualitative data: “Although we may have categories, the categories may have a structure to them. When there is not a natural ordering of the categories, we call these nominal categories. Examples might be gender, race, religion, or sport. When the categories may be ordered, these are called ordinal variables. Categorical variables that judge size (small, medium, large, etc.) are ordinal variables. Attitudes (strongly disagree, disagree, neutral, agree, strongly agree) are also ordinal variables, however we may not know which value is the best or worst of these issues. Note that the distance between these categories is not something we can measure.”
More on quantitative data: “However, not all numbers are continuous and measurable. For example, the social security number is a number, but not something that one can add or subtract.”

See more on geospatial analysis here: http://en.wikipedia.org/wiki/Geospatial_analysis

If you don’t know yet – make an educated guess. Why is this important? The data types that you are creating and/or using have implications for data documentation, storage and backup, data citation & ethics or reuse, and preservation and sharing (basically, all aspects of data management).

Before I move on – questions about anything so far?

Common data management areas are:
• Planning
• Storage & backup
• Organization & naming
• Documentation & metadata
• Legal & ethical considerations
• Sharing & reuse
• Archiving & preservation
[image source: http://upload.wikimedia.org/wikipedia/commons/2/22/Earth_Western_Hemisphere_transparent_background.png]

Things to consider:
• How much data?
• Resources needed
   • supplies and services
   • costs associated with them
• Roles & responsibilities
• Metadata
   • project level (documentation)
   • data level (description)
• Data formats
• Data storage & backup
   • working data vs. archive vs. shared/discoverable
• Ethics & consent
• Copyright (open data)
• Sharing
   • How? With whom?

Quote from: UE RDM Kit
So, let’s review a few real-world scenarios.

Hard drive failures happen ALL THE TIME, without notice. There’s a one-liner that goes something like, “There are two kinds of people in this world: those who have had a hard drive fail, and those who haven’t had a hard drive fail…yet.” (ha ha)
Personal example: tell the story about Maria’s hard drive failure, and the expensive (several $K) partial recovery of her dissertation data. She lost a month’s worth of work.
image credit: http://www.johnytech.com/wp-content/uploads/2012/09/booterrors.jpg

Something else that happens ALL THE TIME: your computer gets stolen (Wiley, post-research cruise theft example). Or, you crash your bike and your computer goes flying (Angel).
image credit: http://www.flickr.com/photos/jeffreywarren/299422173/

Lesson: offsite, redundant storage is strongly encouraged.

From my post-doctoral work: Following the Chile earthquake in 2010 (magnitude 8.8), a tsunami devastated the coastal town of Dichato, and moved the Universidad de Concepcion’s research vessel 1 km inland. The marine research station was left standing, but everything in it was destroyed (including our samples and data servers). Luckily, we had the data server replicated at OSU, so we only lost the physical samples that were in Chile. Think it can’t happen to you? Yeah, I thought the same thing.
[photo credit: http://www.ocean-partners.org/attachments/473_Being%20towed.JPG]

Not from my personal experience: from the DataONE Data Stories project: https://notebooks.dataone.org/data-stories/the-best-laid-schemes-of-backups-and-redundancy/
Basic synopsis of this story: A researcher had made regular backups of her data, 5 GB worth, to an external drive. Her computer hard drive failed, so she sent it to IT, had it repaired, and went to restore her data using what was on her external backup drive. Unfortunately, only 30% of the data were accessible from the drive. “The only thing they could determine was that the software used to back up the data was not compatible with the recovery process.”
Lesson: “As it turned out, creating backups was not enough to completely guard against tragedy; the only way to make sure those backups could serve their purpose in the event of an emergency was to check them regularly and ensure the data they contained were recoverable.”

From NECDMC Module 4: Backup allows you to restore your data in the event that it is lost. A backup strategy is a plan for ensuring the accessibility of research data during the life of a project. What your strategy will be depends on the amount of data you are working with, how frequently your data change, and the system requirements for storing and rendering them.

From NECDMC Module 4: Would you need just the data files themselves, the software that created them, or customized scripts written for data analysis? Depending on your research project, you may want to perform full, differential, or incremental backups.

Before we can talk about backing up your data, we need to address where you can store it. We will talk about the advantages, disadvantages and longevity of each option. Then, we’ll move on to backup.

Some content from: UE RDM Kit

USB flash drives, Compact Discs (CDs) and Digital Video Discs (DVDs)
If you choose to use CDs, DVDs and USB flash drives (for example, for working data or extra backup copies), you should:
• Choose high quality products from reputable manufacturers.
• Follow the instructions provided by the manufacturer for care and handling, including environmental conditions and labeling.
• Regularly check the media to make sure that they are not failing.
• Periodically 'refresh' the data (that is, copy to a new disk or new USB flash drive) every 2-5 years.
• Ensure that any private or confidential data is password-protected and/or encrypted.
Some content from: UE RDM Kit

For example:
• Fileservers managed by your research group, department or college
• Fileservers managed by Information Services (IS)
image credit: http://www.flickr.com/photos/ketmonkey/3329372324/
Some content from: UE RDM Kit

For example: DropBox, SkyDrive, Box, Mozy
Image credit: Cloud by Benni from The Noun Project
Some content from: UE RDM Kit

See more info here: http://oregonstate.edu/helpdocs/software/google-apps-osu
Fully encrypted

Local storage options:
• Personal computer
• External storage
   • USB drive
   • Drobo box
• Network server
   • PI or department server located near you
Remote storage options:
• Network server
   • Run by Information Services, likely cloud-based
• Cloud storage
   • Google Apps Suite (Drive, Docs, Forms, etc.)
   • up to 30GB of data
   • Variable access levels
   • Login portal: http://oregonstate.edu/main/online-services/google-apps-for-osu

Using a thoughtful, consistent file-naming strategy is the easiest way to make a big difference in organizing your files and making them easy to find (not just by you, but by others in your group and your adviser).
[image credit: PhDComics.com]

Tweet me @AWhitTwit with your #overlyhonestmethods!

1. Be consistent
   • Have conventions for naming: (1) Directory structure (2) Folder names (3) File names
   • Always include the same information (e.g. date and time)
   • Retain the order of information (e.g. YYYYMMDD, not MMDDYYYY)
2. Be descriptive
   • Try to keep file and folder names under 32 characters
   • Within reason, include relevant information such as:
      • Unique identifier (i.e. Project Name or Grant # in folder name)
      • Project or research data name
      • Conditions (Lab instrument, Solvent, Temperature, etc.)
      • Run of experiment (sequential)
      • Date (in file properties too)
   • Use application-specific codes in the 3-letter file extension, in lowercase: mov, tif, wrl
   • When using sequential numbering, make sure to use leading zeros to allow for multi-digit versions. For example, a sequence of 1-10 should be numbered 01-10; a sequence of 1-100 should be numbered 001-100 (001, 010, 100).
   • No special characters: & , * % # ; * ( ) ! @ $ ^ ~ ' { } [ ] ? < > -

Show examples of versions
- Can go back when you make mistakes
- See when changes are made
- Share work with other people
- Both work on things at the same time and merge back together
- Akin to a game of telephone: version control can let you see exactly when a change was made

See Git, TortoiseSVN, Mercurial, …

Before I move on – questions about anything so far?

http://orcid.org – get an identifier for free! Here’s Amanda’s ORCID page: http://orcid.org/0000-0003-2429-8879

Thanks to Jackie Wirz at OHSU for finding this perfect example!

PROJECT LEVEL: Context of data collection
Data collection methods
Structure, organization of data files
Data sources used (see citing data)
Data validation, quality assurance
Transformations of data from the raw data through analysis
Information on confidentiality, access & use conditions
DATA LEVEL: Variable names and descriptions
Explanation of codes and classification schemes used
Algorithms used to transform data
File format and software (including version) used
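
The data-level items above can be recorded in a small, machine-readable data dictionary kept next to the data file. The Python sketch below is only an illustration of that idea: the variable names, units, codes, and the two helper functions are hypothetical and not taken from any real dataset or tool.

```python
import csv
import io

# Hypothetical data dictionary: one entry per variable in the data file.
DATA_DICTIONARY = [
    {"variable": "site_id", "description": "Sampling site code",
     "units": "", "codes": "A=north; B=south"},
    {"variable": "temp_c", "description": "Water temperature",
     "units": "degrees Celsius", "codes": ""},
    {"variable": "sal_psu", "description": "Salinity",
     "units": "PSU", "codes": "-999=missing"},
]

FIELDS = ["variable", "description", "units", "codes"]


def dictionary_to_csv(entries):
    """Return the data dictionary as CSV text, ready to save next to the data."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(header, entries):
    """List columns in a data file header that the dictionary does not describe."""
    documented = {entry["variable"] for entry in entries}
    return [name for name in header if name not in documented]


if __name__ == "__main__":
    print(dictionary_to_csv(DATA_DICTIONARY))
    # A column added to the data but never documented shows up here.
    print(undocumented_columns(["site_id", "temp_c", "depth_m"], DATA_DICTIONARY))
```

Checking the data file's header against the dictionary whenever the data change is one simple way to keep the documentation from drifting out of date.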
[add content about licensing]
Intellectual Property
• Office for Commercialization & Corporate Development (OCCD)
• Copyright
• Licensing
• Charging for data?
• Data attribution & citation
HUMAN SUBJECTS? Informed consent & anonymization required prior to publishing
Resources @ OSU: Office of Research Integrity, Institutional Review Board (IRB), Responsible Conduct of Research (RCR) Program

“Learn it. Know it. Live it.”, Brad Hamilton, from Fast Times at Ridgemont High.

How to share data:
• Repository: discipline-specific, journal, etc.
   • Disciplinary Repositories
      • DataBib: http://databib.lib.purdue.edu/
      • Open Access Directory: http://oad.simmons.edu/oadwiki/Data_repositories
   • OSU's digital repository, ScholarsArchive@OSU
• Personal web site
• By request
Best practices:
• Well documented, w/standard metadata schema
• Non-proprietary file formats

How to share data:
• Repository: discipline-specific, journal, etc.
   • OSU's digital repository, ScholarsArchive@OSU
      • http://ir.library.oregonstate.edu
• Personal web site
• By request
Best practices:
• Well documented, w/standard metadata schema
• Non-proprietary file formats
There are storage options for various disciplines, including discipline-related repositories. Some scientific journals also have their own data repositories. OSU has an institutional repository, Scholars Archive, that serves as a repository for University scholarship. Currently we store data sets on a case-by-case basis. If you don’t know where to store your data, you may consult with the library or talk to people in your department and/or your field to see if there is a logical repository for your work.

What is a data paper? http://guides.library.oregonstate.edu/data-management-data-papers-journals
“Data papers facilitate the sharing of data in a standardized framework that provides value, impact, and recognition for authors. Data papers also provide much more thorough context and description than datasets that are simply deposited to a repository (which may have very minimal metadata requirements).”
“Data papers thoroughly describe datasets, and do not usually include any interpretation or discussion (an exception may be discussion of different methods to collect the data, e.g.).”

Your data is being managed whether you realize it or not, but in order to preserve data and make the most of it, researchers should actively plan their data management process. A data management plan is a document that provides a clear roadmap of the process of collecting, storing, sharing and preserving your data. Starting early in the research process not only makes data management easier for the researcher; many repositories also have requirements about how the data are formatted, and it may not be possible to meet those requirements retroactively.
[sample DMP text from: http://libguides.unm.edu/loader.php?type=d&id=194916]

These are the five sections for the generic NSF data management plan, but they cover what is generally discussed in any good data management plan.
Things to consider:
• How much data?
• Resources needed
   • supplies and services
   • costs associated with them
• Roles & responsibilities
• Metadata
   • project level (documentation)
   • data level (description)
• Data formats
• Data storage & backup
   • working data vs. archive vs. shared/discoverable
• Ethics & consent
• Sharing

• Create ready to use plans
• Meet agency requirements
• Step-by-step guidance
• OSU-specific guidance

Please feel free to contact Amanda Whitmire or Maura Valentino with any data management questions you might have, or with questions about depositing your research to the ScholarsArchive@OSU digital repository, especially research data associated with published research articles and/or your thesis or dissertation.

As humans, we aren’t designed to interpret columns of numbers, or to be able to extract meaning from a table with a passing glance. As researchers and practitioners, we need to transform our data into something more meaningful.

We need to be the ambassadors of our data, to represent it in a way that speaks to why we made the effort to collect it and what we expect to learn from it. Sadly, the large majority of researchers never receive any training in how to share and communicate about their data, even to their peers.

Get the whole lifecycle graphic here: “Data management throughout the research lifecycle. Amanda Whitmire. figshare. http://dx.doi.org/10.6084/m9.figshare.774628”

OSU Research Data Services can help with data management during most of the research lifecycle phases. Examples include:
• Data management plan consultation
• Finding & accessing secondary data
• Data description & metadata
• Ways to share your data within your project team
• How to prepare data for archiving and where to put it
• How to license your datasets
• How to cite your datasets (& help others cite them properly)

Raw data could be:
• Sensor output
• Audio visual recording
• Lab/field notes
• GPS coordinates
• Survey results
• Interview notes
Characteristics of raw data:
• Raw units (volts, e.g.)
• No [extra] metadata
• No context
How to manage raw data:
• Multiple copies in multiple locations
• DON'T modify them

Characteristics of intermediate data:
• Data that are in-process
• Anything between ‘raw’ and ‘final’
• Multiple stages, multiple formats, using various software
• May be shareable w/colleagues, but not likely w/wider audience
Managing intermediate data:
• Multiple copies (of most recent versions) in multiple locations
• Use sensible folder hierarchies & file naming conventions
• Add metadata immediately & then additionally as you make changes

Characteristics of final data:
• May be a subset of raw data, or a much larger dataset
• Shareable with anyone
• Non-proprietary formats are ideal
• Metadata are thorough & complete
Managing final data:
• Multiple copies in multiple locations
• Refresh data every 2-5 years
• If you use a proprietary format, migrate data to new version regularly
• Share via an open repository
[image copyright: http://www.flickr.com/photos/usnavy/]

Image credit: Stock X.chnge

Image credit: OSU media services

For example, in climate science, HUGE amounts of data are compiled: http://www.realclimate.org/index.php/data-sources/
In the example image, hydrographic data from all over the world are compiled into a database, and new variables can be derived from them. For example, depth profiles of seawater density can be derived from measurements of pressure, temperature, and salinity (via conductivity measurements).
Image credit: http://cdiac.ornl.gov/oceans/RepeatSections/CDIAC_CLIVAR_global_map.jpg

Image credit: http://media.oregonlive.com/pacific-northwest-news/photo/gs61tsun103jpg-730687c0e8a35305.jpg
This model of tsunami driftage is based on ocean topography and surface winds. See: http://iprc.soest.hawaii.edu/research/research.php, research by N. Maximenko.

Image credit: http://upload.wikimedia.org/wikipedia/commons/6/66/1930_census_Olsen_Schmidt.gif

Anticipate:
• Volume/File type(s)
• Raw data vs. processed/analyzed data
• Versioning
• File Naming Conventions
• Privacy Concerns
• Storage practice
• Backup plans; LOCKSS; checksums
Storage needs will depend on the volume and type of data in question. A plan should anticipate the particular storage needs of your research. How much data do you expect to collect, and how will storage infrastructure be funded? How will files be arranged on disk? This may seem like a trivial issue, but it’s important. If the person who has been responsible for maintaining files leaves, you need to be able to find things. You’ll also want to group files meaningfully with their related metadata. Data are, of course, subject to loss and corruption. (Bonus joke: there are two types of computer users… those who have suffered hard drive failure and those who haven’t suffered hard drive failure…yet). Will there be multiple copies, on-site and off-site? Are there checks against file corruption?

Lesson: offsite, redundant storage is strongly encouraged.
• 1 on your local workstation
• 1 local/removable, such as external hard drive
• 1 on central server and/or
• 1 remote, such as on a cloud server*
*Depending on the type of data, as cloud servers are not always secure

Following the Chile earthquake in 2010 (magnitude 8.8), a tsunami devastated the coastal town of Dichato, and moved the Universidad de Concepcion’s research vessel 1 km inland. The marine research station was left standing, but everything in it was destroyed (including our samples). Think it can’t happen to you? Yeah, I thought the same thing.
[photo credit: http://www.ocean-partners.org/attachments/473_Being%20towed.JPG]

~back to Amanda~
NOTE: 15 October 2013: the software upgrade in this classroom to Office 2013 introduced a bug and subsequent lack of functionality for Colectica. Likewise, DataUp is currently throwing a bug for unknown reasons. I am in contact with both entities, but the bugs could not be fixed by today. Our apologies!
DataUp is “an open source tool helping researchers document, manage, and archive their tabular data, DataUp operates within the scientist's workflow and integrates with Microsoft® Excel.” http://dataup.cdlib.org/ (EML metadata)
Colectica for Excel allows documenting of Variables, Code Lists, and Data Sets directly from within Microsoft Excel. Export your data documentation to an XML file in the DDI metadata format, the standard for data documentation. Open and edit it from Colectica Designer, Colectica Express, or other DDI applications. http://www.colectica.com/software/colecticaforexcel (DDI metadata)
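
The backup notes earlier recommend checking backups regularly and mention checksums as a guard against file corruption. A minimal Python sketch of such a check follows, assuming a backup that simply mirrors the folder layout of the working data. The function names and directory layout are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Compare every file under source_dir with its copy under backup_dir.

    Returns a sorted list of relative paths that are missing from the
    backup or whose contents differ from the source.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for source_file in source_dir.rglob("*"):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file():
            problems.append(relative.as_posix())  # missing from the backup
        elif sha256_of(source_file) != sha256_of(backup_file):
            problems.append(relative.as_posix())  # contents differ
    return sorted(problems)
```

An empty list from `verify_backup` only shows that the copies match byte for byte at the time of the check. It does not replace a periodic test restore, which is what the backup story above was missing.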