Messy Data Metadata

• Research lab data is often transient and never gets out to anyone; much of it is of no value beyond the lab. It has three audiences: 1) mine, 2) collaborators, 3) the public, and the quantity of data shrinks at each tier. The level of curation should be proportional to the data's value: public data is more curated than local lab data.

• In an instrument-based research lab, a person collecting for reason X may want to collect for both X and Y, because someone (including themselves) may be interested in Y later.

• For data drawn in from sensors, interpretation and assessment of quality depend on the instrument and how it was deployed. That context needs to be captured on the spot, because use is temporally distant from generation and the data is ephemeral. Sample at high frequency when something interesting happens (a real-time attribute).

• Would re-creating the experiment be easier than trying to understand how someone else created the data?

• Curation cost is an investment decision based on a sense of the data's value.
  – Sometimes a forward-looking view of the problem puts us in the right mindset (that of the data generator instead of the tool-building problem solver).
  – Curating data may not be worthwhile because five years from now it will be made obsolete by new technology (e.g., microarrays). By contrast, seismic data from the Haiti earthquake or data from Hurricane Katrina cannot be replaced, and sustainability-science data sets get more valuable as they age (particularly as collection longevity increases).

• The value of data changes as our understanding evolves.

• Tie data to process: capture high-quality process metadata.

• Layer in metadata as time goes on and interpretations accumulate (see the first sketch after these notes).

• Simple missing steps we're not doing:
  – Query a database and you don't get the version of the database back with the results. It's a simple thing, and we're not doing it ourselves (see the second sketch after these notes).
  – A small piece of semantics is better than no semantics. No self-respecting CS person would have developed tags: too unstructured.
  – 63% of biology data does not use data models; a similar share does not use controlled vocabularies.

Achievable steps for generating, using, and reusing data? Propose two models:

• Model 1, the open development model: data is in the open from the start, and it's OK if it isn't perfect. Contributors lose first-access rights, so an embargo model is needed.
  – E.g., the human genome race: the public group produced the first model and published reads of the chromosome map files; the private group used that data to do a better job.

• Model 2: at the point where a peer-reviewed paper is published, publish the data and supplementary notes. Seeing a recording of work in progress (the supplementary notes) can be useful. The paper essentially benefits the academic.

• In either model, we need a credit system for data, plus a propagation mechanism so attribution flows with the data (see the third sketch after these notes). Integration houses want to give credit, and they need it at the point of ingest.
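
First sketch: the metadata-layering point can be made concrete with an append-only record, where the raw observation stays fixed while dated interpretation layers accumulate on top of it. This is a minimal Python sketch under stated assumptions, not any particular system's API; the field names (author, notes) are illustrative.

    from dataclasses import dataclass, field
    from datetime import date

    @dataclass(frozen=True)
    class MetadataLayer:
        """One dated interpretation added on top of the original record."""
        added_on: date
        author: str
        notes: dict

    @dataclass
    class LayeredRecord:
        raw: dict                                   # original observation, never edited
        layers: list = field(default_factory=list)  # interpretations accumulate here

        def annotate(self, author: str, notes: dict) -> None:
            """Append a new interpretation without touching earlier layers."""
            self.layers.append(MetadataLayer(date.today(), author, notes))

        def current_view(self) -> dict:
            """Raw data plus all annotations, later layers overriding earlier ones."""
            view = dict(self.raw)
            for layer in self.layers:
                view.update(layer.notes)
            return view

Because layers are only appended, earlier interpretations stay recoverable even after understanding evolves.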
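
Second sketch: for the "query a database, don't get the version back" gap, the fix is small: bundle every result set with the dataset version, the exact query, and a timestamp. A minimal sketch using SQLite; VersionedResult and versioned_query are hypothetical names, and how a real system tracks db_version is an assumption.

    import sqlite3
    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class VersionedResult:
        """Query results plus the provenance needed to cite or reproduce them."""
        rows: list
        db_version: str    # version identifier of the dataset that was queried
        query: str         # the exact query that produced the rows
        retrieved_at: str  # when, since the database may change underneath us

    def versioned_query(conn: sqlite3.Connection, db_version: str, sql: str) -> VersionedResult:
        rows = conn.execute(sql).fetchall()
        return VersionedResult(rows, db_version, sql,
                               datetime.now(timezone.utc).isoformat())

Downstream analyses that keep the VersionedResult rather than the bare rows can state exactly which release of the database they saw.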
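
Third sketch: for attribution that flows with data, one possibility is a record type that carries its attribution list and merges it whenever records are combined, so an integration house receives credit information at the point of ingest. Again a hedged sketch; the Attribution fields are assumptions rather than any standard.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Attribution:
        """Who produced a record and where it came from."""
        creator: str
        source: str                  # e.g., originating lab or repository
        license: str = "unspecified"

    @dataclass
    class AttributedData:
        payload: dict
        attributions: tuple          # tuple of Attribution entries

    def merge(a: AttributedData, b: AttributedData) -> AttributedData:
        """Combine two records; attribution travels with the merged data."""
        combined = tuple(dict.fromkeys(a.attributions + b.attributions))
        return AttributedData({**a.payload, **b.payload}, combined)

With this shape, dropping attribution requires deliberately discarding a field instead of being the default outcome of integration.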