Messy Data

Metadata

1) Research lab data: often transient data that never reaches anyone beyond the lab; on its own it is of little value. The audience widens from the generator to collaborators to the public, and the quantity of data shrinks at each step. The level of curation should be proportional to the data's value: public data warrants more curation than local lab data.
2) Instrument-based research lab: a person collecting data for reason X may want to collect for reasons X and Y, because someone (including themselves) may be interested in Y later.
3) Data drawn in from sensors: interpreting sensor data and assessing its quality depend on the instrument and how it was deployed. That context must be captured on the spot, because use is temporally distant from generation and the data are ephemeral. Sample at high frequency when something interesting happens (a real-time attribute); see the sketch after this list.

Would re-creating the experiment be easier than trying to understand how someone else created the data?
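The real-time attribute in item 3 can be made concrete: sample at a low base rate, and densify around events that look interesting. A minimal sketch in Python; the rates, threshold, and simulated sensor are hypothetical illustrations, not from the discussion.

```python
import random
import time

BASE_PERIOD_S = 1.0    # hypothetical base rate: one sample per second
BURST_PERIOD_S = 0.01  # hypothetical burst rate: ~100 Hz around events
SPIKE_THRESHOLD = 5.0  # hypothetical jump that counts as "interesting"

def read_sensor() -> float:
    """Simulated instrument read; a real deployment would poll hardware."""
    spike = 30.0 if random.random() < 0.01 else 0.0
    return 20.0 + random.gauss(0.0, 1.0) + spike

def sample_adaptively(duration_s: float = 10.0):
    """Yield (timestamp, value) pairs, densifying around interesting events."""
    t_end = time.time() + duration_s
    last = read_sensor()
    while time.time() < t_end:
        value = read_sensor()
        interesting = abs(value - last) > SPIKE_THRESHOLD
        yield time.time(), value
        last = value
        # Sample at high frequency while the signal is changing rapidly.
        time.sleep(BURST_PERIOD_S if interesting else BASE_PERIOD_S)
```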
• Curation cost is an investment decision based on a sense of the data's value.
– Sometimes a forward-looking view of the problem puts us into the right mindset (that of the data generator instead of the tool-building problem solver):
– Curating data may not be worthwhile if it will be obsoleted by new technology five years from now (e.g., microarray data). By contrast, seismic data about the Haiti earthquake, or data about Hurricane Katrina, cannot be replaced. Sustainability science data sets get more valuable as they age (particularly as collection longevity increases).
• Value of data changes as our understanding evolves.
• Tie data to its generating process; capture high-quality process metadata.
• Layer in metadata as time goes on and interpretations accumulate (a sketch follows).
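One way to read the two bullets above is as a record that ties each dataset to the process that produced it and accepts new metadata layers over time. A minimal sketch, assuming entirely hypothetical field names:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProcessMetadata:
    # Hypothetical fields describing how the data came to be.
    instrument: str
    protocol: str
    software_version: str
    operator: str

@dataclass
class Dataset:
    data: list
    process: ProcessMetadata  # the data tied to its generating process
    layers: list = field(default_factory=list)  # interpretations added over time

    def annotate(self, author: str, note: str) -> None:
        """Layer in a new interpretation without disturbing earlier ones."""
        self.layers.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "author": author,
            "note": note,
        })
```

Each later interpretation is appended rather than overwritten, so the metadata accumulates alongside the data.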
• Simple missing steps we're not taking.
– When you query a database, you don't get the version of the database back with the results. It's a simple thing, and we're not doing it ourselves (see the first sketch after this list).
– A small piece of semantics is better than no semantics (see the second sketch after this list). No self-respecting CS person would have designed tags: too unstructured.
– 63% of biology data does not use data models; a similar fraction does not use controlled vocabularies.
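The "simple thing" above could be as small as a query wrapper that returns the database's version alongside the results, so any result set can name the snapshot it came from. A minimal sketch using SQLite as a stand-in; the one-row db_version table is a hypothetical convention, not an existing API.

```python
import sqlite3

def versioned_query(conn: sqlite3.Connection, sql: str, params=()):
    """Run a query and return (db_version, rows), not rows alone.

    Assumes a one-row `db_version` table bumped on every data release;
    that convention is a hypothetical illustration.
    """
    version = conn.execute("SELECT version FROM db_version").fetchone()[0]
    rows = conn.execute(sql, params).fetchall()
    return version, rows
```

And a small piece of semantics can be as little as checking free-text tags against a controlled vocabulary before they enter the record. Another minimal sketch; the vocabulary terms are made up for illustration.

```python
# Hypothetical controlled vocabulary for a tissue-type field.
CONTROLLED_VOCAB = {"liver", "kidney", "cortex", "plasma"}

def normalize_tag(tag: str) -> str:
    """Accept a tag only if it maps onto a controlled term."""
    term = tag.strip().lower()
    if term not in CONTROLLED_VOCAB:
        raise ValueError(f"{tag!r} is not in the controlled vocabulary")
    return term
```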
Achievable steps for generating, using, and reusing data? Two proposed models
• Model 1: the open development model: data are in the open from the start. It is OK if they are not perfect. Generators will lose first-access rights, so an embargo model is needed.
– e.g., the human genome race: the public group produced the first model and published reads of the chromosome map files. The private group then used that data to do a better job.
• Model 2: at the point where the peer-reviewed paper is published, publish the data and supplementary notes. Seeing a record of work in progress (the supplementary notes) can be useful. The paper itself essentially benefits the academic.
• In either model, we need a credit system for data, and a propagation mechanism so attribution flows with the data (a sketch follows). Integration houses want to give credit, and they need it at the point of ingest.
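A minimal sketch of attribution flowing with the data: every derived dataset carries the credit list of all of its inputs, so an integration house can read the credits at the point of ingest. The structure and names are hypothetical, not a proposal from the discussion.

```python
from dataclasses import dataclass

@dataclass
class AttributedData:
    values: list
    credits: tuple  # who produced this data and everything upstream of it

def derive(inputs, transform, new_credit: str) -> AttributedData:
    """Build a derived dataset whose credits include all upstream credits."""
    merged = []
    for src in inputs:
        for c in src.credits:
            if c not in merged:
                merged.append(c)  # attribution propagates with the data
    merged.append(new_credit)
    return AttributedData(values=transform([s.values for s in inputs]),
                          credits=tuple(merged))

# Usage: merging two labs' data preserves both labs' credit.
a = AttributedData([1, 2], credits=("lab-A",))
b = AttributedData([3], credits=("lab-B",))
merged = derive([a, b], lambda vs: sorted(v for vals in vs for v in vals),
                "integrator-X")
# merged.credits == ("lab-A", "lab-B", "integrator-X")
```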