Storing and Organizing Data
Informatics I101
February 18, 2004
John C. Paolillo

Storing Data
• Encoding: fixed or variable width
• Memory
• Storage medium:
  – Magnetic: tape, disk, hard disk
  – Optical: CD, DVD, etc.
  – Silicon: Programmable Read Only Memory (PROM), Erasable PROM, etc.

Compact Disk Recording
[Figure labels: LED, light beam, lens, lens]
Data groove, etched in surface of plastic, has a slight “wobble” that helps locate the data

The Recording Process
Light beam: pulses to record on and off states, steady for reading
Crystalline metal alloy recording surface
Pits of amorphous solid left when metal re-cools
[Figure dimension labels: 1.6 µm, 0.74 µm, 0.32 µm]

CD Media States
• Crystalline: bright, reflects light well
  – “off” state
• Amorphous: dark, scatters light
  – “on” state
• Micro-crystalline: reflects light, but not brightly
  – “erased” state (= “off”)

How Erasing Takes Place

Writing isn’t perfect
The center pits (dots) are partly erased by the heat from writing the nearby longer pits (dashes), which were written later.

Reference
van Houten, Henk, and Wouter Leibbrandt. 2000. “Phase change recording.” Communications of the ACM 43(11): 64–71. http://www.acm.org/dl

Storing Data
• Encoding: we may need to change from one encoding to another
  – Task of the device driver
  – Gives us a stream of bits
• Medium: different media require different treatment of the data for storage
  – Task of the device hardware itself
  – Gives us a stream of bits read/write-able by the device
But how do we find the data later?

Data Organization
• Index for the data
  – File names, extensions
  – Metadata (date, program that uses it, etc.)
  – Directory structures
• All data storage systems use some kind of data organization
  – The principles of data organization are the same no matter what the data or where it is organized

When Organization is Critical
• National Center for Biotechnology Information (NCBI) GenBank:
  – 28 billion DNA base pairs (A, C, G, T)
  – 22 million sequences (possible genes)
This is a lot of data to manage.
At NCBI, it has been indexed with many kinds of metadata and integrated with information from scientific publications, so the overall enterprise is larger still.

Other Similar Applications
• NASA Mars and other missions
  – http://photojournal.jpl.nasa.gov/index.html
• The National Virtual Observatory
  – http://www.us-vo.org/
• Centers for Disease Control
  – http://www.cdc.gov/
• Homeland Security

Data and Metadata
Data: any object of interest which can be characterized and encoded in digital form
Metadata: data about data, that is, data used to help index and locate data of interest in some application

Data Organization Schemes
• Hierarchical
  – Data organized into object hierarchies for easy access
  – Metadata is in the tree structure of the hierarchies
  – XML Databases
• Network
  – Objects link to some selected other objects
  – Metadata is embedded in the data
  – The World-Wide Web
• Relational
  – Data organized into relations
  – Metadata is in the structure of the relations
  – Most Database Management Systems (DBMSs)

Relations
Metadata:  Actor          Movie              Date
Data:      Meryl Streep   The Hours          Summer 2003
           Johnny Depp    Dead Man           Summer 1994
           Meg Ryan       Against the Ropes  Winter 2004
           ...            ...                ...
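
The relation above can be sketched in code to show where the metadata lives. The following is a minimal illustration, not part of the original slides: it uses Python's built-in sqlite3 module, and the table name `appearances` and the lowercase column names are assumptions chosen for the example. The column names (metadata) are read back from the structure of the relation, and one of them is then used to locate a piece of data.

```python
import sqlite3

# In-memory database; the table name "appearances" is a hypothetical choice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE appearances (actor TEXT, movie TEXT, date TEXT)")

# Data: the rows of the relation, taken from the slide.
rows = [
    ("Meryl Streep", "The Hours", "Summer 2003"),
    ("Johnny Depp", "Dead Man", "Summer 1994"),
    ("Meg Ryan", "Against the Ropes", "Winter 2004"),
]
conn.executemany("INSERT INTO appearances VALUES (?, ?, ?)", rows)

# Metadata: the column names come from the structure of the relation itself.
cursor = conn.execute("SELECT * FROM appearances")
columns = [description[0] for description in cursor.description]

# Use the metadata (the "actor" column) to locate data of interest.
movie = conn.execute(
    "SELECT movie FROM appearances WHERE actor = ?", ("Meg Ryan",)
).fetchone()[0]

print(columns)
print(movie)
```

Note that the query never mentions where the row is physically stored; the column names alone are enough to find it, which is the sense in which the metadata is "in the structure of the relations".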