Scientific Data Management
Craig A. Stewart
stewart@iu.edu
University Information Technology Services
Indiana University
Copyright 2006 – All rights reserved
12 March 2006
License terms
• Please cite as: Stewart, C.A. 2006. Scientific Data Management. Tutorial presented
at PittCon 2006, 12-17 March 2006. Orlando, FL. http://hdl.handle.net/2022/13993
• Some figures shown here are taken from the web, under an interpretation of fair use
that seemed reasonable at the time and within reasonable readings of copyright
interpretations. Such diagrams are indicated here with a source url. In several
cases these web sites are no longer available, so the diagrams are included here
for historical value. Except where otherwise noted, by inclusion of a source url or
some other note, the contents of this presentation are © by the Trustees of Indiana
University. This content is released under the Creative Commons Attribution 3.0
Unported license (http://creativecommons.org/licenses/by/3.0/). This license
includes the following terms: You are free to share – to copy, distribute and transmit
the work and to remix – to adapt the work under the following conditions: attribution
– you must attribute the work in the manner specified by the author or licensor (but
not in any way that suggests that they endorse you or your use of the work). For
any reuse or distribution, you must make clear to others the license terms of this
work.
13 June 2002
2
Goals
• Statistics:
– Provide a good, practical introduction to statistical
concepts
– Provide hands-on experience doing statistical analysis with
MS Excel
• Data Management
– Explain the key problems, and the concepts and
nomenclature surrounding the problems of scientific data
management, ranging from hardware to policy
• Focus will be on solutions that can be implemented at the
level of individual laboratories or laboratory working groups.
This class will cast a relatively wide net, and provide many
references for your use after the tutorial is over.
Key issues in scientific data
management
• Starting point: you have data in an output file.
• What does it mean?
• How do I store it while I want to keep it? (And how do I get rid
of it reliably when I want to?)
• How should it be organized and accessed electronically so
that it can easily be used now and in the future?
• When and how do you need to comply with HIPAA and
21 CFR Part 11?
What does a data set mean?
Hwæt! We Gardena in geardagum,
þeodcyninga, þrym gefrunon,
hu ða æþelingas ellen fremedon.
Oft Scyld Scefing sceaþena þreatum…
What does a data set mean?
Exp_2_2_feb_14_1981
30 0
30 0
0.0 139.5 000.0
.
.
.
.
.
.
.
.
.
5.0 142.
.
.
.
.
.
.
.
.
.
.
0.0060
.0053
.0057
.0060
.0055
.5760
.5707
.5696
.5718
.5755
.0045
.0047
.0045
.0045
.0045
.4821
.4821
.4847
.4857
.4879
0.02123
.02123
.02123
.02123
.02123
.43607
.43247
.43161
.43325
.43450
.02169
.02169
.02167
.02167
.02164
.36409
.36512
.36733
.36851
.37028
-20.48
-20.48
-20.47
-20.44
-20.46
0.00
0.00
0.00
0.00
0.00
1.38
1.39
1.38
1.41
1.41
5.45
5.46
5.46
5.46
5.46
098.4571
98.4557
98.4536
98.4533
98.4557
98.4396
98.4319
98.4350
98.4305
98.4305
98.8949
98.8938
98.8952
98.8942
98.8942
98.9020
98.9020
98.8991
98.8960
98.8949
26.2
408.03
408.03
408.03
408.83
409.16
26.4
412.24
412.18
412.01
411.78
411.78
What does a data set mean?
1 m 1  99 1 210
2 F 2 320 2 420
3 F 2 195 2 350
4 M 1 110 1 215
5 M 2 218 2 364
6 F 3 120 1 355
7 M 3 125 1 355
Some things to think about
• 25 years ago data was stored on punched tape or punched
cards
• How would you get data off an old AppleII+ diskette? How
about one of those high-density 5 ¼” DOS diskettes?
• The backup tape in the sock drawer (especially if it’s a VMS
backup tape of an SPSS-VMS data file)
• The no-longer-easily-handled data file on a CD (e.g. 1990
Census data)
• Data is essentially irreproducible more than a short period of
time after the fact
Outline
• The problem – the data deluge
• Introduction to statistical concepts
• Data management strategies
• Hands-on exercise using MS Excel (or SPSS) to do
statistics
• Lunch break
• Physical storage of data: RAID, tapes, CDs, etc.
• Data security, backup, and legal issues
• Data management strategies, part 2
• Some thoughts about the future and future technologies
The problem – the Data Deluge
Bits, Bytes, and the proof that CDs
have consciousness
• A bit is the basic unit of storage, and is always either a 1 or a
0.
• 8 bits make a byte, the smallest usual unit of storage in a
computer.
• MegaByte (MB) - 1,048,576 bytes (A CD-ROM holds ~ 600
MBs)
• GigaByte (GB) – ~ 1 billion bytes
• TeraByte (TB) - ~ 1 trillion bytes (a large library might have ~1
TB of data in printed material)
• PetaByte (PB) – 1 thousand TBs
• ExaByte (EB) – 1 thousand PBs
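The unit ladder above can be checked with a few lines of arithmetic. This is a minimal sketch only: the helper name `describe_size` is hypothetical, and it assumes binary (power-of-1024) prefixes throughout, whereas the slide mixes the exact binary megabyte with rounded decimal figures for the larger units.

```python
# Binary storage units, following the slide's definition of 1 MB = 1,048,576 bytes.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB"]

def describe_size(n_bytes):
    """Return a human-readable string such as '600.0 MB' (hypothetical helper)."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit in which the value is below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024

# A CD-ROM of roughly 600 MB, expressed in bytes and in bits.
cd_bytes = 600 * 1024**2
cd_bits = cd_bytes * 8
print(describe_size(cd_bytes))
print(cd_bits)
```

With decimal prefixes (powers of 1000) the same CD would be reported as about 629 MB, which is one reason to state the convention in the data dictionary.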
Explosion of data and need to retain it
• Science historically has struggled to acquire data;
computing was largely used to simulate systems without much
underlying data
• Lots of data:
– Lots of data available “out there”
– Dramatically accelerating ability to produce new data
• One of the key challenges, and one of the key uses of
computing, is now to make sense out of data now so easily
produced
• Need to preserve availability of data for ???
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Accelerating ability to produce new data
• Diffractometer – 1 TB/year
• Synchrotron – 60 GB/day
bursts
• Gene expression chip
readers – 360 GB/day
• Human Genome – 3
GB/person
• High-energy physics – 1 PB
per year
http://www.gene-chips.com/sample1.jpg
http://atlasinfo.cern.ch/Atlas/Welcome.html
Probability & Statistics
• Basic Terminology – Chapter 3 in book
• A bit about graphs (p 75)
• Means, medians, distributions
• Probability – Chapter 6
• Sample variability – Chapter 9
• Confidence limits – Chapter 10
• Hypothesis testing – t-test
Data Management Strategies
Data management strategies, Part I
• Flat files
• Spreadsheets
• Statistical software
Flat files
• Nothing beats an ASCII flat file for simplicity
• ASCII files are not typically used for data storage by
commercial software because proprietary formats can be
accessed more quickly
• If you want a reliable way to store data that you will be able to
retrieve later reliably (media issues notwithstanding), an
ASCII flat file is a good choice.
Data Management Strategies: Flat files,
II
• IF you use an ASCII flat file for simple long-term storage, be
sure that:
– The file name is self-explanatory
– There is no information embedded in the file name that is
not also embedded in the file
– Each individual data file includes a complete data
dictionary, explanation of the instrument model and
experimental conditions, and explanation of the fields
– The data are laid out in accordance with First, Second, and
Third Normal Forms as much as is possible (more on
these terms later)
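One way to keep the data dictionary inside the file is to write it as comment lines ahead of the data. The sketch below is illustrative, not a standard: the `#` comment convention, the field descriptions, and the function names are all assumptions.

```python
import csv
import io

# Hypothetical data dictionary for a small subject data set; the field
# meanings are assumptions for illustration, not from a real instrument.
DICTIONARY = [
    ("id", "Subject ID number"),
    ("gender", "Subject gender (m/f)"),
    ("glucose", "Blood glucose level"),
]
ROWS = [(1, "m", 99), (2, "f", 320), (3, "f", 195)]

def write_flat_file(handle, dictionary, rows, conditions):
    """Write comment lines describing the data, then the data as CSV."""
    handle.write(f"# Conditions: {conditions}\n")
    for name, meaning in dictionary:
        handle.write(f"# Field {name}: {meaning}\n")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow([name for name, _ in dictionary])
    writer.writerows(rows)

def read_flat_file(handle):
    """Return (header comments, list of row dicts) from a file written above."""
    lines = handle.read().splitlines()
    comments = [ln[2:] for ln in lines if ln.startswith("# ")]
    data = [ln for ln in lines if not ln.startswith("#")]
    return comments, list(csv.DictReader(data))

# Round trip through an in-memory file so the sketch touches no disk.
buffer = io.StringIO()
write_flat_file(buffer, DICTIONARY, ROWS, "example run, instrument model unknown")
buffer.seek(0)
comments, records = read_flat_file(buffer)
print(comments[1])
print(records[0])
```

Because everything needed to interpret the numbers travels in the same file, nothing depends on the file name or on a separate notebook entry.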
Data dictionary
• Definition from webopedia.com:
– In database management systems, a file that defines the
basic organization of a database. A data dictionary
contains a list of all files in the database, the number of
records in each file, and the names and types of each
field. …
• More generally:
– A data dictionary is what you (or someone else) will need
to make sense of the data more than a few days after the
experiment is run
Spreadsheet Software as a data
management tool
• Microsoft’s Excel may suffice for many data management
needs (it is NOT FDA CFR Part 11 compliant!)
• If any given data set can be described in a 2D spreadsheet
with up to hundreds of rows and columns, and if there is
relatively little need to work across data sets, then Excel might
do the trick for you
• If it will work, then why not?
Spreadsheet software as a data
management tool, con’t
• Designed originally to be electronic accountant ledgers
• Feature creep in some ways has helped those who have
moderate amounts of data to manage
• There are several options, including Open Source products
such as Gnumeric and nearly open source products such as
StarOffice (see www.openoffice.org)
• Since MS Excel is the most commonly used spreadsheet
package, this discussion will focus on MS Excel
The MS Excel Data menu
• Sort: Ascending or descending sorts on multiple columns
• Lists: Allow you to specify a list (use only one list per
spreadsheet) and then perform filters, selecting only those
rows that meet certain criteria (probably more useful for mailing
lists than scientific data management)
• Validation: lets you check for typos, data translation errors,
etc. by searching for out of bounds data
• Consolidate
• Group and outline
• PivotTable
• Get external data
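The idea behind the Validation item, searching for out-of-bounds data, is independent of Excel. A minimal sketch follows, assuming hypothetical plausibility limits; real limits depend on the instrument and the study.

```python
# Hypothetical plausibility limits; real limits depend on the instrument and study.
LIMITS = {"glucose": (40, 500), "reactime": (100, 1000)}

def find_out_of_bounds(records, limits):
    """Return (row index, field, value) for every value outside its limits."""
    problems = []
    for index, record in enumerate(records):
        for field, (low, high) in limits.items():
            value = record[field]
            if not low <= value <= high:
                problems.append((index, field, value))
    return problems

records = [
    {"glucose": 99, "reactime": 210},
    {"glucose": 3200, "reactime": 420},   # likely a typo for 320
    {"glucose": 195, "reactime": 35},     # likely a dropped digit
]
print(find_out_of_bounds(records, LIMITS))
```

A range check like this catches gross entry errors only; a value that is wrong but plausible passes unnoticed.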
MS Excel Statistics
• Mean, standard deviation, confidence intervals, etc. up to the
t-test are available as standard functions within MS Excel
• One-way ANOVA and more complex statistical routines are
available in the Analysis ToolPak add-in
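For readers who want to check a spreadsheet result by hand, the same basic quantities can be computed outside Excel. The sketch below uses glucose values from the tutorial's small sample data set, split by gender; the choice of Welch's unequal-variance form of the t statistic is an assumption, and a p-value would still require a t table or a statistics package.

```python
import math
import statistics

# Glucose values from the tutorial's sample data set, split by gender.
male = [99, 110, 218, 125]
female = [320, 195, 120]

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    var_a = statistics.variance(a) / len(a)   # sample variance over n
    var_b = statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)

print(statistics.mean(male), statistics.stdev(male))
print(statistics.mean(female), statistics.stdev(female))
print(welch_t(male, female))
```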
HIGHLY RECOMMENDED
• “Excel Data Analysis” Jinjer Simon, Wiley Publishing, Inc.
• Comes with a set of excellent software tools for doing data
analysis with Excel
• If you think there is any chance that you can manage your
data with Excel, buy the book, and then buy licenses for the
software that comes with it.
MS Excel Graphics
• Does certain things quite easily
• If it doesn’t do what you want it to do easily – it probably won’t
do it at all
• Constraints on the way data are laid out in the spreadsheet
are often an issue
Statistical Software as a data
management tool
• SPSS and SAS are the two leading packages
• Both have ‘spreadsheet-like’ data entry or editing interfaces
• Both have been around a long time, and are likely to remain
around for a good while
• Workstation and mainframe versions of both available
What’s wrong with this program?
DATA LIST FILE=sample.dat
/id 1 v1 3 (A) v2 5 v3 7-9 v4 11 v5 13-15
LIST VARIABLES v1 v2 v3
ONEWAY v3 BY v2 (1,3)
REGRESSION
/DEPENDENT=v5
/METHOD=ENTER v3
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
4 m 1 110 1 215
5 m 2 218 2 364
6 f 3 120 1 355
7 m 3 125 1 335
Better….
DATA LIST FILE=sample.dat
/id 1 gender 3 (A) weight 5 glucose 7-9 bp 11 reactime 13-15
LIST VARIABLES gender weight glucose
ONEWAY glucose BY weight (1,3)
REGRESSION
/DEPENDENT=reactime
/METHOD=ENTER glucose
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
4 m 1 110 1 215
5 m 2 218 2 364
6 f 3 120 1 355
7 m 3 125 1 335
Now you have a fighting chance
DATA LIST FILE=sample.dat
/id 1 gender 3 (A) weight 5 glucose 7-9 bp 11 reactime 13-15
VARIABLE LABELS ID 'Subject ID #' GENDER 'Subject Gender'
WEIGHT 'Subject Weight in pounds' GLUCOSE 'Blood glucose level'
BP 'Blood Pressure' REACTIME 'Reaction Time in Minutes'
VALUE LABELS GENDER 'm' 'Male' 'f' 'Female'
LIST VARIABLES gender weight glucose
ONEWAY glucose BY weight (1,3)
REGRESSION
/DEPENDENT=reactime
/METHOD=ENTER glucose
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
.
An example SAS program
/* Computer Anxiety in Middle School Children */
/* The following procedure specifies value labels for variables */
PROC FORMAT;
VALUE $sex 'M'='Male'
'F'='Female';
VALUE exp 1='upto 1 year' 2='2-3 yrs' 3='3+ yrs';
VALUE school 1='rural' 2='city' 3='suburban';
DATA anxiety;
INFILE clas;
INPUT ID 1-2 SEX $ 3 (EXP SCHOOL) (1.) (C1-C10) (1.)
(M1-M10) (1.) MATHSCOR 26-27 COMPSCOR 28-29;
FORMAT SEX $SEX.; FORMAT EXP EXP.; FORMAT SCHOOL SCHOOL.;
/* conditional transformation */
IF MATHSCOR=99 THEN MATHSCOR=.;
IF COMPSCOR=99 THEN COMPSCOR=.;
/* Recoding variables. Several items are to be reversed while scoring. */
/* The Likert type questionnaire had a choice range of 1-5 */
C3=6-C3; C5=6-C5; C6=6-C6; C10=6-C10;
M3=6-M3; M7=6-M7; M8=6-M8; M9=6-M9;
COMPOPI = SUM (OF C1-C10) /*FIND SUM OF 10 ITEMS USING SUM FUNCTION */;
MATHATTI = M1+M2+M3+M4+M5+M6+M7+M8+M9+M10 /*ADDING ITEM BY ITEM */;
/* Labeling variables */
LABEL ID='STUDENT IDENTIFICATION' SEX='STUDENT GENDER'
EXP='YRS OF COMP EXPERIENCE' SCHOOL='SCHOOL REPRESENTING'
MATHSCOR='SCORE IN MATHEMATICS' COMPSCOR='SCORE IN COMPUTER SCIENCE'
COMPOPI='TOTAL FOR COMP SURVEY' MATHATTI='TOTAL FOR MATH ATTI SCALE';
SAS example, Part 2
/* Printing data set by choosing specific variables */
PROC PRINT;
VAR ID EXP SCHOOL MATHSCOR COMPSCOR COMPOPI MATHATTI;
TITLE 'LISTING OF THE VARIABLES';
/* Creating frequency tables */
PROC FREQ DATA=ANXIETY;
TABLES SEX EXP SCHOOL;
TABLES (EXP SCHOOL)*SEX;
TITLE 'FREQUENCY COUNT';
/* Getting means */
PROC MEANS DATA=ANXIETY;
VAR COMPOPI MATHATTI MATHSCOR COMPSCOR;
TITLE 'DESCRIPTIVE STATISTICS FOR CONTINUOUS VARIABLES';
RUN;
/* Please refer to the following URL for further information */
/* http://www.indiana.edu/~statmath/stat/sas/unix/index.html */
An example SPSS program
TITLE 'COMPUTER ANXIETY IN MIDDLE SCHOOL CHILDREN'
DATA LIST FILE=clas.dat
/ID 1-2 SEX 3 (A) EXP 4 SCHOOL 5 C1 TO C10 6-15 M1 TO M10 16-25
MATHSCOR 26-27 COMPSCOR 28-29
MISSING VALUES MATHSCOR COMPSCOR (99)
RECODE C3 C5 C6 C10 M3 M7 M8 M9 (1=5) (2=4) (3=3) (4=2) (5=1)
RECODE SEX ('M'=1) ('F'=2) INTO NSEX /* Changing char var into numeric var
COMPUTE COMPOPI=SUM (C1 TO C10) /* Find sum of 10 items using SUM function
COMPUTE MATHATTI=M1+M2+M3+M4+M5+M6+M7+M8+M9+M10 /* Adding each item
VARIABLE LABELS ID 'STUDENT IDENTIFICATION' SEX 'STUDENT GENDER'
EXP 'YRS OF COMP EXPERIENCE' SCHOOL 'SCHOOL REPRESENTING'
MATHSCOR 'SCORE IN MATHEMATICS' COMPSCOR 'SCORE IN COMPUTER SCIENCE'
COMPOPI 'TOTAL FOR COMP SURVEY' MATHATTI 'TOTAL FOR MATH ATTI SCALE'
SPSS Example, Part 2
/*Adding labels
VALUE LABELS SEX 'M' 'MALE' 'F' 'FEMALE'/
EXP 1 'UPTO 1 YR' 2 '2 YEARS' 3 '3 OR MORE'/
SCHOOL 1 'RURAL' 2 'CITY' 3 'SUBURBAN'/
C1 TO C10 1 'STRONGLY DISAGREE' 2 'DISAGREE'
3 'UNDECIDED' 4 'AGREE' 5 'STRONGLY AGREE'/
M1 TO M10 1 'STRONGLY DISAGREE' 2 'DISAGREE'
3 'UNDECIDED' 4 'AGREE' 5 'STRONGLY AGREE'/
NSEX 1 'MALE' 2 'FEMALE'/
PRINT FORMATS COMPOPI MATHATTI (F2.0) /*Specifying the print format
comment Listing variables.
* listing variables.
LIST VARIABLES=SEX EXP SCHOOL MATHSCOR COMPSCOR COMPOPI MATHATTI/
FORMAT=NUMBERED /CASES=10 /* Only the first 10 cases
FREQUENCIES VARIABLES=SEX,EXP,SCHOOL/ /* Creating frequency tables
STATISTICS=ALL
USE ALL.
ANOVA COMPSCOR by EXP(1,3).
FINISH
comment Please refer to the following URL for further information
http://www.indiana.edu/~statmath/stat/spss/unix/index.html.
Keys to using Statistical Software as a data
management tool
• Be sure to make your programs and files self-defining. Use
variable labels and data labels exhaustively.
• Write out ASCII versions of your program files and data sets.
• Stat packages generally are able to produce platform-independent
‘transport’ files. Good for transport, but be wary
of them as a long-term archival format
• Statistical software is excellent when your data can be
described well without having to use relational database
techniques. If you can describe the data items as a very long
vector of numbers, you’re set!
• Statistical software is especially useful when many
transformations or calculations are required - but beware
transforms, calculations, and creation of new variables
interactively!
Hands on statistical analysis project
Physical storage of data:
CDs, DVDs, disk, tapes
Durability of media
• Stone: 40,000 years
• Ceramics: 8,000 years
• Papyrus: 5,000 years
• Parchment: 3,000 years
• Paper: 2,000 years
• Magnetic tape: 10 years (under ideal conditions; 3-5 more
conservative)
• CD-RW: 5-10 years (under ideal conditions; 1.5 years more
conservative)
• Magnetic disk: 5 years
• Even if the media survives, will the technology to read it
survive?
Data storage: media issues
• So what do you do with data on a paper tape?
• Long term data storage inevitably forces you to confront two
issues:
– the lifespan of the media
– the lifespan of the reading device
• Removable Magnetic media
– The right answer to any long-term (or even intermediate-term)
data storage problem is almost never any sort of
removable magnetic media. It’s always a race between the
lifespan of the media and the lifespan of the readers.
– Esoteric removable magnetic media are never a good
idea. Even Zip drives are probably not a good bet in the
long run. What do you do with a critical data set when your
only copy is on a Bernoulli drive?
Non-magnetic removable media
• CD – Compact Disk 650-703 MB
• CD-ROM – CD-Read Only Memory
• CD-RW – CD-Read/Write
• CD speeds: 12x2x24 (1x = 150 KB/s)
• DVD – Digital Versatile Disk 4.7 GB
• DVD-R/RW (Pioneer)
• DVD+RW (Sony/HP)
• DVD-RAM – a distant 3rd
• Don’t set any of these on your dashboard!
CD-RW diagram
http://www.pctechguide.com/09cdrrw.htm#CD-R
CDs and DVDs con’t
• For routine, reliable, reasonably dense storage of data around
the lab, you can’t beat CDs or DVDs. There is no reason today to
buy a PC without at least a CD burner, and preferably a DVD
burner.
• CD writers are commonplace & reliable
• DVD writers are newer, more costly, and more prone to format
issues.
• Always be sure to have extensive and complete information on
the CD – including everything you need to know to remember
what it really is later. There should be no data physically on the
CD that is not contained in a file burned on the CD.
• Watch out for longevity issues!!
– CD R/W – can be rewritten up to 1,000 times
– Shelf life 5-10 years
An example DVD burner
• HP DVD 630E
• 4.7 GB (up to 8.5 using H
software)
• ~$200
• Lets you burn both the data
AND a label!!
www.shopping.hp.com
Low-tech, but effective storage
• Stores 100 CDs or DVDs
• < $100 from www.skymall.com
CD & DVD Jukeboxes
• Jukeboxes are effective storage devices and the media
are standard and hand removable
• 240 disk jukebox above from
http://www.kubikjukebox.com/index.htm
• Capacity 153 GB to ~2.2 TB
Magnetic Tapes
• Tapes store data in tracks on a magnetic medium. The actual
material on the tape can become brittle and/or worn and fall
off.
• Tapes are best used in machine room environments with
controlled humidity.
• There are three situations in which tapes are the right choice:
– Within production machine rooms
– As backup media
– For transfer between machine rooms under some
circumstances
Tape formats
• There are several formats with small user bases; these should
probably be avoided.
• DAT tapes don’t last well
• For system backups of office, lab, or departmental servers,
Digital Linear Tape (DLT) is best choice
• In machine rooms, Linear Tape Open (LTO) is the best choice.
(http://www.lto-technology.com/)
• LTO is a multi-vendor standard with two variants:
– Accelis: faster, lower capacity, lower popularity
– Ultrium: 10-20 Gbps, high capacity (100 GB/tape; 200 w
compression). Excellent for write-intensive applications
3480/3490
Format  Tape Size  Tracks  Capacity (Uncompressed)  Capacity (Compressed)
3480    575'       18      210 MB                   400 MB
3490E   1100'      36      800 MB                   1600 MB
3590
Format  Tape Size  Tracks  Capacity (Uncompressed)  Capacity (Compressed)
3590B   Std        128     10 GB                    30 GB
3590B   Ext        128     20 GB                    60 GB
3590E   Std        256     20 GB                    60 GB
3590E   Ext        256     40 GB                    120 GB
3590H   Std        384     30 GB                    90 GB
3590H   Ext        384     60 GB                    180 GB
http://www.discinterchange.com/media_photos/media-3480_.html
Tape Robots
• Xcerta tape reader
– Holds 10 tapes
– 600 GB total capacity
• A nice brochure about other lab-scale
automated tape loaders is available from
www.quantum.com
http://www.tapedrives-3480to3590.com/134-04-20075/
• STK Tape Silo
– Holds thousands of tapes
– 2.4 PB total capacity
Tape conversion services
• If you are presented with data in a physical format you can’t
read, there are several services that outsource data recovery
• Do be careful of two issues that may be separate (with
separate price tags!): getting files off a tape for which you
have no reader; getting the files into a format you can read
with software you have.
• There are several of these companies. Two examples:
– Mueller Media Corporation http://www.mullermedia.com/
(capabilities include recovery in a fashion suitable for
litigation purposes)
– Legacy Engineering http://www.legacyconversions.com/
Spinning disk storage
• JBOD (Just a Bunch of Disks) – all right so long as it’s all right to
lose data now and again. High speed access, takes
advantage of relatively low cost of disk drives. Good for
temporary data parking while data awaits reduction.
• RAID (Redundant Array of Independent Disks) – what you
need if you don’t want to lose data.
• Lifecycle replacement an issue in both cases
Types of disk
• SCSI (Small Computer System Interface)
• ATA (Advanced Technology Attachment) or IDE
– Intended for “internal to server” use – 40 cm cables
– Most people mean ATA when they say IDE (Integrated Drive
Electronics). Most people also mean Parallel ATA when they
say ATA
• Enhanced IDE, a newer version of IDE developed by Western
Digital Corporation, also called ATA-2
• Serial ATA
– Evolutionary replacement for ATA
– Thinner, longer cables – 1 meter
• Fibre channel – ANSI standard for a machine room fabric
connecting disks
Disk Trends
• Capacity: doubles each year
• Transfer rate: 40% per year
• MB per $: doubles each year
• Currently – < $3,000 per GB for cheapest options
RAID*
• Level 0: Provides data striping (spreading out blocks of each
file across multiple disks) but no redundancy. This improves
performance but does not deliver fault tolerance.
• Level 1: Provides disk mirroring.
• Level 3: Same as Level 0, but also reserves one dedicated
disk for error correction data. It provides good performance
and some level of fault tolerance.
• Level 5: Provides data striping at the block level and also stripes
error correction (parity) information across the disks. This results
in excellent performance and good fault tolerance.
RAID 3
“This scheme consists of an array of HDDs for data and one
unit for parity. … The scheme generates from XOR (exclusive-or)
parity derived from bit 0 through bit 7. If any of the HDDs
fail, it restores the original data by an XOR between the
redundant bits on other HDDs and the parity HDD. With RAID
3, all HDDs operate constantly. “
http://www.studiostuff.com/ADTX/adtxwhatisraid.html
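The XOR recovery described in the quotation can be demonstrated on a few bytes. This is a toy sketch, not how a controller is implemented: the three "disks" are short byte strings and the helper name `xor_blocks` is hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# Three hypothetical data disks, each holding one small block.
data_disks = [b"ABCD", b"1234", b"wxyz"]
parity_disk = xor_blocks(data_disks)

# Simulate losing the second disk, then rebuild it from the survivors plus parity.
survivors = [data_disks[0], data_disks[2], parity_disk]
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

The rebuild works because XOR-ing a value with itself cancels it out, so combining the parity with every surviving block leaves only the missing one. The same property explains why a single parity disk cannot recover from two simultaneous failures.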
RAID 5
“RAID5 implements striping and parity. In RAID5,
the parity is dispersed and stored in all HDDs. ….
RAID5 is most commonly used in the products on
market these days.”
*http://www.studio-stuff.com/ADTX/adtxwhatisraid.html
But depending upon your paranoia level..
• RAID 5+1 and 1+5 – mirroring plus RAID 5. High performance
and really good protection against multiple failures
• If you have RAID disk arrays, that provides reliable access to
data (within a machine room) so long as you don’t lose a
disk controller
• To ensure that your data stays available (so long as you have
power), each disk array must be attached to two servers
simultaneously
NAS and SAN
• Storage Area Network (SAN) is a high-speed subnetwork of
shared storage devices. A storage device is a machine that
contains nothing but a disk or disks for storing data. A SAN's
architecture works in a way that makes all storage devices
available to all servers on a LAN or WAN.
• A network-attached storage (NAS) device is a server that is
dedicated to file sharing through some protocol such as NFS.
NAS does not provide any of the activities that a server in a
server-centric system typically provides, such as e-mail,
authentication or file management. …
• Definitions modified from www.webopedia.com
• Several vendors offer best of both worlds
Storage Bricks
• Group of hard disks inside a sealed box
• Includes spare disks
• Typically RAID 5
• When one disk fails, one of the spares is put to use
• When you’re out of spares…
• Sun seems to have originated this idea
Apple XServe RAID
http://www.apple.com/xserve/raid/
• Up to 5.6 TB
• Compatible with Mac, Windows, & Linux servers
• ~$3,000 per GB not counting cost of server
• Hot-swappable
• White papers available from www.apple.com
Other interesting media
www.pricegrabber.com/search_getprod.php/
masterid=2513477/search=pcmcia%
20hard%20drive
5 GB for less than $200!
http://www.supermediastore.com/
superflash-usb-2-flash-drive-1gb.html
1 GB for less than $100!
iPod & a USB cable ~ 20 GB HD
www.detnews.com/pix/2004/08/28/ipod.jpg
Hierarchical Storage Management
Systems
• Differential cost of media
– RAM: $60-$100/MB
– RAID: $4-$10/MB
– CD: ~$1/MB (readers included)
– Tape: $0.05-$1/MB
• Differential read rates and access times:
– Disk: 1 GB/sec; 9-20 ms access time
– Tape: 200 MB/sec; <1 min (autoloader)
Hierarchical Storage Management
• The objective of an HSM is to optimize the distribution of data
between disk and tape so as to store extremely large amounts
of data at reasonably economical costs while keeping track of
everything
• Most data is read rarely. Tape is cheap. Keep rarely read data on
tape.
• Keep data that is often used on disk.
• Stage data to disk on command for faster access when you know
you’re going to need it later.
• Stage data to disk in output.
• Manage data on tape so as to handle security and reliability.
• Metadata system keeps track of what everything is and where it is!
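The core placement decision of an HSM (recently used data on disk, rarely read data on tape, with a catalog tracking both) can be sketched in a few lines. This is not how any named product works; the 90-day threshold, the file names, and the function name are assumptions for illustration.

```python
from datetime import date, timedelta

def choose_tier(last_read, today, max_idle_days=90):
    """Return 'disk' for recently read data and 'tape' otherwise.

    The 90-day threshold is an arbitrary assumption for illustration.
    """
    idle = today - last_read
    return "disk" if idle <= timedelta(days=max_idle_days) else "tape"

# A toy metadata catalog: data set name -> date it was last read.
catalog = {
    "run_2006_02.dat": date(2006, 2, 20),
    "run_2004_11.dat": date(2004, 11, 3),
}
today = date(2006, 3, 12)
placement = {name: choose_tier(last, today) for name, last in catalog.items()}
print(placement)
```

A real system would also weigh file size, staging requests, and the number of tape copies, but the metadata catalog plays the same role: it is the only record of where each data set currently lives.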
HSM products
• EMASS Inc. - AMASS (Archival Management and Storage System).
http://www.emass.com
• Veritas – www.veritas.com
• LSF – Sun Microsystems, Inc.
• HPSS (High Performance Storage System) – a consortium-led
product designed originally for weapons labs and now marketed by
IBM
• Tivoli Storage Manager - http://www-306.ibm.com/software/tivoli/products/storage-mgr/
And a word or two about EMC
• EMC has a variety of storage products ranging from desktop
backup to enterprise storage systems
• Dantz Retrospect 7 backup software for Windows
• Overall focus is on high-end spinning disk storage
• Including tiered spinning disk storage, backup to disk, and
content management systems.
Data Security
Backups
• A properly administered backup system and schedule is a must.
• How often should you back up? More frequently than the amount
of elapsed time it takes you to acquire an amount of data that
you can’t afford to lose.
• Backup schedules – full and incremental
– Example backup schedule
• 1st Sunday of month: full backup
• Incremental backups from Sunday on Monday, from
Monday on Tuesday, from Tuesday on Wednesday,
from Sunday on Thursday, from Thursday on Friday,
from Friday on Saturday
• Incremental from first Sunday on Second Sunday
• RAID disk enhances reliability of storage, but it’s not a substitute
for backups
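The practical consequence of a schedule like the one above is the number of backups that must be restored, in order, to recover a given day. The sketch below encodes one week of the example schedule as a table of "incremental from" links; the names `BASE_OF` and `restore_chain` are hypothetical.

```python
# Each backup records the backup it is incremental from; None marks a full backup.
# The week below follows the example schedule on the slide.
BASE_OF = {
    "Sun": None,
    "Mon": "Sun", "Tue": "Mon", "Wed": "Tue",
    "Thu": "Sun", "Fri": "Thu", "Sat": "Fri",
}

def restore_chain(day, base_of):
    """Return the backups to restore, oldest first, to recover the state at `day`."""
    chain = []
    while day is not None:
        chain.append(day)
        day = base_of[day]
    return list(reversed(chain))

print(restore_chain("Wed", BASE_OF))
print(restore_chain("Sat", BASE_OF))
```

Taking Thursday's incremental from Sunday rather than from Wednesday keeps the restore chain for the end of the week to four backups instead of seven, and it means a bad Monday-to-Wednesday tape does not affect Thursday onward.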
Backup
• Office automated backup systems provide backup against
system crashes & viruses. Cost - $100 to $500 or more
depending upon capacity
• Portable backup for Laptops – 5 GB hardcard
• Some backup systems
– Quantum (www.quantum.com)
– Omnibak (www.hp.com)
– Legato (www.legato.com)
– Tivoli (IBM)
– For single PCs backups to DVDs are a real option now!
Zipping files
• The use of zip utilities as a way to back up work is GREATLY
underappreciated
• WinZip - < $30 for a great utility for zipping groups of files
together (www.winzip.com)
• StuffIt - < $100, great utility for Windows, Mac, or Linux
• Tar – free unix/linux utility
• Compression is great, but the biggest utility is the ability to
group files together in one bundle!
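Bundling can also be scripted. The sketch below uses Python's standard `tarfile` module to pack a data file together with its data dictionary into one compressed archive; the file names and contents are made up, and everything is kept in memory so the example touches no disk.

```python
import io
import tarfile

def bundle(files):
    """Pack {name: bytes} into one gzip-compressed tar archive held in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)          # size must be set before adding
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()

def list_bundle(blob):
    """Return the member names stored in an archive produced by bundle()."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
        return archive.getnames()

# Hypothetical file names; a data file travels with its data dictionary.
blob = bundle({"exp_2_2.dat": b"0.0 139.5\n", "exp_2_2_dictionary.txt": b"col 1: time\n"})
print(list_bundle(blob))
```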
Version management
• CVS – concurrent version system
– Excellent for managing any sort of work that is regularly
updated and modified, such as documents and programs
– www.cvshome.org
– Intro for new users: www.cvshome.org/new_users.html
– When something is broken, or otherwise screwed up, CVS
allows you to backtrack reliably to a version that once
worked
• Sourceforge.net – a CVS-managed software repository for
open source software projects
Content Management
• Finding information in your own
webspace Google:
www.google.com/ enterprise/
products_landing.html
• Finding info on your own laptop:
Google Desktop Search:
desktop.google.com
• Managing things so that what’s
out there is what you really want
– many products, including
SiteRefresh
(www.refreshsoftware.com)
• Bibliographic software: EndNote
www.refreshsoftware.com/
SiteRefresh_Core_Content_Management
Disaster recovery
• If your data is too important to lose, then it’s too important to
have in just one copy, or have all of the copies in just one
location.
• Natural disasters, human factors (e.g. fire), theft (a significant
portion of laptop thefts have data theft as their purpose) can
all lead to the loss of one copy of your data.
• Offsite data storage is essential
– Vaulting services
– Remote locations of your business
– Online backup services are now a real option!
• www.usdatatrust.com/
• www.backup.com (< $100/year)
Data Security
• Some percentage of laptop thefts are intentional and aimed at
stealing data!
• Windows XP Professional
– Encrypting File System (EFS). But if your account is
destroyed or you forget the password...
– Recovery Agent provides a secondary account with the
ability to recover the data
• Other systems provide similar features
• And as before…. The 5 GB hardcard can be a real help
Security software
• Antivirus software: Symantec and others
• Antispyware software: Spyware Eliminator and others
(DON’T USE ANYTHING THAT’S FREE TO
DOWNLOAD!!!!!!!)
• Use Firefox or other browser - not Internet Explorer
• Scanning your systems (if your server is not scanned
regularly, it’s not secure)
– Open source: Nessus - www.nessus.org/
– Commercial derivative – Tenable Network Security
(www.tenablesecurity.com/products/)
• In general, beware of software that you can download for free
that is not clearly an open source product!!!
Legal ramifications
• HIPAA (Health Insurance Portability and Accountability Act)
– Basically requires that any personally identifiable health
data be kept totally secure
(which generally means encrypted)
– Good source of information: http://www.hipaa.org/
• FDA 21 CFR Part 11
– Basically requires that any data used in drug development
have a full audit trail
– Good source of information: http://www.21cfrpart11.com/
Getting rid of data (with certainty!)
• Deleting files is not enough!
• Wiping Utilities
– Symantec Ghost's gdisk utility (used in combination with
the "/diskwipe /dod" flags)
(enterprisesecurity.symantec.com/products/products.cfm?
productID=3)
– Declasfy
(www.dmares.com/maresware/df.htm#DECLASFY)
• Hard disk destruction services
– E.g. Webroot ecosafe disk destruction
(www.webroot.com/wb/products/ecosafe/index.php)
Data Management strategies, part II
Your own applications in Perl or C
• Perl
– Practical Extraction and Report Language
– Pathologically Eclectic Rubbish Lister
– It’s a bit of both
– Perl is a good way to manipulate small amounts of data in a
prototype setting, but performance in a production setting
will probably seem inadequate
• Use Perl to prototype, then rewrite the final application in C or
C++
LIMS systems
• The opposite of data reduction….
• Developed for petrochemical and pharmaceutical applications
– Highly repetitive tests
– Regular comparisons with standards
– Legal compliance issues are often involved
• If you need a LIMS system, good rule of thumb is 10X
expansion of storage needs
• Assume a LIMS system will require at least 0.5 FTE dedicated
staff for a lab or lab group
LIMS systems, con’t
• Sapphire (Made by LabVantage). http://labvantage.com/
– One of the standard large LIMS
– Very good on regulatory compliance
• Nautilus (Made by Thermo Electron Corp
http://www.thermo.com/com/cda/product/detail/1,1055,10380,00.html)
– Good LIMS system, perhaps the best of the easier LIMS to
use
• Good source of review information: LIMSource
http://www.limsource.com/home.html
Laboratory Electronic Notebook
• Intuitively similar function – computerizing lab processes
• The concept is that a LEN should be less constraining than a LIMS
• Results thus far are mixed
• Two example systems
– Tripos Electronic Notebook
http://www.tripos.com/sciTech/enterpriseInfo/opInfoTech/ten.html
– DOE 2000 Electronic Notebook
http://www.csm.ornl.gov/enote/
Database Definitions
• Database management system: A collection of programs that
enables you to store, modify, and extract information from a
database.
• Types of DBMSs: relational, network, flat, and hierarchical.
• If you need a DBMS, you need a relational DBMS
• Query: a request to extract data from a database, e.g.:
– SELECT ALL WHERE NAME = "JONES" AND AGE > 21
• SQL (structured query language) – the standard query
language
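The query on this slide can be tried out with the SQLite engine that ships with Python. This is only a minimal sketch: the table name people, its two columns, and the three rows are assumptions made up for illustration, not part of the original tutorial.

```python
import sqlite3

# Hypothetical table and rows, invented only to make the query runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people (name, age) VALUES (?, ?)",
    [("JONES", 34), ("JONES", 19), ("SMITH", 52)],
)

# The slide's query in standard SQL, with the literal values bound as parameters.
rows = conn.execute(
    "SELECT name, age FROM people WHERE name = ? AND age > ?",
    ("JONES", 21),
).fetchall()
print(rows)
```

Only the JONES row with an age above 21 comes back; the other two rows each fail one of the two conditions.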
Relational Databases*
• Relational Database theory developed at IBM by E.F. Codd
(1969)
• Codd's Twelve Rules – the key to relational databases but
also good guides to data management generally.
• Codd’s work is available in several venues, most extensively
as a book. The number of rules has now expanded to over
300, but we will start with rules 1-12 and the 0th rule.
• 0th rule: A relational database management system (DBMS)
must manage its stored data using only its relational
capabilities.
• *Based on Tore Bostrup. www.fifteenseconds.com
Codd’s 12 rules
• 1. Information Rule. All information in the database should be
represented in one and only one way -- as values in a table.
• 2. Guaranteed Access Rule. Each and every datum (atomic
value) is guaranteed to be logically accessible by resorting to
a combination of table name, primary key value, and column
name.
• 3. Systematic Treatment of Null Values. Null values (distinct
from empty character string or a string of blank characters
and distinct from zero or any other number) are supported in
the fully relational DBMS for representing missing information
in a systematic way, independent of data type.
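A small sketch of what rule 3 means in practice, using SQLite from Python. The readings table and its values are hypothetical. The point is that a missing value (NULL) is handled differently from a zero or an empty string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: specimen 15 has a real zero and an empty note,
# specimen 16 has no value and no note at all (NULL).
conn.execute("CREATE TABLE readings (specimen INTEGER, value REAL, note TEXT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(14, 35.0, "ok"), (15, 0.0, ""), (16, None, None)],
)

missing = conn.execute(
    "SELECT specimen FROM readings WHERE value IS NULL").fetchall()
zero = conn.execute(
    "SELECT specimen FROM readings WHERE value = 0").fetchall()
empty_note = conn.execute(
    "SELECT specimen FROM readings WHERE note = ''").fetchall()

# Aggregates skip NULLs, so the mean is taken over specimens 14 and 15 only.
mean_value = conn.execute("SELECT AVG(value) FROM readings").fetchone()[0]
print(missing, zero, empty_note, mean_value)
```

Coding a missing measurement as 0 or as a blank would silently change the mean, which is why the rule insists on a separate, systematic marker for missing data.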
Codd’s 12 rules, con’t
• 4. Dynamic Online Catalog Based on the Relational Model.
The database description is represented at the logical level in
the same way as ordinary data, so authorized users can apply
the same relational language to its interrogation as they apply
to regular data.
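As one illustration of rule 4, SQLite keeps its catalog in a table called sqlite_master that can be queried with ordinary SELECT statements. The specimens table below is a hypothetical example; other database products expose their catalogs under different names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, created only so that the catalog has something in it.
conn.execute(
    "CREATE TABLE specimens (specimen_id INTEGER PRIMARY KEY, species TEXT)")

# The catalog is queried with the same SQL used for ordinary data.
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY name").fetchall()
definition = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'specimens'").fetchone()[0]
print(catalog)
print(definition)
```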
Codd’s 12 rules, con’t
• 5. Comprehensive Data Sublanguage Rule. A relational
system may support several languages and various modes of
terminal use. However, there must be at least one language
whose statements are expressible, per some well-defined
syntax, as character strings and whose ability to support all of
the following is comprehensive:
– data definition
– view definition
– data manipulation (interactive and by program)
– integrity constraints
– authorization
– transaction boundaries (begin, commit, and rollback).
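The last item in the list, transaction boundaries, can be sketched with SQLite from Python. The measurements table and its values are hypothetical. Passing isolation_level=None puts the connection in autocommit mode so that BEGIN, COMMIT, and ROLLBACK are issued explicitly as SQL strings.

```python
import sqlite3

# Autocommit mode: transactions start and end only where the SQL says so.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE measurements (specimen INTEGER, value REAL)")

# A transaction that is kept.
conn.execute("BEGIN")
conn.execute("INSERT INTO measurements VALUES (14, 35)")
conn.execute("COMMIT")

# A transaction that is abandoned, e.g. after spotting a bad reading.
conn.execute("BEGIN")
conn.execute("INSERT INTO measurements VALUES (14, 9999)")
conn.execute("ROLLBACK")

count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
print(count)
```

Only the committed row survives; the rolled-back insert leaves no trace in the table.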
Codd’s 12 rules, con’t
• 6. View Updating Rule. All views that are theoretically
updateable are also updateable by the system.
• 7. High-Level Insert, Update, and Delete. The capability of
handling a base relation or a derived relation as a single
operand applies not only to the retrieval of data, but also to
the insertion, update, and deletion of data.
• 8. Physical Data Independence. Application programs and
terminal activities remain logically unimpaired whenever any
changes are made in either storage representation or access
methods.
Codd’s 12 rules, con’t
• 9. Logical Data Independence. Application programs and
terminal activities remain logically unimpaired when
information preserving changes of any kind that theoretically
permit unimpairment are made to the base tables.
• 10. Integrity Independence. Integrity constraints specific to a
particular relational database must be definable in the
relational data sublanguage and storable in the catalog, not in
the application programs.
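A minimal sketch of rule 10, assuming a hypothetical measurements table: the constraints are declared in the table definition, so the database itself rejects bad rows no matter which program tries to insert them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The constraints live in the schema (and hence in the catalog),
# not in application code.
conn.execute("""
    CREATE TABLE measurements (
        specimen_id    INTEGER NOT NULL,
        measurement_no INTEGER NOT NULL,
        value          REAL CHECK (value >= 0),
        PRIMARY KEY (specimen_id, measurement_no)
    )
""")
conn.execute("INSERT INTO measurements VALUES (14, 1, 35)")

rejected = []
# First row violates the CHECK constraint, second duplicates the primary key.
for bad_row in [(14, 2, -5), (14, 1, 40)]:
    try:
        conn.execute("INSERT INTO measurements VALUES (?, ?, ?)", bad_row)
    except sqlite3.IntegrityError:
        rejected.append(bad_row)

valid_count = conn.execute(
    "SELECT COUNT(*) FROM measurements").fetchone()[0]
print(rejected, valid_count)
```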
Codd’s 12 rules, con’t
• 11. Distribution Independence. The data manipulation
sublanguage of a relational DBMS must enable application
programs and terminal activities to remain logically
unimpaired whether and whenever data are physically
centralized or distributed.
• 12. Nonsubversion Rule. If a relational system has or
supports a low-level (single-record-at-a-time) language, that
low-level language cannot be used to subvert or bypass the
integrity rules or constraints expressed in the higher-level
(multiple-records-at-a-time) relational language.
The problem with (some) DBMS computer science
• Database theory is wonderful stuff
• It is sometimes possible to get so caught up in the theory of
how you would do something that the practical matters of
actually doing it go by the wayside
• This is particularly true of the concept of “normal forms” – only
three of which we will cover
Some terminology
Formal Name    Common Name    Also known as
Relation       Table          Entity
Tuple          Row            Record
Attribute      Column         Field
A key is a field that *could* serve as a unique identifier of
records. The Primary key is the one field chosen to be the
unique identifier of records.
First Normal Form
• Reduce entities to first normal form (1NF) by removing
repeating or multivalued attributes to another, child entity.
Before:
Specimen #    Measurement #1    Measurement #2    Measurement #3
14            35                43                38

After:
Specimens
Specimen #
14

Specimen #    Measurement #    Value
14            1                35
14            2                43
14            3                38
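The move to first normal form shown on this slide can be sketched in a few lines of Python. The function name and the dictionary layout of the input are assumptions chosen for illustration; the specimen number and measurement values are the ones from the slide.

```python
def to_first_normal_form(wide_rows):
    """Split rows with repeating measurements into a parent and a child list.

    Returns (specimens, measurements), where measurements holds one
    (specimen, measurement number, value) tuple per reading.
    """
    specimens = []
    measurements = []
    for row in wide_rows:
        specimens.append(row["specimen"])
        for number, value in enumerate(row["measurements"], start=1):
            measurements.append((row["specimen"], number, value))
    return specimens, measurements


# One specimen with three repeated measurement columns, as on the slide.
wide = [{"specimen": 14, "measurements": [35, 43, 38]}]
specimens, measurements = to_first_normal_form(wide)
print(specimens)
print(measurements)
```

A fourth measurement now means one more row in the child list rather than a new column in the table design.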
Second Normal Form
• Reduce first normal form entities to second normal form (2NF)
by removing attributes that are not dependent on the whole
primary key.
Before:
Specimen #    Measurement #    Species          Value
14            1                M. musculus      35
14            2                M. musculus      43
16            3                R. norvegicus    38

After:
Specimens
Specimen #    Species
14            M. musculus
16            R. norvegicus

Specimen #    Measurement #    Value
14            1                35
14            2                43
16            3                38
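A sketch of the second normal form layout in SQLite, using the values from this slide. The table and column names are assumptions. Species depends only on the specimen, so it is stored once per specimen, and a join brings it back alongside each measurement when needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE specimens (
        specimen_id INTEGER PRIMARY KEY,
        species     TEXT
    );
    CREATE TABLE measurements (
        specimen_id    INTEGER REFERENCES specimens (specimen_id),
        measurement_no INTEGER,
        value          REAL,
        PRIMARY KEY (specimen_id, measurement_no)
    );
    INSERT INTO specimens VALUES (14, 'M. musculus'), (16, 'R. norvegicus');
    INSERT INTO measurements VALUES (14, 1, 35), (14, 2, 43), (16, 3, 38);
""")

# The join reproduces the original, un-split table on demand.
joined = conn.execute("""
    SELECT m.specimen_id, m.measurement_no, s.species, m.value
    FROM measurements AS m
    JOIN specimens AS s ON s.specimen_id = m.specimen_id
    ORDER BY m.measurement_no
""").fetchall()
print(joined)
```

Correcting a species name now takes one update to one row, instead of one update per measurement.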
Third Normal form
• Reduce second normal form entities to third normal form
(3NF) by removing attributes that depend on other,
nonkey attributes (other than alternative keys).
• It may at times be beneficial to stop at 2NF for
performance reasons!
Before:
Specimen #    Measurement #    O2 consumption    Mass    O2 consumption per gram
14            1                35                14      2.50
14            2                43                15      2.87
16            3                85                28      3.04

After:
Specimen #    Measurement #    O2 consumption    Mass
14            1                35                14
14            2                43                15
16            3                85                28
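In the slide's example, O2 consumption per gram depends on two nonkey attributes (O2 consumption and mass), so third normal form drops it as a stored column. One possible way to keep it available is to derive it in a view, sketched here in SQLite with assumed table, column, and view names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurements (
        specimen_id    INTEGER,
        measurement_no INTEGER,
        o2_consumption REAL,
        mass           REAL,
        PRIMARY KEY (specimen_id, measurement_no)
    );
    INSERT INTO measurements VALUES
        (14, 1, 35, 14), (14, 2, 43, 15), (16, 3, 85, 28);

    -- The derived value is computed when queried, never stored.
    CREATE VIEW measurements_per_gram AS
        SELECT specimen_id,
               measurement_no,
               ROUND(o2_consumption / mass, 2) AS o2_per_gram
        FROM measurements;
""")

per_gram = conn.execute("""
    SELECT measurement_no, o2_per_gram
    FROM measurements_per_gram
    ORDER BY measurement_no
""").fetchall()
print(per_gram)
```

If a mass is corrected later, the per-gram figure follows automatically. Storing the derived column instead, as the slide notes for performance reasons, trades that safety for faster reads.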
On to database products
• Microsoft Access – Common, relatively inexpensive,
moderately scalable. Widely used for managing small
scientific data projects. Good linkages to Excel and stat
software
• Microsoft SQL Server – More scalable – commonly used for
departmental (or larger) databases
• Oracle – Common, relatively more expensive, extremely
robust and scalable
• DB2 – Relatively common, IBM’s commercial database
application
• MySQL – Becoming more common, free, good for prototyping
and small-scale applications
• Access databases can be very sophisticated
• Access versions can be an issue
• Daily backups are critical for all databases
Database applications and the web?
• An Open Source option
– MySQL - database
– PHP - web scripting application
– Apache - web server
• Oracle and its web modules
• Stat package and web modules
XML
• The Extensible Markup Language (XML) is the universal format
for structured documents and data on the Web.
• http://www.w3.org/XML/
• Half of “XML in 10 points” (http://www.w3.org/XML/1999/XML-in-10-points)
– XML is for structuring data. XML makes it easy for a computer
to generate data, read data, and ensure that the data
structure is unambiguous.
– XML looks a bit like HTML. Like HTML, XML makes use of
tags (words bracketed by '<' and '>') and attributes (of the
form name="value").
– XML is text, but isn't meant to be read.
– XML is verbose by design. (And it’s *really* verbose)
– XML is a family of technologies. (This leads to the opportunity
to create discipline-specific XML templates)
XML
• XML really is one of the most important data presentation
technologies to be developed in recent years
• XML is a meta-markup language
• The development and use of DTDs (document type definition)
is time consuming, critical, and subject to the usual laws
regarding standards
• XML is a way to present data, but not a good way to organize
lots of data
• XML is VERBOSE!!!!!!
Some XML examples
• Chemical Markup Language http://www.xml-cml.org/
• Extensible Data Format http://xml.gsfc.nasa.gov/XDF/XDF_home.html
• CellML
• SBML (Systems Biology Markup Language) www.sbml.org
• MathML www.mathml.org
– (a + b)^2 (from
www.dessci.com/en/support/tutorials/mathml/gitmml/big_picture.htm)
<msup>
  <mfenced>
    <mrow>
      <mi>a</mi>
      <mo>+</mo>
      <mi>b</mi>
    </mrow>
  </mfenced>
  <mn>2</mn>
</msup>
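Because the markup preserves structure, a program can recover the expression from it. The sketch below uses Python's standard xml.etree.ElementTree on a fragment equivalent to the one on this slide. The way the tokens are joined back into text is an assumption made for illustration, not a general MathML renderer.

```python
import xml.etree.ElementTree as ET

mathml = """
<msup>
  <mfenced>
    <mrow>
      <mi>a</mi>
      <mo>+</mo>
      <mi>b</mi>
    </mrow>
  </mfenced>
  <mn>2</mn>
</msup>
"""

root = ET.fromstring(mathml)
# msup has exactly two children: the base and the exponent.
base, exponent = list(root)

# Collect the leaf tokens (identifiers and operators) inside the base.
tokens = [element.text for element in base.iter() if len(element) == 0]
expression = "(" + " ".join(tokens) + ")^" + exponent.text
print(expression)
```

The same walk over the element tree works whether or not the tokens are wrapped in an mrow, since only leaf elements are collected.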
XML issues
• Great technology
• Good commercial authoring systems available or in
development
• The problem with standards….
• Perhaps the biggest challenge in XML is the fact that it is so
easy to put together a web site and propose a DTD as a
standard, making the creation of real standards a challenge
More about Markup Languages
• Some important MLs
– MathML
– CML (Chemical Markup Language)
– MAGE-ML – gene expression chip (microarray) data
– SBML – Systems Biology Markup Language
– CellML
• XML editing and authoring tools
– Altova - www.altova.com/
– Oxygen - www.oxygenxml.com/
– Don’t even try without an editing tool
Web Services and the Semantic Web
• Service providers publish to a repository
• Clients look up services from that repository
• Clients and providers then interact
• If you have a teenager, web services are being used in your
home
• The goal of Web services is not to make hard things easy. It is
to make EXTREMELY hard things manageable, reliable, and
secure
• Semantic web. Semantics is the study of meaning in language
and communication. The goal of the semantic Web is to
provide unambiguous communications at the meaning level.
• Key standard setting body: W3C http://www.w3.org/
XML vs PDF
• PDF files are essentially universally readable. PDF file
formats give you a picture of what was once data in a fashion
that makes retrieval of the data hard at best.
• XML requires a bit more in terms of software, but preserves
the data as data, that others can interact with.
• Utility of XML and PDF interacts with proprietary concerns,
institutional concerns, and community concerns – which are
not always in harmony!
Data exchange among heterogeneous formats
• I have data files in SAS, SPSS, Excel, and Access formats.
What do I do?
• Each of the more widely used stat packages contains
significant utilities for exchanging data. Stat/Transfer (from
Circle Systems, also sold by Stata) is a dedicated package for this
• DBMS/Copy (Conceptual Software) is probably the best
software for exchange among heterogeneous formats
Distributed Data
• Data warehouses
• Data federations
• Distributed File Systems
• External data sources
• Data Grids
Data warehouses
• In a large organization one might want to ask research
questions of transactional data. And what will the MIS folks
say about this?
• Transactions have to happen now; the analysis does not
necessarily have to.
• Data warehousing is the coordinated, architected, and
periodic copying of data from various sources, both inside and
outside the enterprise, into an environment optimized for
analytic and informational processing (Definition from “Data
Warehousing for Dummies” by Alan R. Simon)
Getting something out of the data warehouse
• Querying and reporting: tell me what’s what
• OLAP (On-Line Analytical Processing): do some analysis and
tell me what’s up, and maybe test some hypotheses
• Data mining: Atheoretic. Give me some obscure information
about the underlying structure of the data
• EIS (Executive Information Systems): boil it down real simple
for me
• More buzzwords:
– Data Mart: Like a data warehouse, but perhaps more
focused. [Term often used by the newly renamed and
reorganized Data Mart team after a fiasco]
– Operational Data Store: Like a data warehouse, but the
data are always current (or almost). [Day traders]
Web-accessible databases
• Especially prominent in biomedical sciences. E.g. NCBI:
• Entrez http://www.ncbi.nlm.nih.gov/entrez/
• PubMed
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
– Provides access to over 11 million MEDLINE citations
• Nucleotide
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
– Collection of sequences from several sources, including GenBank,
RefSeq, and PDB.
• Protein
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
• Genome
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
– The whole genomes of over 800 organisms.
Federated databases
• Databases tied together in a way
that permits data retrieval
(generally) and perhaps data
writing
• Benefits of federated approach:
– Local access control. Lets data owner
control access
– Acknowledges multiple sources of data
– By focusing on the edges of contact,
should be more flexible over the long
run
• Shortcomings: Right now,
significant hand work in
constructing such systems
• Example product: IBM’s
DiscoveryLink
KEGG pathway information
Knowledge management, searches, and controlled vocabularies
• A tremendous amount of effort has gone into natural
language processing, AI, knowledge discovery, etc. with
results ranging from mixed to disappointing.
• If you want to be able to search large volumes of data on an
ad-hoc basis, then controlled vocabularies are essential.
Results here are mixed as well, but at least the problems are
sociological, not technological.
• Examples:
– GO (Gene Ontology) Gene Ontology Consortium,
http://www.geneontology.org/
– MeSH (Medical Subject Headings)
http://www.nlm.nih.gov/mesh/meshhome.html
Grids
• What’s a grid? Hottest current buzzword
• A way to link together disparate, geographically distributed
computing resources to create a meta-computing facility
• The term ‘computing grid’ was coined in analogy to the electrical
power grid
• Three types of grids:
– Compute
– Collaborative
– Data
What’s a TeraGrid and Why?
• What is the TeraGrid?
• What is it trying to accomplish?
• How and why did IU get involved?
www.teragrid.org
TeraGrid Deep, Wide, Open
• Deep – Providing powerful resources for deep
computation to enable scientific discovery
• Wide – Providing access to resources in a transparent
fashion to support wide community access
• Open – TeraGrid is an open partnership of resource
providers, users and technologists creating an
expanding service-oriented, standards-based
foundation for eScience
• Portals and Gateways
– LEAD (Linked Environments for Atmospheric
Discovery) provides weather forecast
modeling tools
Future of computing
• The PC market will continue to be driven largely by home
uses (esp games)
• In scientific data management, the utility of computing
systems will be less determined by chip speeds and more by
memory and disk configurations, and internal and external
bandwidth
• And the future is uncertain!
• It may be best to take intermediate term views of the future – 3 to 5
to perhaps 10 years, and build into your thinking the constant need
to refresh
The ongoing challenge
• One of the key problems in data storage is that you can’t just
store it. Data stored and left alone is unlikely under most
circumstances to be readable – and less likely to be
comprehensible and useable – in 20 years. The problem, of
course, is that there is an ever increasing need for
tremendous longevity in the utility of data. Because of this it is
essential that data receive ongoing curation, and migration
from older media and devices to newer media and devices.
Only in this way can data remain useful year after year.
A few pointers to references
• J. Simon. 2003. Excel Data Analysis. Wiley Publishing
• Statistical software: tutorials on www.indiana.edu/~statmath
• R. Stephens and R. Plew. 2003. Teach yourself beginning databases in 24 hours.
SAMS Publishing
• Online Training Solutions. 2004. Step by Step Microsoft Access. Microsoft Press
• A. Barrows. 2004. Access 2003 for Dummies. IDG
• A. Khurshudov. 2001. The essential guide to computer data storage. Prentice Hall
• Alan R. Simon. 1997. Data warehousing for Dummies. IDG Books
• E.R. Harold & W. Scott Means. 2001. XML in a nutshell. O’Reilly
• R. Schmelzer et al. 2002. XML and web services unleashed. SAMS Publishing
• J. Bean. 2003. XML for data architects. Morgan Kaufmann
• G.M. Nielson, H. Hagen, H. Mueller. 1997. Scientific Visualization. IEEE Computer Society
• C. Gibas & P. Jambeck. 2001. Developing bioinformatics computer skills. O’Reilly
Thanks!
• Feel free to email questions to me at any time:
stewart@iu.edu