1
Scientific Data Management
Craig A. Stewart
stewart@iu.edu
University Information Technology Services
Indiana University
Copyright 2004 – All rights reserved
10 March 2004
License terms
• Please cite as: Stewart, C.A. 2004. Scientific Data Management.
Tutorial presented at PittCon 2004, 7-12 Mar, Chicago, IL.
http://hdl.handle.net/2022/13998
• Some figures are shown here taken from web, under an
interpretation of fair use that seemed reasonable at the time and
within reasonable readings of copyright interpretations. Such
diagrams are indicated here with a source url. In several cases
these web sites are no longer available, so the diagrams are
included here for historical value. Except where otherwise noted, by
inclusion of a source url or some other note, the contents of this
presentation are © by the Trustees of Indiana University. This
content is released under the Creative Commons Attribution 3.0
Unported license (http://creativecommons.org/licenses/by/3.0/).
This license includes the following terms: You are free to share – to
copy, distribute and transmit the work and to remix – to adapt the
work under the following conditions: attribution – you must
attribute the work in the manner specified by the author or licensor
(but not in any way that suggests that they endorse you or your use
of the work). For any reuse or distribution, you must make clear to
others the license terms of this work.
2
Why a tutorial on Scientific Data
Management at Pittcon?
• As scientific research becomes more oriented towards
high-volume lab work, there will be increasing problems
in managing large volumes of data.
• Regulatory changes are having an important impact on
laboratory data management.
• It is becoming increasingly important to assure long-term preservation of data of all sorts; techniques
developed and understood in the scientific data
management area can help.
• This class was given successfully at previous LIMS
conferences
3
The key matter to be discussed
today
• Once you have collected your data, and the data has
been written into an output file:
– On what storage medium/system,
– and in what logical structure,
– and with what sort of record of custody
• should data be stored to assure its long-term
readability and utility?
4
5
Approach & Goals
• Approach
– This tutorial casts a very wide net.
– We will cover a large span of scale – ranging from
single spreadsheets to hundreds of TBs of data.
• Goals
– Explain the key problems, and the concepts and
nomenclature surrounding the problems
– Identify some of what might be the right answers,
and a few of the answers that are definitely wrong
– Provide information and references so you can
independently pursue matters of interest to you.
– At the end of the tutorial, you might not be ready
to design a large data management system, but
you will know how to start
6
Sources & format
• No single text covers this
material in the manner discussed in this tutorial. CAS
is an expert in some of the areas to be discussed
today, but not all. Expect extensive footnoting and
acknowledgement of other sources.
• The level of detail is intentionally uneven. Greater
detail is generally associated with one of two factors:
– A topic is sufficiently straightforward that some
details will let the participant go off and do
something on her/his own.
– A topic is especially important and is included in
the notes in detail so participants may refer to it
later. (In this case we may skim over some details
during the actual presentation).
7
Outline
• The problem – the data deluge, plus data doesn't age
as gracefully as you (probably) think
• Physical storage of data: RAID, tapes, CDs, etc.
• Data security & Legal issues
• Data management strategies:
– Flat files
– Excel as a scientific data management tool
– Relational databases
• Specialized scientific data storage formats
– Data exchange among heterogeneous formats.
– Data warehouses, federations, and grids
– Visualization and collection-time data reduction
• Closing thoughts
Bits, Bytes, and the proof that
CDs have consciousness
• A bit is the basic unit of storage, and is always either a
1 or a 0.
• 8 bits make a byte, the smallest usual unit of storage in
a computer.
• MegaByte (MB) - 1,048,576 bytes (A CD-ROM holds ~
600 MB)
• GigaByte (GB) – ~ 1 billion bytes
• TeraByte (TB) - ~ 1 trillion bytes (a large library might
have ~1 TB of data in printed material)
• PetaByte (PB) – 1 thousand TBs
• ExaByte (EB) – 1 thousand PBs
8
9
The problem of scientific data
management
Explosion of data and need to
retain it
• Science historically has
struggled to acquire data;
computing was largely used
to simulate systems without
much underlying data
• Lots of data:
– Lots of data available
“out there”
– Dramatically accelerating
ability to produce new
data
• One of the key challenges,
and one of the key uses of
computing, is now to make
sense out of data so
easily produced
• Need to preserve availability
of data for ???
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
10
11
Accelerating ability to produce new data
• Diffractometer – 1
TB/year
• Synchrotron – 60
GB/day bursts
• Gene expression chip
readers – 360 GB/day
• Human Genome – 3
GB/person
• High-energy physics –
1 PB per year
http://atlasinfo.cern.ch/Atlas/Welcome.html
12
Some things to think about
• 25 years ago data was stored on punched tape or
punched cards
• How would you get data off an old Apple II+ diskette?
How about one of those high-density 5 ¼” DOS
diskettes?
• The backup tape in the sock drawer (especially if it’s a
VMS backup tape of an SPSS-VMS data file)
• The no-longer-easily-handled data file on a CD (e.g.
1990 Census data)
• Data is essentially irreproducible more than a short
period of time after the fact
Have you even tried to read one
of your old data files?
Exp_2_2_feb_14_1981
30 0
30 0
0.0 139.5 000.0
5.0 142. ...
[The rest of the file is column after column of unlabeled numbers –
e.g. .0060, .5760, .02123, -20.48, 98.4571, 408.03 – with no data
dictionary, no units, and no record of what instrument produced them.]
13
14
Even a small file can be
undecipherable!
1 m 1  99 1 210
2 F 2 320 2 420
3 F 2 195 2 350
4 M 1 110 1 215
5 M 2 218 2 364
6 F 3 120 1 355
7 M 3 125 1 355
15
And something even older…
Hwæt! We Gardena in geardagum,
þeodcyninga, þrym gefrunon,
hu ða æþelingas ellen fremedon.
Oft Scyld Scefing sceaþena þreatum…
This is from Beowulf, written 1,000 years ago. Think about
the language problem relative to the half-life of
radioactive waste!
16
Physical storage of data:
tapes,
CDs, disk
17
Durability of media
• Stone: 40,000 years
• Ceramics: 8,000 years
• Papyrus: 5,000 years
• Parchment: 3,000 years
• Paper: 2,000 years
• Magnetic tape: 10 years (under ideal conditions; 3-5
more conservative)
• CD-RW: 5-10 years (under ideal conditions; 1.5 years
more conservative)
• Magnetic disk: 5 years
• Even if the media survives, will the technology to read
it survive?
18
Data storage: media issues
• So what do you do with data on a paper tape?
• Long term data storage inevitably forces you to confront two
issues:
– the lifespan of the media
– the lifespan of the reading device
• Removable magnetic media
– The right answer to any long-term (or even intermediate-term)
data storage problem is almost never diskettes. It's
always a race between the lifespan of the media and the
lifespan of the readers.
– Esoteric removable magnetic media are never a good idea.
Even Zip drives are probably not a good bet in the long
run. What do you do with a critical data set when your only
copy is on a Bernoulli drive?
19
Magnetic Tapes
• Tapes store data in tracks on a magnetic medium. The
actual material on the tape can become brittle and/or
worn and fall off.
• Tapes are best used in machine room environments with
controlled humidity.
• There are three situations in which tapes are the right
choice:
– Within production machine rooms
– As backup media
– For transfer between machine rooms under some
circumstances
20
Tape formats
• There are several formats with small user bases; these
should probably be avoided.
• DAT tapes don’t last well
• For system backups of office, lab, or departmental
servers, Digital Linear Tape (DLT) is the best choice
• In machine rooms, Linear Tape Open (LTO) is the best choice.
(http://www.lto-technology.com/)
• LTO is a multi-vendor standard with two variants:
– Accelis: faster, lower capacity, lower popularity
– Ultrium: 10-20 MB/s, high capacity (100 GB/tape;
200 GB with compression). Excellent for write-intensive
applications
21
• 3480 Tape: 200 MB
• 3490 Tape: 400 MB
• 3490e: 800 MB
• 3590 Tape: 10-30 GB
• 3590e: 20-60 GB
• Transfer rates: 3-20 MB/sec
• Estimated media life: 30 years
22
Tape Robots
• STK Tape Silo
– Holds thousands of
tapes
– 2.4 PB total capacity
• Xcerta tape reader
– Holds 10 tapes
– 600 GB total capacity
http://www.tapedrives-3480to3590.com/134-04-20075/
23
Tape conversion services
• If you are presented with data in a physical format you
can’t read, there are several services that outsource
data recovery
• Do be careful of two issues that may be separate:
getting files off a tape for which you have no reader;
getting the files into a format you can read with
software you have.
• There are several of these companies. Two examples:
– Mueller Media Corporation
http://www.mullermedia.com/ (capabilities include
recovery in a fashion suitable for litigation purposes)
– Legacy Engineering
http://www.legacyconversions.com/
24
Non-magnetic removable media
• CD – Compact Disc, 650-703 MB
• CD-ROM – CD-Read Only Memory
• CD-RW – CD-Read/Write
• CD speeds: 12x2x24 (1x = 150 KB/s)
• DVD-R – Digital Versatile Disc, 4.7 GB
• DVD-RAM
• DVD-RW – DVD-Read/Write (Pioneer)
• DVD+RW – Another DVD Read/Write
(Sony/Philips) – the likely future most
common standard
• Don’t set any of these on your
dashboard!
CD-RW diagram
http://www.pctechguide.com/09cdrrw.htm#CD-R
25
CDs and DVDs con’t
• For routine, reliable, reasonably dense storage of data
around the lab, you can’t beat CDs or DVDs.
• CD writers are commonplace & reliable
• DVD writers are newer, more costly, and more prone to
format issues.
• Always be sure to have extensive and complete
information on the CD – including everything you need
to know to remember what it really is later. There
should be no data physically on the CD that is not
contained in a file burned on the CD.
• Watch out for longevity issues!!
– CD R/W – can be rewritten up to 1,000 times
– Shelf life 5-10 years
26
CD & DVD Jukeboxes
• Jukeboxes are good for
what they do
• Because the basic media
are standard, if you had
to ditch your investment
in the jukebox itself you
could still reuse the
media
• 240 CD jukebox at left
from
http://www.kubikjukebox.com/index.htm
27
CD & DVD Jukeboxes, con’t
• System shown at left
holds 16 jukeboxes; each
holds 240 CDs
• http://www.kubikjukebox.com/index.htm
28
Spinning disk storage
• JBOD (Just a Bunch of Disks) – fine so long as it's acceptable to
lose data now and again. High speed access, takes
advantage of relatively low cost of disk drives. Good for
temporary data parking while data awaits reduction.
• RAID (Redundant Array of Independent Disks) – what you need
if you don’t want to lose data.
• Lifecycle replacement an issue in both cases
29
Types of disk
• SCSI (Small Computer System Interface)
• ATA (Advanced Technology Attachment) or IDE
– Intended for “internal to server” use – 40 cm cables
– Most people mean ATA when they say IDE (Intelligent Drive
Electronics). Most people also mean Parallel ATA when
they say ATA
• Enhanced IDE, a newer version of IDE developed by Western
Digital Corporation, also called ATA-2
• Serial ATA
– Evolutionary replacement for ATA
– Thinner, longer cables – 1 meter
• Fibre channel – ANSI standard for a machine room fabric
connecting disks
30
Disk Trends
• Capacity: doubles each year
• Transfer rate: 40% per year
• MB per $: doubles each year
• Currently – around $5,000 per GB for cheapest options
31
RAID*
• Level 0: Provides data striping (spreading out blocks of each file
across multiple disks) but no redundancy. This improves
performance but does not deliver fault tolerance.
• Level 1: Provides disk mirroring.
• Level 3: Same as Level 0, but also reserves one dedicated disk for
error correction data. It provides good performance and some level
of fault tolerance.
• Level 5: Provides data striping at the byte level and also stripes error
correction information. This results in excellent performance and
good fault tolerance.
32
RAID 3
“This scheme consists of an array of HDDs for data and one
unit for parity. … The scheme generates XOR (exclusive-or)
parity derived from bit 0 through bit 7. If any of the HDDs
fail, it restores the original data by an XOR between the
redundant bits on other HDDs and the parity HDD. With RAID
3, all HDDs operate constantly.”
http://www.studiostuff.com/ADTX/adtxwhatisraid.html
33
RAID 5
“RAID5 implements striping and parity. In RAID5,
the parity is dispersed and stored in all HDDs. ….
RAID5 is most commonly used in the products on
market these days.”
*http://www.studio-stuff.com/ADTX/adtxwhatisraid.html
34
But it takes more than RAID…
• If you have RAID disk arrays, that provides reliable
access to data (within a machine room) so long as you
don’t lose a disk controller
• To ensure that your data stays available (so long as you
have power), each disk array must be attached to two
servers simultaneously
35
NAS and SAN
• Storage Area Network (SAN) is a high-speed subnetwork
of shared storage devices. A storage device is a machine
that contains nothing but a disk or disks for storing
data. A SAN's architecture works in a way that makes all
storage devices available to all servers on a LAN or WAN.
• A network-attached storage (NAS) device is a server
that is dedicated to file sharing through some
protocol such as NFS. NAS does not provide any of
the activities that a server in a server-centric
system typically provides, such as e-mail,
authentication or file management. …
• Definitions modified from www.webopedia.com
• EMC now offers the best of both worlds!
36
Storage Bricks
• Group of hard disks inside a sealed box
• Includes spare disks
• Typically RAID 5
• When one disk fails, one of the spares is put to use
• When you’re out of spares…
• Sun seems to have originated this idea
37
Data Security
38
Backups
• A properly administered backup system and schedule is a
must.
• How often should you back up? More frequently than the
amount of elapsed time it takes you to acquire an amount of
data that you can’t afford to lose.
• Backup schedules – full and incremental
– Example backup schedule
• 1st Sunday of month: full backup
• Incremental backups from Sunday on Monday, from
Monday on Tuesday, from Tuesday on Wednesday,
from Wednesday on Thursday, from Thursday on Friday,
from Friday on Saturday
• Incremental from first Sunday on second Sunday
• Full backups, stored offsite, every six months
• RAID disk enhances reliability of storage, but it’s not a
substitute for backups
39
Backup
• Office automated backup systems provide backup
against system crashes & viruses. Cost - $100 to $500
or more depending upon capacity
• Portable backup for Laptops – 5 GB hardcards ~$350
• Some backup systems
– OmniBack (www.hp.com)
– Legato (www.legato.com)
– Tivoli (IBM)
– For single PCs backup to CD-RW is a real option now!
40
Disaster recovery
• If your data is too important to lose, then it’s too
important to have in just one copy, or have all of the
copies in just one location.
• Natural disasters, human factors (e.g. fire), theft (a
significant portion of laptop thefts have data theft as
their purpose) can all lead to the loss of one copy of
your data. If it’s your only copy… or the only location
where copies are kept…
• Offsite data storage is essential
– Vaulting services
– Remote locations of your business
– Online backup services are now a real option!
41
Data Security
• Some percentage of laptop thefts are intentional and
aimed at stealing data!
• Windows XP Professional
– Encrypting File System (EFS). But if your account is
destroyed or you forget the password...
– Recovery Agent provides a secondary account with
the ability to recover the data
• Other systems provide similar features
• And as before… the 5 GB hardcard can be a real help
42
Legal ramifications
• HIPAA (Health Insurance Portability and Accountability
Act)
– Basically requires that any personally identifiable
health data be kept totally secure
– Good source of information: http://www.hipaa.org/
• FDA 21 CFR Part 11
– Basically requires that any data used in drug
development have a full audit trail
– Good source of information:
http://www.21cfrpart11.com/
Getting rid of data (with
certainty!)
• Deleting files is not enough!
• Wiping Utilities
– Symantec Ghost's gdisk utility (used in combination
with the "/diskwipe /dod" flags)
(http://enterprisesecurity.symantec.com/products/products.cfm?productID=3)
– Declasfy
(http://www.dmares.com/maresware/df.htm#DECLASFY)
• Hard disk destruction services
– E.g. Webroot ecosafe disk destruction
(http://www.webroot.com/wb/products/ecosafe/index.php)
43
44
Data Management Strategies
45
Data management strategies
• Flat files
• Spreadsheets and Statistical software
• Relational Databases
• XML
• Specialized scientific data formats
46
Flat files
• Nothing beats an ASCII flat file for simplicity
• ASCII files are not typically used for data storage by
commercial software because proprietary formats can
be accessed more quickly
• If you want a reliable way to store data that you will be
able to retrieve later reliably (media issues
notwithstanding), an ASCII flat file is a good choice.
Data Management Strategies:
Flat files, II
• If you use an ASCII flat file for simple long-term
storage, be sure that:
– The file name is self-explanatory
– There is no information embedded in the file name
that is not also embedded in the file
– Each individual data file includes a complete data
dictionary, explanation of the instrument model and
experimental conditions, and explanation of the
fields
– Lay the data out in accordance with First, Second,
and Third Normal Forms as much as is possible
(more on these terms later)
47
48
Data dictionary
• Definition from webopedia.com:
– In database management systems, a file that defines
the basic organization of a database. A data
dictionary contains a list of all files in the database,
the number of records in each file, and the names
and types of each field. …
• More generally:
– A data dictionary is what you (or someone else) will
need to make sense of the data more than a few days
after the experiment is run
Spreadsheet Software as a data
management tool
• Microsoft’s Excel may suffice for many data
management needs (it is NOT FDA 21 CFR Part 11
compliant!)
• If any given data set can be described in a 2D
spreadsheet with up to hundreds of rows and columns,
and if there is relatively little need to work across data
sets, then Excel might do the trick for you
49
Spreadsheet software as a data
management tool, con’t
• Designed originally to be electronic accountant ledgers
• Feature creep in some ways has helped those who have
moderate amounts of data to manage
• There are several options, including Open Source
products such as Gnumeric and nearly open source
products such as StarOffice (see www.openoffice.org)
• Since MS Excel is the most commonly used spreadsheet
package, this discussion will focus on MS Excel
50
51
The MS Excel Data menu
• Sort: Ascending or descending sorts on multiple
columns
• Lists: Allow you to specify a list (use only one list per
spreadsheet) and then perform filters, selecting only
those rows that meet certain criteria (probably more useful
for mailing lists than scientific data management)
• Validation: lets you check for typos, data translation
errors, etc. by searching for out-of-bounds data
• Consolidate
• Group and outline
• PivotTable
• Get external data
52
MS Excel Statistics
• Mean, standard deviation, confidence intervals, etc. up
to t-test are available as standard functions within MS
Excel
• One-way ANOVA and more complex statistical routines
are available in the Analysis ToolPak add-in
53
MS Excel Graphics
• Does certain things quite easily
• If it doesn’t do what you want it to do easily – it
probably won’t do it at all
• Constraints on the way data are laid out in the
spreadsheet are often an issue
Statistical Software as a data
management tool
• SPSS and SAS are the two leading packages
• Both have ‘spreadsheet-like’ data entry or editing
interfaces
• Both have been around a long time, and are likely to
remain around for a good while
• Workstation and mainframe versions of both available
54
What’s wrong with this
program?
DATA LIST FILE=sample.dat
/id 1 v1 3 (A) v2 5 v3 7-9 v4 11 v5 13-15
LIST VARIABLES v1 v2 v3
ONEWAY v3 BY v2 (1,3)
REGRESSION
/DEPENDENT=v5
/METHOD=ENTER v3
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
4 m 1 110 1 215
5 m 2 218 2 364
6 f 3 120 1 355
7 m 3 125 1 335
55
56
Better….
DATA LIST FILE=sample.dat
/id 1 gender 3 (A) weight 5 glucose 7-9 bp 11 reactime 13-15
LIST VARIABLES gender weight glucose
ONEWAY glucose BY weight (1,3)
REGRESSION
/DEPENDENT=reactime
/METHOD=ENTER glucose
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
4 m 1 110 1 215
5 m 2 218 2 364
6 f 3 120 1 355
7 m 3 125 1 335
57
Now you have a fighting chance
DATA LIST FILE=sample.dat
/id 1 gender 3 (A) weight 5 glucose 7-9 bp 11 reactime 13-15
VARIABLE LABELS ID 'Subject ID #' GENDER 'Subject Gender'
WEIGHT 'Subject Weight in pounds' GLUCOSE 'Blood glucose level'
BP 'Blood Pressure' REACTIME 'Reaction Time in Minutes'
VALUE LABELS GENDER m ‘Male’ f ‘Female’
LIST VARIABLES gender weight glucose
ONEWAY glucose BY weight (1,3)
REGRESSION
/DEPENDENT=reactime
/METHOD=ENTER glucose
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
.
58
An example SAS program
/* Computer Anxiety in Middle School Children */
/* The following procedure specifies value labels for variables */
PROC FORMAT;
VALUE $sex 'M'='Male'
'F'='Female';
VALUE exp 1='upto 1 year' 2='2-3 yrs' 3='3+ yrs';
VALUE school 1='rural' 2='city' 3='suburban';
DATA anxiety;
INFILE clas;
INPUT ID 1-2 SEX $ 3 (EXP SCHOOL) (1.) (C1-C10) (1.)
(M1-M10) (1.) MATHSCOR 26-27 COMPSCOR 28-29;
FORMAT SEX $SEX.; FORMAT EXP EXP.; FORMAT SCHOOL SCHOOL.;
/* conditional transformation */
IF MATHSCOR=99 THEN MATHSCOR=.;
IF COMPSCOR=99 THEN COMPSCOR=.;
/* Recoding variables. Several items are to be reversed while scoring. */
/* The Likert type questionnaire had a choice range of 1-5 */
C3=6-C3; C5=6-C5; C6=6-C6; C10=6-C10;
M3=6-M3; M7=6-M7; M8=6-M8; M9=6-M9;
COMPOPI = SUM (OF C1-C10) /*FIND SUM OF 10 ITEMS USING SUM FUNCTION */;
MATHATTI = M1+M2+M3+M4+M5+M6+M7+M8+M9+M10 /*ADDING ITEM BY ITEM */;
/* Labeling variables */
LABEL ID='STUDENT IDENTIFICATION' SEX='STUDENT GENDER'
EXP='YRS OF COMP EXPERIENCE' SCHOOL='SCHOOL REPRESENTING'
MATHSCOR='SCORE IN MATHEMATICS' COMPSCOR='SCORE IN COMPUTER SCIENCE'
COMPOPI='TOTAL FOR COMP SURVEY' MATHATTI='TOTAL FOR MATH ATTI SCALE';
59
SAS example, Part 2
/* Printing data set by choosing specific variables */
PROC PRINT;
VAR ID EXP SCHOOL MATHSCOR COMPSCOR COMPOPI MATHATTI;
TITLE 'LISTING OF THE VARIABLES';
/* Creating frequency tables */
PROC FREQ DATA=ANXIETY;
TABLES SEX EXP SCHOOL;
TABLES (EXP SCHOOL)*SEX;
TITLE 'FREQUENCY COUNT';
/* Getting means */
PROC MEANS DATA=ANXIETY;
VAR COMPOPI MATHATTI MATHSCOR COMPSCOR;
TITLE 'DESCRIPTIVE STATISTICS FOR CONTINUOUS VARIABLES';
RUN;
/* Please refer to the following URL for further information */
/* http://www.indiana.edu/~statmath/stat/sas/unix/index.html */
60
An example SPSS program
TITLE 'COMPUTER ANXIETY IN MIDDLE SCHOOL CHILDREN'
DATA LIST FILE=clas.dat
/ID 1-2 SEX 3 (A) EXP 4 SCHOOL 5 C1 TO C10 6-15 M1 TO M10 16-25
MATHSCOR 26-27 COMPSCOR 28-29
MISSING VALUES MATHSCOR COMPSCOR (99)
RECODE C3 C5 C6 C10 M3 M7 M8 M9 (1=5) (2=4) (3=3) (4=2) (5=1)
RECODE SEX ('M'=1) ('F'=2) INTO NSEX /* Changing char var into numeric var
COMPUTE COMPOPI=SUM (C1 TO C10) /* Find sum of 10 items using SUM function
COMPUTE MATHATTI=M1+M2+M3+M4+M5+M6+M7+M8+M9+M10 /* Adding each item
VARIABLE LABELS ID 'STUDENT IDENTIFICATION' SEX 'STUDENT GENDER'
EXP 'YRS OF COMP EXPERIENCE' SCHOOL 'SCHOOL REPRESENTING'
MATHSCOR 'SCORE IN MATHEMATICS' COMPSCOR 'SCORE IN COMPUTER SCIENCE'
COMPOPI 'TOTAL FOR COMP SURVEY' MATHATTI 'TOTAL FOR MATH ATTI SCALE'
61
SPSS Example, Part 2
/*Adding labels
VALUE LABELS SEX 'M' 'MALE' 'F' 'FEMALE'/
EXP 1 'UPTO 1 YR' 2 '2 YEARS' 3 '3 OR MORE'/
SCHOOL 1 'RURAL' 2 'CITY' 3 'SUBURBAN'/
C1 TO C10 1 'STRONGLY DISAGREE' 2 'DISAGREE'
3 'UNDECIDED' 4 'AGREE' 5 'STRONGLY AGREE'/
M1 TO M10 1 'STRONGLY DISAGREE' 2 'DISAGREE'
3 'UNDECIDED' 4 'AGREE' 5 'STRONGLY AGREE'/
NSEX 1 'MALE' 2 'FEMALE'/
PRINT FORMATS COMPOPI MATHATTI (F2.0) /*Specifying the print format
comment Listing variables.
* listing variables.
LIST VARIABLES=SEX EXP SCHOOL MATHSCOR COMPSCOR COMPOPI MATHATTI/
FORMAT=NUMBERED /CASES=10 /* Only the first 10 cases
FREQUENCIES VARIABLES=SEX,EXP,SCHOOL/ /* Creating frequency tables
STATISTICS=ALL
USE ALL.
ANOVA COMPSCOR by EXP(1,3).
FINISH
comment Please refer to the following URL for further information
http://www.indiana.edu/~statmath/stat/spss/unix/index.html.
Keys to using Statistical Software as
a data management tool
• Be sure to make your programs and files self-defining.
Use variable labels and data labels exhaustively.
• Write out ASCII versions of your program files and data
sets.
• Stat packages generally are able to produce platform-independent 'transport' files. Good for transport, but be
wary of them as a long-term archival format
• Statistical software is excellent when your data can be
described well without having to use relational database
techniques. If you can describe the data items as a very
long vector of numbers, you’re set!
• Statistical software is especially useful when many
transformations or calculations are required - but
beware transforms, calculations, and creation of new
variables interactively!
62
Your own applications in Perl or
C
• Perl
– Practical Extraction and Report Language
– Pathologically Eclectic Rubbish Lister
– It's a bit of both
– Perl is a good way to manipulate small amounts of data
in a prototype setting, but performance in a
production setting will probably seem inadequate
• Use Perl to prototype, but rewrite
the final application in C or C++
63
64
LIMS systems
• The opposite of data reduction….
• Developed for petrochemical and pharmaceutical
applications
– Highly repetitive tests
– Regular comparisons with standards
– Legal compliance issues are often involved
• If you need a LIMS system, a good rule of thumb is a 10X
expansion of storage needs
• Assume a LIMS system will require at least 0.5 FTE
dedicated staff for a lab or lab group
65
LIMS systems, con’t
• Sapphire (Made by LabVantage). http://labvantage.com/
– One of the standard large LIMS
– Very good on regulatory compliance
• Nautilus (Made by Thermo Electron Corp.
http://www.thermo.com/com/cda/product/detail/1,1055,10380,00.html)
– Good LIMS system, perhaps the best of the easier
LIMS to use
• Good source of review information: LIMSource
http://www.limsource.com/home.html
66
Laboratory Electronic Notebook
http://www.csm.ornl.gov/enote/
• Intuitively similar function –
computerizing lab processes
• The concept is that a LEN
should be less constraining
than a LIMS
• Results thus far are mixed
• Two example systems
– Tripos Electronic Notebook
http://www.tripos.com/sciTech/enterpriseInfo/opInfoTech/ten.html
– DOE 2000 Electronic Notebook
http://www.csm.ornl.gov/enote/
67
Database Definitions
• Database management system: A collection of programs
that enables you to store, modify, and extract
information from a database.
• Types of DBMSs: relational, network, flat, and
hierarchical.
• If you need a DBMS, you need a relational DBMS
• Query: a request to extract data from a database, e.g.:
– SELECT ALL WHERE NAME = "JONES" AND AGE > 21
• SQL (structured query language) – the standard query
language
68
Relational Databases*
• Relational Database theory developed at IBM by E.F.
Codd (1969)
• Codd's Twelve Rules – the key to relational databases
but also good guides to data management generally.
• Codd’s work is available in several venues, most
extensively as a book. The number of rules has now
expanded to over 300, but we will start with rules 1-12
and the 0th rule.
• 0th rule: A relational database management system
(DBMS) must manage its stored data using only its
relational capabilities.
• *Based on Tore Bostrup. www.fifteenseconds.com
69
Codd’s 12 rules
• 1. Information Rule. All information in the database
should be represented in one and only one way -- as
values in a table.
• 2. Guaranteed Access Rule. Each and every datum
(atomic value) is guaranteed to be logically accessible by
resorting to a combination of table name, primary key
value, and column name.
• 3. Systematic Treatment of Null Values. Null values
(distinct from empty character string or a string of blank
characters and distinct from zero or any other number)
are supported in the fully relational DBMS for
representing missing information in a systematic way,
independent of data type.
70
Codd’s 12 rules, con’t
• 4. Dynamic Online Catalog Based on the Relational
Model. The database description is represented at the
logical level in the same way as ordinary data, so
authorized users can apply the same relational language
to its interrogation as they apply to regular data.
71
Codd’s 12 rules, con’t
• 5. Comprehensive Data Sublanguage Rule. A relational
system may support several languages and various
modes of terminal use. However, there must be at least
one language whose statements are expressible, per
some well-defined syntax, as character strings and
whose ability to support all of the following is
comprehensive:
– data definition
– view definition
– data manipulation (interactive and by program)
– integrity constraints
– authorization
– transaction boundaries (begin, commit, and rollback).
72
Codd’s 12 rules, con’t
• 6. View Updating Rule. All views that are theoretically
updateable are also updateable by the system.
• 7. High-Level Insert, Update, and Delete. The capability
of handling a base relation or a derived relation as a
single operand applies not only to the retrieval of data,
but also to the insertion, update, and deletion of data.
• 8. Physical Data Independence. Application programs
and terminal activities remain logically unimpaired
whenever any changes are made in either storage
representation or access methods.
73
Codd’s 12 rules, con’t
• 9. Logical Data Independence. Application programs and
terminal activities remain logically unimpaired when
information preserving changes of any kind that
theoretically permit unimpairment are made to the base
tables.
• 10. Integrity Independence. Integrity constraints specific
to a particular relational database must be definable in
the relational data sublanguage and storable in the
catalog, not in the application programs.
74
Codd’s 12 rules, con’t
• 11. Distribution Independence. The data manipulation
sublanguage of a relational DBMS must enable
application programs and terminal activities to remain
logically unimpaired whether and whenever data are
physically centralized or distributed.
• 12. Nonsubversion Rule. If a relational system has or
supports a low-level (single-record-at-a-time)
language, that low-level language cannot be used to
subvert or bypass the integrity rules or constraints
expressed in the higher-level (multiple-records-at-a-time) relational language.
The problem with (some) DBMS
computer science
• Database theory is wonderful stuff
• It is sometimes possible to get so caught up in the
theory of how you would do something that the
practical matters of actually doing it go by the wayside
• This is particularly true of the concept of “normal forms”
– only three of which we will cover
75
76
Some terminology
Formal Name    Common Name    Also known as
Relation       Table          Entity
Tuple          Row            Record
Attribute      Column         Field
A key is a field that *could* serve as a unique identifier of
records. The Primary key is the one field chosen to be the
unique identifier of records.
77
First Normal Form
• Reduce entities to first normal form (1NF) by
removing repeating or multivalued attributes to
another, child entity.
Unnormalized (repeating attributes):

Specimen #   Measurement #1   Measurement #2   Measurement #3
14           35               43               38

1NF (measurements moved to a child table):

Specimen #   Measurement #   Value
14           1               35
14           2               43
14           3               38
38
78
Second Normal Form
• Reduce first normal form entities to second normal form
(2NF) by removing attributes that are not dependent on
the whole primary key.
1NF (species repeated for every measurement, but dependent only on
Specimen #, not on the whole primary key):

Specimen #   Measurement #   Value   Species
14           1               35      M. musculus
14           2               43      M. musculus
16           3               38      R. norvegicus

2NF (species moved to a Specimens table):

Specimen #   Measurement #   Value
14           1               35
14           2               43
16           3               38

Specimen #   Species
14           M. musculus
16           R. norvegicus
79
Third Normal form
• Reduce second normal form entities to third normal
form (3NF) by removing attributes that depend on
other, nonkey attributes (other than alternative
keys).
• It may at times be beneficial to stop at 2NF for
performance reasons!
2NF (O2 consumption per gram depends on two other nonkey attributes):

Specimen #   Measurement #   O2 consumption   Mass   O2 consumption per gram
14           1               35               14     2.50
14           2               43               15     2.87
16           3               85               28     3.04

3NF (the derived column is dropped and computed when needed):

Specimen #   Measurement #   O2 consumption   Mass
14           1               35               14
14           2               43               15
16           3               85               28
80
On to database products
• Microsoft Access – Common, relatively inexpensive,
moderately scalable. O.k. for personal use
• Microsoft SQL Server – More scalable – commonly used
for departmental (or larger) databases
• Oracle – Common, relatively more expensive, extremely
robust and scalable
• DB2 – Relatively common, IBM’s commercial database
application
• MySQL – Becoming more common, free, good for
prototyping and small-scale applications
81
MySQL
• Open source database software
• Available for several operating systems
• Downloadable from www.mysql.com
• Excellent for prototyping database applications, and in
many cases plenty for production
Components of MySQL (exemplary of database products generally)
• mysql – executes sql commands
• mysqlaccess – manages users
• mysqladmin – database administration
• mysqld – MySQL server process
• mysqldump – dumps definition and contents of a
database into a file
• mysqlhotcopy – hot backup of a database
• mysqlimport – imports data from other formats
• mysqlshow – shows information about server and
objects
• mysqld_safe – starts and manages mysqld on Unix
82
Database applications and the
web?
• An Open Source option
– MySQL - database
– PHP - web scripting application
– Apache - web server
• Oracle and its web modules
• Stat package and web modules
83
84
XML
• The Extensible Markup Language (XML) is the universal format
for structured documents and data on the Web.
• http://www.w3.org/XML/
• Half of “XML in 10 points”
(http://www.w3.org/XML/1999/XML-in-10-points)
– XML is for structuring data. XML makes it easy for a
computer to generate data, read data, and ensure that the
data structure is unambiguous.
– XML looks a bit like HTML. Like HTML, XML makes use of
tags (words bracketed by '<' and '>') and attributes (of the
form name="value").
– XML is text, but isn't meant to be read.
– XML is verbose by design. (And it’s *really* verbose)
– XML is a family of technologies. (This leads to the
opportunity to create discipline-specific XML templates)
85
XML
• XML really is one of the most important data
presentation technologies to be developed in recent
years
• XML is a meta-markup language
• The development and use of DTDs (document type
definition) is time consuming, critical, and subject to the
usual laws regarding standards
• XML is a way to present data, but not a good way to
organize lots of data
86
Some XML examples
• Chemical Markup Language http://www.xml-cml.org/
• Extensible Data Format
http://xml.gsfc.nasa.gov/XDF/XDF_home.html
• CellML (Cell Markup Language) http://www.cellml.org/
• SBML (Systems Biology Markup Language) www.sbml.org
• MathML www.mathml.org
– (a + b)² (from
www.dessci.com/en/support/tutorials/mathml/gitmml/big_picture.htm)
<msup>
<mfenced>
<mi>a</mi>
<mo>+</mo>
<mi>b</mi>
</mfenced>
<mn>2</mn>
</msup>
87
XML issues
• Great technology
• Good commercial authoring systems available or in
development
• The problem with standards….
• Perhaps the biggest challenge with XML is that it
is so easy to put together a web site and propose a DTD
as a standard, which makes the creation of real
standards difficult
88
XML vs PDF
• PDF files are essentially universally readable. PDF file
formats give you a picture of what was once data in a
fashion that makes retrieval of the data hard at best.
• XML requires a bit more in terms of software, but
preserves the data as data that others can interact with.
• Utility of XML and PDF interacts with proprietary
concerns, institutional concerns, and community
concerns – which are not always in harmony!
Specialized data storage
formats - HDF
• Hierarchical Data Format (HDF)
• HDF is an open-source effort
• http://hdf.ncsa.uiuc.edu/
• HDF5 is a general purpose library and file format for
storing scientific data.
89
90
HDF, con’t
• HDF5 can store two primary objects: datasets and
groups. A dataset is essentially a multidimensional
array of data elements, and a group is a structure
for organizing objects in an HDF5 file.
• Using these two basic objects, one can create and
store almost any kind of scientific data structure.
• Designed to address the data management needs
of scientists and engineers working in high
performance, data intensive computing
environments.
• HDF5 emphasizes storage and I/O efficiency.
• HDF is nontrivial to implement
• If you need the full capabilities of HDF, there’s nothing like it
91
Free Software Foundation
• Many of the software products mentioned in this talk
(XML, Perl, etc.) are Open Source Software
• The GNU general public license is the standard license
for such software
• Some of the best software for specific scientific
communities is open source (community software)
• There are certain expectations about such software and
how it is used
Data exchange among
heterogeneous formats
• I have data files in SAS, SPSS, Excel, and Access formats.
What do I do?
• Each of the more widely used stat packages contains
significant utilities for exchanging data. Stata makes a
package called Stat Transfer
• DBMS/Copy (Conceptual Software) probably the best
software for exchange among heterogeneous formats
92
93
Distributed Data
• Data warehouses
• Data federations
• Distributed File Systems
• External data sources
• Data Grids
94
Data warehouses
• In a large organization one might want to ask research
questions of transactional data. And what will the MIS
folks say about this?
• Transactions have to happen now; the analysis does not
necessarily have to.
• Data warehousing is the coordinated, architected, and
periodic copying of data from various sources, both
inside and outside the enterprise, into an environment
optimized for analytic and informational processing
(Definition from “Data warehousing for dummies” by
Alan R. Simon)
Getting something out of the
data warehouse
• Querying and reporting: tell me what’s what
• OLAP (On-Line Analytical Processing): do some analysis
and tell me what’s up, and maybe test some hypotheses
• Data mining: Atheoretic. Give me some obscure
information about the underlying structure of the data
• EIS (Executive Information Systems): boil it down real
simple for me
95
96
More Buzzwords
• Data Mart: Like a data warehouse, but perhaps more
focused. [Term often used by the newly renamed Data
Mart team after a Data Warehouse fiasco]
• Operational Data Store: Like a data warehouse, but the
data are always current (or almost). [Day traders]
97
Distributed File Systems OpenAFS
• The file system formerly known as Andrew File System –
Widely used among physicists
• AFS is a distributed filesystem product, pioneered at Carnegie
Mellon University and supported and developed as a product
by Transarc Corporation (now IBM Pittsburgh Labs). It offers a
client-server architecture for file sharing, providing location
independence, scalability and transparent migration
capabilities for data.
• The only show in town for simple data distribution without
going to experimental computer science projects or fairly
involved commercial products
98
AFS Structure
• AFS operates on the basis of “cells”
• Each cell depends upon a cell
server that creates the root level
directory for that cell
• Other network-attached devices
can attach themselves into the AFS
cell directory structure
• Moving data from one place to
another then becomes just like a
file operation except that it is
mediated by the network
• Requires installation of client
software (available for most Unix
flavors and Windows)
[Diagram: a root cell server at the top, with clients, departmental
servers, and researchers' machines attached beneath it]
99
Grids
• What’s a grid? Hottest current buzzword
• A way to link together disparate, geographically distributed
computing resources to create a meta-computing facility
• The term ‘computing grid’ was coined in analogy to the
electrical power grid
• Three types of grids:
– Compute
– Collaborative
– Data
100
Compute Grids
• Compute grids tie together disparate computing
facilities to create a metacomputer.
• Supercomputers: Globus is an experimental system that
historically focuses on tying together supercomputers
• PCs:
– Entropia is a commercial product that aims to tie
together multiple PCs
– SETI@Home
101
Collaboration Grids
• http://www-fp.mcs.anl.gov/fl/accessgrid/
102
Data Grids - Example
• Tier 0: CERN
• Tier 1: A national
center
• Tier 2: a center
covering one region
of a large country
• Tier 3: workgroup
server
• Tier 4: the (thousands
of) desktops
103
Example Data Grids
• GriPhyN (Grid Physics Network) –
The key problem: too much data
(PB per year)
• Biomedical data
• Globus – beginning to integrate
data grid functionality
• Avaki – commercial data grid
product
• Data Grids “virtualize” data
locality
• Chemistry example – Reciprocal
Net
(http://www.reciprocalnet.org/)
104
Layered Grid Architecture
(By Analogy to Internet Architecture)
[Diagram: the grid protocol stack shown beside the Internet protocol
stack (Application, Transport, Internet, Link)]
• Application
• Collective – “Coordinating multiple resources”: ubiquitous
infrastructure services, app-specific distributed services
• Resource – “Sharing single resources”: negotiating access,
controlling use
• Connectivity – “Talking to things”: communication (Internet
protocols) & security
• Fabric – “Controlling things locally”: access to, & control of,
resources
June 5, 2002. Introduction to Grid Computing.
http://www.globus.org/about/events/US_tutorial/slides/index.html
105
Example:
Data Grid Architecture
[Diagram: data grid layers]
• App: Discipline-Specific Data Grid Application
• Collective (App): coherency control, replica selection, task
management, virtual data catalog, virtual data code catalog, …
• Collective (Generic): replica catalog, replica management,
co-allocation, certificate authorities, metadata catalogs, …
• Resource: access to data, access to computers, access to network
performance data, …
• Connectivity: communication, service discovery (DNS),
authentication, authorization, delegation
• Fabric: storage systems, clusters, networks, network caches, …
June 5, 2002. Introduction to Grid Computing.
http://www.globus.org/about/events/US_tutorial/slides/index.html
106
Web-accessible databases
• Especially prominent in biomedical sciences. E.g. NCBI:
• Entrez http://www.ncbi.nlm.nih.gov/entrez/
• Pubmed
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
– Provides access to over 11 million MEDLINE citations
• Nucleotide
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
– collection of sequences from several sources, including GenBank,
RefSeq, and PDB.
• Protein
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
• Genome
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
– The whole genomes of over 800 organisms.
107
Federated databases
• A federation of databases is a group of databases that
are tied together in some reasonable way permitting
data retrieval (generally) and sometimes (maybe in the
future) data writing
• Benefits of federated approach:
– Local access control. Lets data owner control access
– Acknowledges multiple sources of data
– By focusing on the edges of contact, should be more
flexible over the long run
• Shortcomings: Right now, significant hand work in
constructing such systems
• Example product: IBM’s DiscoveryLink
108
The idealized view of DiscoveryLink
Architecture
[Diagram: DiscoveryLink (DL) mediating access to lab results,
clinical data, and toxicity data sources]
109
110
Microarray Data Portal
• Web application and database designed for annotation and analysis
of microarray experiments.
• Annotation: Designed so users set up the experimental design first,
minimizing the time needed for sample entry while still getting in the
essential info
• Analysis
– Allows user to partition data into groups based on their
annotation.
– Extensive filtering, search, and display options
– T-test, Clustering, SVD, etc.
– Allows different views of data based on informatics
associated with the genes (e.g. KEGG, GO, Chromosome
Location)
Annotation
111
KEGG pathway information
112
Online Biological Data Retrieval
• Web queries used to quickly identify SNPs and Genes in
specific regions and return information about those
identified SNPs and Genes.
• Used by the Hereditary Diseases and Family Studies
Division of the Medical and Molecular Genetics
Department of the Indiana University School of
Medicine.
• Live demo (hopefully)
http://www.medgen.iupui.edu/binf/cgiproto.html
• Marker1: D5S2057
• Marker2: D5S436
• Filter on tissue expression: Muscle
• < 60 seconds vs 10 hours
113
114
115
A commercial data grid: Avaki
• www.avaki.com
• Provides a set of
features that are
similar to other
data grid projects
described
• Provides excellent
security
• Economies of scale
• Becoming widely
used in life
sciences
Real-time data reduction as a
critical strategy
• Data: bits and bytes
• Information: that which reduces uncertainty (Claude
Shannon). Literally …the difference between two forms
of organization or between two states of uncertainty
before and after a message has been received, but also
the degree to which one variable of a system depends
on or is constrained by another. (from
http://pespmc1.vub.ac.be/ASC/INFORMATION.html)
• In other words, if there is no realistic circumstance in
which you would take an action based on or influenced
by a certain number, then that number is data, not
information
• We collect a lot more data than we do information!
116
117
Real-time data reduction
• Given that we collect much more data than
information, what do we do?
• If we can identify something as reliably just
data, and definitely not possibly information,
why keep it?
• In some cases of instruments that produce data
continually, a PC dedicated to on-the-fly data
reduction can drastically reduce data storage
requirements
Knowledge management,
searchers, and controlled
vocabularies
• A tremendous amount of effort has gone in to natural
language processing, AI, knowledge discovery, etc. with
results ranging from mixed to disappointing.
• If you want to be able to search large volumes of data on
an ad-hoc basis, then controlled vocabularies are
essential. Results here are mixed as well, but at least the
problems are sociological, not technological.
• Examples:
– GO (Gene Ontology) Gene Ontology Consortium,
http://www.geneontology.org/
– MeSH (Medical Subject Headings)
http://www.nlm.nih.gov/mesh/meshhome.html
118
119
Visualization
• The days when you could take a stack of greenbar down
to your favorite bar, page through the output, and
understand your data are gone.
• Data visualization is becoming the only means by which
we can have any hope of understanding the data we are
producing
• A single gene expression chip can produce more pixels
of data than the human eye & mind together are capable
of processing
120
Gene expression chips
http://www.microarrays.org/
121
Visualization Options
• For 2D: your monitor and some software!
• 2D commercial software
• 2D Open source: OpenDX http://www.opendx.org/
http://www.research.ibm.com/dx/imageGallery/
122
Large-scale 3D systems
• CAVE™ - Cave Automatic
Virtual Environment
– Anything *but* automatic
– Best immersive 3D
technology available
• Immersadesk
– Furniture-scale 3-D
environment
– Easier to program than
CAVE
– Immersive 3D feel not as
good as CAVE, but less
expensive
123
A Lab-scale 3D system – the John-E-Box™
Commercially available from CAE-Net, Inc. http://www.cae-net.com/
Hierarchical Storage
Management Systems
• Differential cost of media
– RAM: $60-$100/MB
– RAID: $4-$10/MB
– CD: ~$1/MB (readers included)
– Tape: $0.05-$1/MB
• Differential read rates and access times:
– Disk: 1 GB/sec; 9-20 ms access time
– Tape: 200 MB/sec; <1 min (autoloader)
124
Hierarchical Storage
Management
• The objective of an HSM is to optimize the distribution
of data between disk and tape so as to store extremely
large amounts of data at reasonably economical costs
while keeping track of everything
• Most data is read rarely. Tape is cheap. Keep rarely read data
on tape.
• Keep data that is often used on disk.
• Stage data to disk on command for faster access when you
know you’re going to need it later.
• Stage output data to disk.
• Manage data on tape so as to handle security and reliability.
• Metadata system keeps track of what everything is and where
it is!
125
126
HSM products
• EMASS Inc. - AMASS (Archival Management and Storage
System). http://www.emass.com
• Veritas – www.veritas.com
• LSF – Sun Microsystems, Inc.
• HPSS (High Performance Storage System) – a consortium-led
product designed originally for weapons labs and now
marketed by IBM
• Tivoli Storage Manager –
http://www-306.ibm.com/software/tivoli/products/storage-mgr/
HPSS – High Performance
Storage System
• Controlled by a consortium, but produced and released
as a service from IBM (as opposed to a product)
• Designed to meet the needs of some of the most
demanding and security-conscious customers in the
world
• Customers include:
– Lawrence Berkeley Laboratories
– Los Alamos National Laboratories
– Sandia National Laboratories
– San Diego Supercomputer Center
– Indiana University
127
128
More about HPSS
• Requirements
– Absolute reliability of data in all forms (reliably read whenever
authorized person wants, and reliably not available to anyone
unauthorized)
– High capacity
– Speed
– Fault detection/Correction
• Components
– Name Server (NS) – translates standard file names and paths into
HPSS object identifier
– Bitfile Server (BFS) – provides logical bitfiles to clients
– Storage Server (SS) – manages relationship between logical files
and physical files
– Physical Volume Library (PVL) – maps logical volumes to physical
cartridges. Issues commands to PVR
– Physical Volume Repository – mounts and dismounts cartridges
– Mover (MVR) – transfers data from a source to a sink
129
130
The future of storage
• “In-place” increases in density
• New technologies:
– WORM Optical Storage & holographics
– Millipedes
– Non-corrosive metal
131
Holographic storage
• Based on 3-D rather than
2-D data storage
• Constantly going to
revolutionize storage RSN
(Real Soon Now)
• Significant problems with
media stability
• Holographic WORM (Write
Once Read Many)
technologies may
someday deliver
phenomenally dense
storage
Image © IBM may not be used
without permission
132
Millipede Storage
• Based on atomic force
microscopy (AFM): tiny
depressions melted by an
AFM tip into a polymer
medium represent stored
data bits that can then be
read by the same tip.
• Thermomechanical storage
is capable of achieving data
densities in the hundreds of
Gb/in² range
• Current best – 20 to 100
Gb/in²
• Expected limits for
magnetic recording (60–70
Gb/in²).
www.zurich.ibm.com/st/storage/millipede.html
133
Millipede Storage, Part 2
• Read/Write rate of
individual probe is
limited
• The Read/Write head
consists of ~1,000
individual probes that
read in parallel
www.zurich.ibm.com/st/storage/millipede.html
Storage of text on nonreactive
metal disks
• All of the commonly used storage media depend upon
arbitrary standards and are fragile
• If you have data that you really want to keep secure for
a long time, why not write it as text on non-corrosive
metal disks?
134
135
Future of computing
• The PC market will continue to be driven largely by
home uses (esp games)
• In scientific data management, the utility of computing
systems will be less determined by chip speeds and
more by memory and disk configurations, and internal
and external bandwidth
• And the future is uncertain!
– If you can see clearly what your storage requirements are
25 years into the future, and they are large scale and
significant, then a tremendous investment based on what’s
available today may be reasonable.
– In any other case, it may be best to take shorter views – 5
to perhaps 10 years, and build into your thinking the
constant need to refresh
136
The ongoing challenge
• One of the key problems in data storage is that you can’t just
store it. Data stored and left alone is unlikely under most
circumstances to be readable – and less likely to be
comprehensible and useable – in 20 years. The problem, of
course, is that there is an ever increasing need for
tremendous longevity in the utility of data. Because of this it
is essential that data receive ongoing curation, and migration
from older media and devices to newer media and devices.
Only in this way can data remain useful year after year.
137
A few pointers to references
• Statistical software: tutorials on
www.indiana.edu/~statmath
• Alan R. Simon. Data warehousing for Dummies. 1997.
IDG Books
• E.R. Harold & W. Scott Means. 2001. XML in a Nutshell.
O’Reilly
• A. Khurshudov. 2001. The essential guide to computer
data storage. Prentice Hall
• A. Barrows. 2001. Access 2002 for Dummies. IDG
• G.M. Nielson, H. Hagen, H. Mueller. 1997. Scientific
Visualization. IEEE Computer Society
• C. Gibas & P. Jambeck. 2001. Developing bioinformatics
computer skills. O’Reilly