Establishing a Mechanism for Maintaining File Integrity within the Data Archive

Thomas C. Stein, Edward A. Guinness,
Susan H. Slavney
Planetary Data Systems Geosciences Node
Washington University
St. Louis, Missouri
Background
• Must develop and maintain a long-term
(100-year) archive
– For NASA’s past, present and future
orbital and landed missions to Mars,
Venus, and the Moon
• Current repository
– 15 missions
– 143 data sets (half of them active in 2004)
– 9 TB data products
Background
[Figure: PDS Geosciences Node cumulative data holdings in TB (logarithmic axis, 1 to 1000), FY01 through FY11]
Background
[Figure: Data sets actively archived per year (axis 0 to 80), 1988 through 2004]
Early Archives
• Data sets archived at the end of a
mission
• Archive small enough to be published
on CD-ROM
• Copies sent to hundreds of science
users
Current Archives
• Data sets archived throughout active mission
with releases at 3- to 6-month intervals
• Active missions revise and redeliver data sets
during their lifetime
• Archive stored on SAN at Geosciences Node
• Work with multiple active missions at one
time
– Active: Mars Global Surveyor, Mars Odyssey, Mars
Exploration Rovers, Mars Express
– Planning: Mars Reconnaissance Orbiter,
MESSENGER, Phoenix, Mars Science Laboratory,
Lunar Reconnaissance Orbiter
Issue
• How to ensure that archived data do
not change over time
That is, how to maintain file integrity
Uses
• Tracking and validating electronic
deliveries from data suppliers
• Protecting the on-line repository
against file loss or corruption
• Providing end users with a means to
validate electronically downloaded data
Requirements
• Maintain inventory
– Data products, product labels, ancillary
files
– Product type, format, file location, size,
integrity “key”
• Maintain state information
– Backup status
– Data access permissions
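
One way to picture the inventory and state information listed above is a single record per file. The following is a minimal sketch only: the class name `InventoryRecord`, its field names, and the helper `needs_backup` are hypothetical and not taken from the Geosciences Node's system.

```python
from dataclasses import dataclass


@dataclass
class InventoryRecord:
    """One inventory entry per archived file (hypothetical layout)."""
    location: str            # file location relative to the archive root
    product_type: str        # e.g. "data product", "label", "ancillary"
    file_format: str         # format of the file
    size_bytes: int          # file size
    integrity_key: str       # checksum of the file contents
    backed_up: bool = False  # state: backup status
    access_mode: int = 0o444 # state: data access permissions


def needs_backup(records):
    """Return the locations of all records not yet marked as backed up."""
    return [record.location for record in records if not record.backed_up]
```

Keeping inventory fields and state fields in one record means a single lookup answers both "what should this file look like" and "has it been backed up".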
Requirements
• Check in new or updated data
– Multiple deliveries may be made by
instrument team during life of mission
– Verify that all files in a delivery are present
– Verify that contents of the files match what
was sent by data provider
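
The two verification steps above (all files present, contents match what was sent) can be sketched as follows. This is an illustration under assumptions: the manifest is taken to be a text listing of `digest  relative/path` lines in the style of the `sha1sum` tool, SHA-1 is used as the integrity key, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text):
    """Parse 'digest  relative/path' lines into a {path: digest} dict."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        entries[name.strip()] = digest.lower()
    return entries


def check_delivery(root, manifest):
    """Return (missing, altered) file names for a delivery under root."""
    root = Path(root)
    missing, altered = [], []
    for name, expected in sorted(manifest.items()):
        path = root / name
        if not path.is_file():
            missing.append(name)      # listed in the manifest but absent
        elif sha1_of_file(path) != expected:
            altered.append(name)      # present, but contents differ
    return missing, altered
```

A delivery would be accepted only when both returned lists are empty.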
Requirements
• Maintain integrity of the online
repository
– Automated checks
• Manifest (filename, location, and size)
• Content via integrity key
• Access permissions
– Partial integrity checks may be run
manually
– Report of results produced automatically
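
A rough sketch of such an automated check is shown here. It assumes a POSIX file system, uses SHA-1 as the integrity key, and stores the expected state in a plain dictionary; `snapshot` and `audit` are hypothetical names, not existing PDS tools.

```python
import hashlib
import os
import stat
from pathlib import Path


def describe(path):
    """Collect size, SHA-1 digest, and permission bits for one file."""
    info = os.stat(path)
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "size": info.st_size,
        "sha1": digest.hexdigest(),
        "mode": stat.S_IMODE(info.st_mode),
    }


def snapshot(root):
    """Record the current state of every file under root."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): describe(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root, inventory):
    """Compare the repository with a stored snapshot; return report rows."""
    root = Path(root)
    report = []
    for name, expected in sorted(inventory.items()):
        path = root / name
        if not path.is_file():
            report.append((name, "missing"))
            continue
        actual = describe(path)
        for field in ("size", "sha1", "mode"):
            if actual[field] != expected[field]:
                report.append((name, f"{field} changed"))
    return report
```

Running `audit` on a subdirectory with a filtered inventory would correspond to the partial, manually started check; an empty report means nothing has changed.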
Requirements
• Track backups of data
– Scheduled backups
– Onsite and offsite copies
– Simulated restores with integrity checks
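
A simulated restore with an integrity check might look like the sketch below. The `restore` argument is a hypothetical callable that returns a file's bytes from the backup copy; how a real backup system exposes restored data is not described in the slides.

```python
import hashlib
import random


def sample_restore_check(inventory, restore, sample_size, seed=None):
    """Restore a random sample of files and compare SHA-1 digests.

    inventory maps file names to expected SHA-1 hex digests.
    restore(name) is assumed to return the restored file's bytes.
    Returns (names_checked, names_that_failed).
    """
    rng = random.Random(seed)
    count = min(sample_size, len(inventory))
    names = rng.sample(sorted(inventory), count)
    failures = []
    for name in names:
        restored = restore(name)
        if hashlib.sha1(restored).hexdigest() != inventory[name]:
            failures.append(name)
    return names, failures
```

Sampling keeps the test cheap enough to run on a schedule, at the cost of only probabilistic coverage of the backup.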
Requirements
• Provide location and validation method
of data on demand
– URL and local file system location of a file
– Integrity key
– Multiple files and keys should be
packaged “on the fly” into standard
formats (e.g., tar and zip)
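
Packaging files together with their keys can be sketched with Python's standard `tarfile` module. The function name `package_with_keys` and the listing name `SHA1SUMS` are assumptions made for illustration, and each file is read fully into memory, which would not suit multi-gigabyte products.

```python
import hashlib
import io
import tarfile
from pathlib import Path


def package_with_keys(names, root):
    """Return gzip-compressed tar bytes holding the files plus a key list."""
    root = Path(root)
    buffer = io.BytesIO()
    lines = []
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in names:
            data = (root / name).read_bytes()
            lines.append(f"{hashlib.sha1(data).hexdigest()}  {name}")
            archive.add(root / name, arcname=name)
        # Add the checksum listing as one more member of the archive.
        listing = ("\n".join(lines) + "\n").encode("utf-8")
        info = tarfile.TarInfo(name="SHA1SUMS")
        info.size = len(listing)
        archive.addfile(info, io.BytesIO(listing))
    return buffer.getvalue()
```

Because the listing travels inside the same package, an end user can validate the download without a second request.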
Potential Methods
• File to file (per byte) comparison
• File size
• Bit counts (simple parity)
• File checksums
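
A small illustration of why these methods differ in accuracy: two byte strings with the same length and the same number of set bits look identical to a size check or a parity check, while a checksum tells them apart. SHA-1 stands in here as one example of a checksum; the function name `fingerprints` is hypothetical.

```python
import hashlib


def fingerprints(data: bytes) -> dict:
    """Summarize a byte string with three of the candidate methods."""
    ones = sum(bin(byte).count("1") for byte in data)
    return {
        "size": len(data),                       # file size
        "parity": ones % 2,                      # simple parity of set bits
        "sha1": hashlib.sha1(data).hexdigest(),  # file checksum
    }


original = b"\x01\x02"
corrupted = b"\x02\x01"  # same length, same number of set bits

before, after = fingerprints(original), fingerprints(corrupted)
print(before["size"] == after["size"])      # True: size does not notice
print(before["parity"] == after["parity"])  # True: parity does not notice
print(before["sha1"] == after["sha1"])      # False: the checksum differs
```

A per-byte comparison would also catch the change, but it requires access to both complete files rather than a short key.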
Selecting a Method
• Availability
– Freely available source code
– Use on multiple platforms
• Calculation speed
– Important for data set ingestion and verification
• Size of key
– Minimize amount of information required to
compare files
• Ease of use
– Data providers and end users should not need
special knowledge
• Accuracy
– Needs to work every time
Selecting a Method
Method        Availability  Calculation speed  Checksum size             Ease of use  Accuracy
File to file  yes           variable           < 2 TB                    easy         excellent
File size     yes           fast               64 bits                   easy         fair
Bit counts    yes           very fast          64 bits                   easy         poor
MD5           yes           very fast          128 bits (32 hex chars)   easy         excellent
RIPEMD-160    yes           very fast          160 bits (40 hex chars)   easy         excellent
RIPEMD-320    yes           fast               320 bits (80 hex chars)   easy         excellent
SHA-1         yes           very fast          160 bits (40 hex chars)   easy         excellent
SHA-512       yes           fast               512 bits (128 hex chars)  easy         excellent
• The National Institute of Standards and Technology
has not approved MD5 as a secure hashing algorithm.
Solution
• Use SHA-1 checksums at the file level
to provide digital signatures
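
Computing a file-level SHA-1 checksum needs only the standard library in most languages. A minimal sketch in Python follows; the function name and chunk size are arbitrary choices, not part of the authors' tooling.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file.

    The file is read in chunks so that large data products do not
    have to fit in memory.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is a 40-character hexadecimal string, which is the value that would be stored in the inventory and sent alongside each file.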
Approach
• Data flow model involves three phases
– Data delivery from science team
– Data ingestion
– Data integrity
[Figure: data flow model]
• Data provider: initial delivery and updates
– Manifest
– Ancillary files
– Data products
– Digital signatures
• Data ingestion phase
– Manifest check
– Digital signature check
– PDS standards check
– Create additional ancillary files
• Archive (online)
• Data integrity phase (ongoing)
– Backup (offline)
– Manifest check
– Random test restore
– Signature checks
– Report
Test Case – Mars Express
• Using checksums to validate Mars
Express HRSC data set transfer
• 320 GB in 11280 files
– 772 files > 100 MB
– 44 files > 1 GB
• 11 hours to create checksums; 3
seconds to compare checksum lists
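
The gap between the two timings is expected: creating checksums means reading every byte (320 GB in 11 hours is roughly 8 MB/s), while comparing lists touches only short strings. A sketch of the comparison step, assuming each list has already been loaded into a dictionary of file name to checksum (the function name is hypothetical):

```python
def compare_checksum_lists(sent, received):
    """Compare two {file name: checksum} mappings without reading any files."""
    sent_names, received_names = set(sent), set(received)
    shared = sent_names & received_names
    return {
        "missing": sorted(sent_names - received_names),
        "unexpected": sorted(received_names - sent_names),
        "mismatched": sorted(n for n in shared if sent[n] != received[n]),
    }
```

A transfer is clean when all three lists in the result are empty.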
Test Case – Mars Exploration Rovers
• Making checksums available to end
user within MER Analyst’s Notebook
– New feature
– MD5 and SHA-1 checksums available
• 3 TB in 2.6 million files
– 160 files > 100 MB
• Checksums created in 3 days
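
When both MD5 and SHA-1 values are offered, each file can be read once and fed to both algorithms. The slides do not say how the two checksums were produced, so the single-pass approach below is an assumption, shown only as one reasonable way to do it.

```python
import hashlib


def md5_and_sha1(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha1_hex) for a file, reading it only once."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```

For an archive of millions of small files, avoiding a second pass over the data roughly halves the disk reads.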
Future Work
• In-house
– Integrate checksums into existing
validation tools
– Develop data integrity tools to carry out
regular checks of archive
• External
– Work with science teams to incorporate
checksums as part of their data delivery
– Educate end users regarding checksums