Establishing a Mechanism for Maintaining File Integrity within the Data Archive

Thomas C. Stein, Edward A. Guinness, Susan H. Slavney
Planetary Data System Geosciences Node
Washington University, St. Louis, Missouri

Background
• Must develop and maintain a long-term (100-year) archive
  – For NASA's past, present, and future orbital and landed missions to Mars, Venus, and the Moon
• Current repository
  – 15 missions
  – 143 data sets (half of them active in 2004)
  – 9 TB of data products

Background
[Chart: PDS Geosciences Node cumulative data holdings in TB, FY01–FY11, log scale from 1 to 1000 TB]

Background
[Chart: Data sets actively archived per year, 1988–2004, y-axis 0–80]

Early Archives
• Data sets archived at the end of a mission
• Archive small enough to be published on CD-ROM
• Copies sent to hundreds of science users

Current Archives
• Data sets archived throughout the active mission, with releases at 3- to 6-month intervals
• Active missions revise and redeliver data sets during their lifetimes
• Archive stored on a SAN at the Geosciences Node
• Work with multiple active missions at one time
  – Active: Mars Global Surveyor, Mars Odyssey, Mars Exploration Rovers, Mars Express
  – Planning: Mars Reconnaissance Orbiter, MESSENGER, Phoenix, Mars Science Laboratory, Lunar Reconnaissance Orbiter

Issue
• How to ensure that archived data do not change over time; that is, how to maintain file integrity

Uses
• Tracking and validating electronic deliveries from data suppliers
• Protecting the online repository against file loss or corruption
• Providing end users with a means to validate electronically downloaded data

Requirements
• Maintain an inventory
  – Data products, product labels, ancillary files
  – Product type, format, file location, size, integrity "key"
• Maintain state information
  – Backup status
  – Data access permissions

Requirements
• Check in new or updated data
  – Multiple deliveries may be made by an instrument team during the life of a mission
  – Verify that all files in a delivery are present
  – Verify that the contents of the files match what was sent by the data provider

Requirements
• Maintain the integrity of the online repository
  – Automated checks
    • Manifest (filename, location, and size)
    • Content, via integrity key
    • Access permissions
  – Partial integrity checks may be run manually
  – A report of results is produced automatically

Requirements
• Track backups of data
  – Scheduled backups
  – Onsite and offsite copies
  – Simulated restores with integrity checks

Requirements
• Provide the location and validation method of data on demand
  – URL and local file system location of a file
  – Integrity key
  – Multiple files and keys should be packaged "on the fly" into standard formats (e.g., tar and zip)

Potential Methods
• File-to-file (byte-by-byte) comparison
• File size
• Bit counts (simple parity)
• File checksums

Selecting a Method
• Availability
  – Freely available source code
  – Usable on multiple platforms
• Calculation speed
  – Important for data set ingestion and verification
• Size of key
  – Minimize the amount of information required to compare files
• Ease of use
  – Data providers and end users should not need special knowledge
• Accuracy
  – Needs to work every time

Selecting a Method

  Method         Availability  Calc. speed  Key size   Ease of use  Accuracy
  File to file   yes           variable     < 2 TB     easy         excellent
  File size      yes           fast         64 bits    easy         fair
  Bit counts     yes           very fast    64 bits    easy         poor
  MD5            yes           very fast    128 bits   easy         excellent
  RIPEMD-160     yes           very fast    160 bits   easy         excellent
  RIPEMD-320     yes           fast         320 bits   easy         excellent
  SHA-1          yes           very fast    160 bits   easy         excellent
  SHA-512        yes           fast         512 bits   easy         excellent

• Note: The National Institute of Standards and Technology has not approved MD5 as a secure hashing algorithm.
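Every hash in the comparison above is available in standard libraries. As an illustrative sketch of how a per-file integrity key could be computed (the function name, chunk size, and chunked-read approach are our own choices, not part of the PDS tooling), using Python's `hashlib`:

```python
import hashlib

def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hex digest of a file, read in 1 MB chunks so multi-gigabyte
    data products never have to fit in memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Digest sizes for three of the candidate algorithms (bits = 8 * digest bytes):
for name in ("md5", "sha1", "sha512"):
    print(name, hashlib.new(name).digest_size * 8, "bits")
# → md5 128 bits, sha1 160 bits, sha512 512 bits
```

The chunked read matters at archive scale: the Mars Express test case below includes 44 files over 1 GB, which should be hashed incrementally rather than loaded whole.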
Solution
• Use SHA-1 checksums at the file level to provide digital signatures

Approach
• Data flow model involves three phases
  – Data delivery from the science team
  – Data ingestion
  – Data integrity

[Diagram: Data flow model. Initial delivery: the data provider supplies a manifest, ancillary files, data products, and digital signatures. Data ingestion phase: update manifest, manifest check, digital signature check, PDS standards check, and creation of additional ancillary files before the data enter the online archive. Data integrity phase (ongoing): the online archive and offline backup undergo manifest checks, signature checks, and random test restores with signature checks, with results written to a report.]

Test Case – Mars Express
• Using checksums to validate the Mars Express HRSC data set transfer
• 320 GB in 11,280 files
  – 772 files > 100 MB
  – 44 files > 1 GB
• 11 hours to create checksums; 3 seconds to compare checksum lists

Test Case – Mars Exploration Rovers
• Making checksums available to end users within the MER Analyst's Notebook
  – New feature
  – MD5 and SHA-1 checksums available
• 3 TB in 2.6 million files
  – 160 files > 100 MB
• Checksums created in 3 days

Future Work
• In-house
  – Integrate checksums into existing validation tools
  – Develop data integrity tools to carry out regular checks of the archive
• External
  – Work with science teams to incorporate checksums as part of their data delivery
  – Educate end users regarding checksums

Provenance
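The "manifest check" and "digital signature check" steps of the ingestion phase could be sketched as below. This is a hypothetical illustration, not the actual PDS tooling: the two-column "digest  relative/path" manifest format, the function name, and the return convention are all assumptions.

```python
import hashlib
from pathlib import Path

def verify_manifest(manifest_path, root):
    """Check each 'sha1_digest  relative/path' manifest line against the
    files under root. Returns (missing, mismatched); both lists empty
    means the delivery matches what the data provider sent."""
    missing, mismatched = [], []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(None, 1)
        name = name.strip()
        target = Path(root) / name
        if not target.exists():
            missing.append(name)
            continue
        h = hashlib.sha1()
        with open(target, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != expected.lower():
            mismatched.append(name)
    return missing, mismatched
```

A check like this serves both ends of the pipeline described above: run at ingestion it validates a team's delivery, and run periodically against the online repository it implements the ongoing data integrity phase. The 3-second comparison of checksum lists in the Mars Express test case reflects the same idea: once digests exist, verification reduces to comparing short strings.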