Co-funded by the European Union under FP7-ICT-2009-6

Survey on Italian Preservation Repositories
Silvio Salza, salza@dis.uniroma1.it
CINI-Università di Roma “La Sapienza”
Storage Solution Webinar, April 14th 2014
Co-ordinated by aparsen.eu #APARSEN

The APARSEN WP23 questionnaire
• A questionnaire on storage solutions and scalability was prepared and distributed as part of APARSEN WP23
• The questionnaire focused on:
- Profile of the repositories (mission, volumes, type of objects)
- Storage management policy
- Organization of the storage system
- Cost (TCO) and quality assessment
• CINI designed the questionnaire and was in charge of the Italian survey
• 8 large repositories in different areas were surveyed

Italian regulations on digital preservation
• The organization of Italian repositories is mostly driven by national regulations
• The regulations were issued in a 2001 bill and later updated in 2010-14
• They have been mandatory for Public Administrations since 2001
• Private companies must comply as well for some types of records: health-care records, fiscal records, e-invoices etc.
Quite often the focus is just on complying with the regulations: the design of the repository and the quality of the preservation process are not given sufficient attention

Profile of the surveyed repositories
[Charts: Mission (Cultural Heritage, e-Gov, Other); Number of Digital Objects; Yearly increase (< 10%, 20% - 100%, > 100%)]
• Most repositories have been active for less than 5 years
• High yearly growth rate (average 100%)
• Generally a single type of digital object is preserved
• Access is granted to registered users only

Storage management policy
• Only 50% of the repositories have a formally declared storage management policy (the ones in the e-gov area)
• None provided a link to a public policy document
• Crucial issues:
- Regular integrity checks (always specified)
- Backup interval (always specified)
- Data recovery workflow (specified in one case only)
The storage management policy should always be formally declared and possibly made public

Three-level storage organization
[Diagram: Access (mirrors the core level and protects it from external accesses), Preservation (core level), Backup (periodical dumps of the core level)]
Most repositories (but not all) declared a three-level storage organization:
- Preservation: core level devoted to preservation
- Access: front-end level to support external access
- Backup: back-end level for periodic dumps
Storage implementation
[Diagram: storage technologies used at each level (Access, Preservation, Backup); labels shown: RAID5, RAID1, None, HD, WORM, Tape-DVD]
• In 3 cases there was no separate access level
• One repository claimed it was unnecessary since it was using a WORM device (EMC² Centera) for the core level
• Two others claimed that RAID at the core level provided enough redundancy and that the file system provided write protection
• Backups are typically made on a weekly basis

About storage media and systems
• Tape cartridges are OK for backup
• DVD and other consumer-level optical media should be avoided as too risky, but are still used in small repositories
• RAID replication at the core level is not equivalent to having a separate level for access (this would be a separate device)
• Using a single level of WORM devices, despite their quality, has some serious drawbacks:
- These devices typically rely on proprietary firmware
- Data can’t be read without the intermediation of the firmware
- Replication is still limited to a single device

Local versus geographical replication
• Replication is the key element to achieve reliability
• Different levels of replication:
- Device: within a given device (e.g. RAID5)
- Local: locally but involving different devices
- Geographical: replicated data kept in different locations
• Local (and device) replication is vulnerable to catastrophic (but not unlikely) events: flood, fire, earthquake
• Reliability of RAID systems assumes that faults of different devices are statistically independent (a tricky assumption!)
• If the room where the devices are kept is flooded, all of them will fail

The night of the earthquake
• The University of L’Aquila in Central Italy maintained, and updated daily, a backup copy of its records in a computing center in Bologna (some 300 km away)
• On April 6th 2009 an earthquake destroyed most of the city
• Thanks to the geographical replication, not a single record was lost
[Map: Bologna, L’Aquila]

How reliable is my repository?
• Interviewed repository managers were asked to give some figures to assess the quality of the preservation service:
- Reliability: probability of not losing any data in a given time
- Availability: percentage of time the system can be accessed
- Cost: TCO (Total Cost of Ownership) per TB/year
• Only a few provided answers to these questions
• Only very few answers were credible: one guy claimed his repository had achieved 100% reliability!! Can he fly too?
Inability to provide these figures is a clear indicator of the poor level of the design

What about outsourcing?
• All storage levels in the surveyed repositories were in-house
• But part of the questionnaire dealt with outsourcing options
• Cloud storage was proposed as a main option to provide geographical replication, at least for some storage level
• The result was discouraging: no answer, even when we insisted
• The attitude was like that of children saying: “I won’t eat it, and I won’t even taste it!”
One guy finally claimed that cloud was too expensive and too unreliable. But the same guy was unable to provide any figure for his own reliability and TCO

Conclusion 1: Improve the design process
• A good design should evaluate different alternatives
• Quantitative elements should be used to compare them:
- TCO
- Reliability
- Availability
- Level and type of replication
- Lifespan
• These elements also form the basis for assessing the quality of the preservation service
• The storage management policy should be clearly stated

Conclusion 2: Exploit new opportunities
• Improve reliability by exploiting redundancy
• Geographical redundancy is a key element
• Move at least one level of storage to a remote site: don’t put all your eggs in one basket
• Overcome prejudices about outsourcing:
- Why should in-house systems be better?
- One may get reasonable control of outsourced resources
- Special conditions can be negotiated
• Cloud storage is a great opportunity: it should be carefully considered before being dismissed

aparsen.eu Network of Excellence #APARSEN
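The quantitative comparison called for in the conclusions can be illustrated with a minimal sketch. It is not part of the survey: the model and every number in it are assumptions made for illustration only. Each copy is assumed to fail independently with a yearly probability `p_copy`, while a site-wide event (flood, fire, earthquake) with yearly probability `p_site` destroys all copies kept at that site; repair and re-replication are ignored. The function names `annual_loss_probability` and `tco_per_tb_year` are hypothetical.

```python
from math import prod


def annual_loss_probability(copies_per_site, p_copy, p_site):
    """Yearly probability of losing every copy of the data.

    copies_per_site: number of copies kept at each site, e.g. [3] or [2, 1].
    p_copy: assumed yearly failure probability of a single copy
            (copies fail independently of each other).
    p_site: assumed yearly probability of a disaster that destroys
            all copies at one site (sites are independent).
    """
    if not copies_per_site:
        raise ValueError("at least one site is required")
    # A site loses its data if a disaster hits it, or if no disaster
    # occurs but all of its copies fail on their own.
    per_site = (p_site + (1.0 - p_site) * p_copy ** n for n in copies_per_site)
    # Data is lost only if every site loses its data.
    return prod(per_site)


def tco_per_tb_year(total_cost, capacity_tb, years):
    """Total Cost of Ownership per TB per year."""
    return total_cost / (capacity_tb * years)


# Illustrative (made-up) figures.
P_COPY = 0.05   # assumption: 5% yearly failure probability per copy
P_SITE = 0.001  # assumption: 0.1% yearly probability of a site disaster

local_only = annual_loss_probability([3], P_COPY, P_SITE)     # 3 copies, one room
geographic = annual_loss_probability([2, 1], P_COPY, P_SITE)  # 2 local + 1 remote

print(f"local only : {local_only:.6f}")
print(f"geographic : {geographic:.6f}")
print(f"TCO        : {tco_per_tb_year(120_000, 100, 5):.2f} per TB/year")
```

Under these assumptions the same three copies give a lower loss probability when one of them is moved to a second site, and no number of extra copies in a single room can push the loss probability below `p_site`. This is the "eggs in one basket" point in numerical form; real figures would have to come from the repository's own measurements.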