Digital Records Infrastructure
David Thomas
7 March 2012

What is changing
• Nature of records
• Our understanding of the risk
• Threat profile
• Expectations

Nature of record
Digitised records will replace paper originals
This means that we require:
Higher standard of integrity
Higher standard of preservation

Understanding of the risk
Currently few risks with formats – interglacial period?
Risks before records are transferred (Digital Continuity)
Poor capture
Sensitivity and closure issues

Mixed media collections
Stephen J Gould’s papers at Stanford - 850 boxes of textual material, approximately 450 audiovisual items, and 1,180 computer media files

Understanding of the risk
For good reasons it is not possible to predict the rate of data loss from a storage system
This doesn’t stop manufacturers from making claims:

LTO 1          LTO 2          LTO 3          LTO 4
17 year life   21 year life   30 year life   17 year life

But how many manufacturers give guarantees?

Expectations
Fast access and quick response to FoI enquiries

The greatest threat is volume, volume, volume
• The volumes are now so huge that only powerful automated systems can cope – the days of human intervention are over

Data volumes arriving at TNA 2012 - 2014
Born-Digital
• 2012 Olympics records – estimated at 30 TB
• The results of the 20-Year Rule change 5 TB
• The Government Web Archive 80 TB
• Total 125 TB
• Other material may be coming 2 – 3 terabytes from Hillsborough
Digitised
• Home Guard records 113 TB
• 1939 National Health Register 93 TB
• Digitisation of military service records 181 TB
• Total 387 TB

Data storage 2011 - 2020
[Chart: storage in terabytes (axis 0 to 1400), January 2011 to January 2020]

How do we compare
3 petabytes
3 petabytes
NARA
184 terabytes
190 terabytes
150 billion web pages
1.1 billion web pages

What we need
• Ability to process and store very large volumes of data
• Ability to identify formats of files so appropriate preservation processes can be implemented in the future
• Defence against malware which might damage the system or attack users
• Ability to ensure the integrity of records by using an appropriate cryptographic hash function (MD5 or SHA-2)
• Ability to conduct regular hash checks to determine whether any bits have been lost
• Ideally use two different software systems in case of a catastrophic failure of one
• Ability to handle closed or sensitive records
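
The integrity and regular hash-check requirements in the list above could be sketched roughly as follows. This is a minimal illustration, not a description of the system TNA built: it assumes records sit as ordinary files under one directory, uses SHA-256 (one member of the SHA-2 family) from Python's standard hashlib module, and the function names file_digest, build_manifest and check_fixity are hypothetical.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of one file, reading in chunks so large files are not loaded whole."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, algorithm: str = "sha256") -> dict[str, str]:
    """Record a digest for every file under root, keyed by its path relative to root."""
    return {
        p.relative_to(root).as_posix(): file_digest(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root: Path, manifest: dict[str, str], algorithm: str = "sha256") -> dict[str, list[str]]:
    """Re-hash the files under root and report differences from a stored manifest."""
    current = build_manifest(root, algorithm)
    return {
        # Files listed in the manifest that can no longer be found.
        "missing": sorted(set(manifest) - set(current)),
        # Files present now that were never recorded in the manifest.
        "unexpected": sorted(set(current) - set(manifest)),
        # Files whose digest differs, meaning at least one bit has changed.
        "changed": sorted(k for k in manifest.keys() & current.keys() if manifest[k] != current[k]),
    }
```

In use, build_manifest would be run once when records are ingested and its result stored separately from the records. check_fixity would then be run on a schedule, with any non-empty entry in its report treated as a signal to restore from another copy.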