OVERVIEW OF SCAPE LSDRT SCENARIOS AS LISTED ON WIKI
Maureen Pennock
July 4th 2012

LSDRT1 – Assessing preservation risks in large media files (SB)
Dataset: MPEG-2 transport stream with Danish TV broadcast
Issue(s):
  IS3 – Unknown preservation risks in large media files
    SC: Extremely large files; checksumming takes around 3 months
Evaluation: -
Solution: -

LSDRT2 – Validating files migrated from TIFF to JPEG2000 (BL)
Dataset: JISC1 19th Century Digitised newspapers (TIFF, XML METS metadata, sample JPEGs)
Issue(s):
  IS44 – QA of migrated images
    SC: Volume of content; automation at scale
  IS02 – Do acquired files conform to an agreed technical profile, are they valid, are they complete? (TO BE INCLUDED)
    SC: -
Evaluation: -
Solution:
  SO01 – Simple JP2 file structure checker (obsolete)
  SO02 – Jpylyzer JP2 validator & properties extractor – solves 2nd issue (no workflow link)

LSDRT3 – Identification, characterisation & validation of very large image collections (BL)
Dataset:
  (1) JISC1 19th Century Digitised newspapers (TIFF > JPEG)
  (2) TIFF with scanned books
Issue(s):
  (1) IS44 – QA of migrated images
  (1) IS02 – Do acquired files conform to an agreed technical profile, are they valid & are they complete?
    SC: Rate/speed of checking large collection
  (2) IS23 – Validation of TIFF according to institutional collection profile
Evaluation: -
Solution:
  (2) SO01 – Simple JP2 file structure checker provides basic check that a JP2 file is complete (Evaluated, but no workflow link)

LSDRT4 – Out of sync S&V in WMV > Video format x Migration results (SB)
Dataset: WMV with Danish TV broadcasts
Issue(s):
  IS13 – Migration from WMV to Video format x results in out of sync S&V
    SC: Number of files; automation needed to avoid manual check
Evaluation: Objectives – scalability & automation
Solution:
  SO05 – Video Migration & QA (Evaluated, but no workflow link)
  SO02 – xcorrSound QA audio comparison tool (Evaluated with workflow link)

LSDRT5 – Detecting audio files with very bad sound quality (SB)
Dataset: MP3 with Danish Radio broadcasts (200GB sample)
Issue(s):
  IS20 – Detect audio files with very bad sound quality
    SC: Large amount of data (20TB total)
Evaluation: Objectives – scalability (number of files); reliability & precision. Speed not an issue.
Solution: -

LSDRT6 – Large scale migration MP3 > WAV (SB)
Dataset: MP3 with Danish Radio broadcasts (200GB sample)
Issue(s):
  (1) IS21 – Migration of MP3 to WAV
    SC: Large amount of data (20TB total); requires a distributed platform
  (2) IS20 – Detect audio files with very bad sound quality
    SC: Large amount of data (20TB total)
Evaluation:
  (1) Objectives – scalability & robustness. Speed: c. 20 files per hour per node.
  (2) Objectives – scalability (number of files); reliability & precision. Speed not an issue.
Solution:
  (1) SO04 – Audio MP3 to WAV migration & QA workflow (Evaluated and workflow documented)
  (2) SO02 – xcorrSound QA audio comparison tool (included in SO04)
  (2) -

LSDRT7 – Characterise & validate very large video files (SB)
Dataset: MPEG video with Danish TV broadcasts
Issue(s):
  IS22 – Characterise & validate very large MPEG-1 & MPEG-2 files
    SC: Very large files; very large collection
Evaluation: Objectives – robustness & scalability. Speed: 2TB <24 hours.
Solution:
  SO25 – Rosetta v3.0 implementation integrated with DROID6 (untested)

LSDRT9 – Characterisation of large amounts of WAV audio (SB)
Dataset: WAV with Danish Radio broadcasts, ripped audio CDs and SB in-house audio digitisation
Issue(s):
  IS24 – Characterisation of large amounts of WAV audio
    SC: Very large amounts of data; some very large files
Evaluation: Objectives – scalability & functionality
Solution:
  SO06 – Use ffprobe to characterise WAV (no evaluation)

LSDRT10 – Capturing representation information from original image files (BL)
Dataset: Camera raw files
Issue(s):
  IS40 – Complexity of Camera raw files
    SC: Complexity of proprietary formats
Evaluation: -
Solution:
  SO26 – Automated RAW to DNG migration + QA (no workflow, untested)

LSDRT11 – Image based document comparison (ONB)
Dataset:
  (1) Austrian National Library – Digital Book collection
  (2) IDP samples from BL
Issue(s):
  (1) IS27 – Quality assurance in re-downloaded workflows of audio books
    SC: Very large collection; speed for compressing each file becomes an issue at scale.
  (2) IS27 and IS10 – Potential bit rot in image files that were stored on CD
    SC: Volume of collection means an automated approach is required.
Evaluation:
  (1) Objectives – Scalability (throughput); Reliability; Precision
  (2) -
Solution:
  (1) -
  (2) SO09 – QA for corresponding JP2K comparison for old & new Google Book versions (no workflow, untested)
  (2) SO10 – QA for TIFF to corresponding JP2K comparison (no workflow, untested)
  (2) SO16 – QA for estimation of affine transformation (no workflow, untested)

LSDRT12 – Detect undesired influence of lossy JP2 compression on OCR in absence of ground truth (ONB)
Dataset: Austrian National Library – Digital Book collection
Issue(s):
  IS01 – Digitised TIFFs do not meet storage & access requirements
    SC: Large amount of data; subsequent speed of processing
Evaluation: -
Solution:
  SO28 – A heuristic measure for detecting undesired influence of lossy JP2 compression on OCR in the absence of ground truth (workflow defined)
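
Illustrative note on the "basic check that a JP2 file is complete" mentioned under LSDRT3 (SO01). The sketch below is not the SO01 tool and does not reproduce its logic; it only shows what a very cheap structural check of this kind could look like. The function name check_jp2_structure is hypothetical. It tests two things: that the file starts with the 12-byte JP2 signature box, and that it ends with the JPEG 2000 end-of-codestream marker (FF D9). The second test assumes the codestream box is the last box in the file, which is common but not guaranteed, so a failure there is a hint rather than proof of truncation.

```python
import os

# 12-byte JP2 signature box: length (0x0000000C), type 'jP  ', content 0D 0A 87 0A
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")
# JPEG 2000 end-of-codestream (EOC) marker
EOC_MARKER = b"\xff\xd9"


def check_jp2_structure(path):
    """Run two cheap structural checks on the file at `path`.

    Hypothetical helper for illustration only. Assumes the codestream
    box is the last box in the file, so the file should end with EOC.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Only the first 12 and last 2 bytes are read, so the cost
        # does not grow with file size.
        head = f.read(len(JP2_SIGNATURE))
        tail = b""
        if size >= len(EOC_MARKER):
            f.seek(-len(EOC_MARKER), os.SEEK_END)
            tail = f.read()
    return {
        "has_signature_box": head == JP2_SIGNATURE,
        "ends_with_eoc": tail == EOC_MARKER,
    }
```

Because only a few bytes per file are read, a check like this suits the "rate/speed of checking large collection" concern, but it says nothing about validity of the boxes in between; full validation is what a tool such as Jpylyzer (SO02) is listed for.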