OVERVIEW OF SCAPE LSDRT SCENARIOS AS LISTED ON WIKI

Maureen Pennock
July 4th 2012
LSDRT1 – Assessing preservation risks in large media files (SB)
Dataset
MPEG-2 transport stream with Danish TV broadcast
Issue (s)
IS3 - Unknown preservation risks in large media files
SC: Extremely large files, checksumming takes around 3 months
Evaluation
--
Solution
--
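The slide gives no detail on how the checksums are computed. As a minimal sketch of the basic operation, assuming SHA-256 and an 8 MB read size (both illustrative choices, not taken from the scenario), a very large file can be hashed in fixed-size chunks so that memory use stays constant:

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=8 * 1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    The algorithm and chunk size are illustrative defaults, not values
    taken from the scenario.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read chunk by chunk so memory use does not grow with file size.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

This does not by itself address the scalability challenge noted above; it only shows the per-file step that would have to be distributed or scheduled.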
LSDRT2 – Validating files migrated from TIFF to JPEG2000 (BL)
Dataset
JISC1 19th-century digitised newspapers (TIFF, XML METS metadata,
sample JPEGs)
Issue (s)
IS44 - QA of migrated images
SC: Volume of content; automation at scale
IS02 – Do acquired files conform to an agreed technical profile, are they
valid, are they complete? (TO BE INCLUDED)
SC:
Evaluation
--
Solution
SO01 – Simple JP2 file structure checker (obsolete)
SO02 – Jpylyzer JP2 validator & properties extractor – solves 2nd issue
(no workflow link)
LSDRT3 – Identification, characterisation & validation of very large image collections (BL)
Dataset
(1) JISC 1 19th century Digitised newspapers (TIFF > JPEG)
(2) TIFF with scanned books
Issue (s)
(1) IS44 – QA of migrated images
(1) IS02 – Do acquired files conform to an agreed technical profile, are
they valid & are they complete?
SC: Rate/speed of checking large collection
(2) IS23 – Validation of TIFF according to institutional collection profile
Evaluation
--
Solution
(2) SO01 – Simple JP2 file structure checker provides basic check that a
JP2 file is complete (Evaluated, but no workflow link)
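To illustrate what a basic completeness check on a JP2 file can look like, the sketch below walks the top-level boxes of a file and reports whether they fit within the data. It is not the SO01 tool itself: the function names are hypothetical, and only the box layout (4-byte length, 4-byte type, optional 8-byte extended length) and the 12-byte signature box come from the JP2 file format.

```python
import struct

# The 12-byte JP2 signature box that opens every JP2 file.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def list_jp2_boxes(data):
    """Return the top-level box types; raise ValueError on a broken structure."""
    if not data.startswith(JP2_SIGNATURE):
        raise ValueError("missing JP2 signature box")
    boxes, offset = [], 0
    while offset < len(data):
        if offset + 8 > len(data):
            raise ValueError("truncated box header")
        length, box_type = struct.unpack(">I4s", data[offset:offset + 8])
        minimum = 8
        if length == 1:
            # Extended length: an 8-byte XLBox follows the type field.
            if offset + 16 > len(data):
                raise ValueError("truncated extended box header")
            (length,) = struct.unpack(">Q", data[offset + 8:offset + 16])
            minimum = 16
        elif length == 0:
            # A length of zero means the box runs to the end of the file.
            length = len(data) - offset
        if length < minimum or offset + length > len(data):
            raise ValueError("box %r overruns the file" % box_type)
        boxes.append(box_type.decode("latin-1"))
        offset += length
    return boxes


def looks_complete(data):
    """Hypothetical helper: True if the boxes are intact and a codestream box exists."""
    try:
        return "jp2c" in list_jp2_boxes(data)
    except ValueError:
        return False
```

A check like this only catches structural truncation. It says nothing about whether the codestream decodes or whether the image matches a technical profile.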
LSDRT4 – Out of sync S&V in WMV > Video format x Migration results (SB)
Dataset
WMV with Danish TV broadcasts
Issue (s)
IS13 – Migration from WMV to Video format x results in out of sync S&V
SC: Number of files, automation needed to avoid manual check
Evaluation Objectives – scalability & automation
Solution
SO05 – Video Migration & QA (Evaluated, but no workflow link)
SO02 – xcorrSound QA audio comparison tool (Evaluated with workflow
link)
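The slide does not describe how xcorrSound works internally. As a rough illustration of the general idea behind cross-correlation based comparison, the sketch below finds the sample offset at which two audio tracks line up best. The function name and the brute-force search are assumptions made for the example; a tool working at scale would more likely use an FFT-based correlation.

```python
def best_lag(reference, candidate, max_lag):
    """Return the lag (in samples) at which candidate best matches reference.

    A positive result means candidate is delayed relative to reference.
    Brute-force cross-correlation, for illustration only.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(candidate):
                # Accumulate the product of overlapping samples at this lag.
                score += value * candidate[j]
        if score > best_score:
            best, best_score = lag, score
    return best
```

Dividing the returned lag by the sample rate gives the offset in seconds. A non-zero offset between the original and the migrated soundtrack would be one sign of the out-of-sync problem described in IS13.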
LSDRT5 – Detecting audio files with very bad sound quality (SB)
Dataset
Mp3 with Danish Radio broadcasts (200GB sample)
Issue (s)
IS20 – Detect audio files with very bad sound quality
SC: Large amount of data (20TB total)
Evaluation Objectives – scalability (number of files); reliability & precision. Speed
not an issue.
Solution
--
LSDRT6 – Large scale migration mp3 > WAV (SB)
Dataset
Mp3 with Danish Radio broadcasts (200GB sample)
Issue (s)
(1) IS21 – Migration of mp3 to WAV
SC: Large amount of data (20TB total), required distributed
platform
(2) IS20 – Detect audio files with very bad sound quality
SC: Large amount of data (20TB total)
Evaluation (1) Objectives – scalability & robustness. Speed: c. 20 files per hour per node.
(2) Objectives – scalability (number of files); reliability & precision.
Speed not an issue.
Solution
(1) SO04 – Audio mp3 to WAV migration & QA workflow (Evaluated and
workflow documented)
(2) SO02 – xcorrSound QA audio comparison tool (included in SO04)
(2) --
LSDRT7 – Characterise & validate very large video files (SB)
Dataset
Mpeg video with Danish TV broadcasts
Issue (s)
IS22 – Characterise & validate v large mpeg1 & mpeg2 files
SC: Very large files; very large collection
Evaluation Objectives: robustness & scalability. Speed 2TB <24 hours.
Solution
SO25 – Rosetta v3.0 implementation integrated with DROID6 (untested)
LSDRT9 – Characterisation of large amounts of WAV audio (SB)
Dataset
WAV with Danish Radio broadcasts, ripped audio CDs and SB in-house
audio digitisation
Issue (s)
IS24 – Characterisation of large amounts of WAV audio
SC: V large amounts of data; some v large files
Evaluation Objectives: scalability & functionality
Solution
SO06 – Use ffprobe to characterise WAV (no evaluation)
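Since SO06 has no evaluation or workflow attached, the following is only a sketch of how ffprobe output might be turned into a small characterisation record. The helper names and the choice of properties are assumptions; the ffprobe options used (`-v error`, `-show_format`, `-show_streams`, `-of json`) are standard ffprobe flags.

```python
import json


def build_ffprobe_command(path):
    """Build an ffprobe command that reports format and stream data as JSON."""
    return ["ffprobe", "-v", "error", "-show_format", "-show_streams",
            "-of", "json", path]


def summarise_probe(probe_json):
    """Pick a few characterisation properties out of ffprobe's JSON output."""
    probe = json.loads(probe_json)
    audio = [s for s in probe.get("streams", []) if s.get("codec_type") == "audio"]
    first = audio[0] if audio else {}
    fmt = probe.get("format", {})
    return {
        "format": fmt.get("format_name"),
        "codec": first.get("codec_name"),
        # ffprobe reports the sample rate and duration as strings.
        "sample_rate_hz": int(first["sample_rate"]) if "sample_rate" in first else None,
        "channels": first.get("channels"),
        "duration_s": float(fmt["duration"]) if "duration" in fmt else None,
    }
```

With ffprobe installed, the command could be run with `subprocess.run(build_ffprobe_command(path), capture_output=True, text=True, check=True)` and its `stdout` passed to `summarise_probe`.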
LSDRT10 – Capturing representation information from original image files (BL)
Dataset
Camera raw files
Issue (s)
IS40 – Complexity of Camera raw files
SC: Complexity of proprietary formats
Evaluation
--
Solution
SO26 – Automated RAW to DNG migration + QA (no workflow, untested)
LSDRT11 – Image based document comparison (ONB)
Dataset
(1) Austrian National Library - Digital Book collection
(2) IDP samples from BL
Issue (s)
(1) IS27 Quality assurance in re-downloaded workflows of audio books
SC: V large collection: speed for compressing each file becomes
an issue at scale.
(2) IS27 and IS10 – Potential bit rot in image files that were stored on CD
SC: Volume of collection means automated approach is
required.
Evaluation (1) Objectives – Scalability (throughputs); Reliability; Precision
(2) --
Solution
(1) --
(2) SO09 QA for corresponding JP2K comparison for old & new Google
Book versions (no workflow, untested)
(2) SO10 QA for TIFF to corresponding JP2K comparison (no workflow,
untested)
(2) SO16 QA for estimation of affine transformation (no workflow,
untested)
LSDRT12 – Detect undesired influence of lossy JP2 compression on OCR in absence of
ground truth (ONB)
Dataset
Austrian National Library – Digital Book collection
Issue (s)
IS01 – Digitised TIFFs do not meet storage & access requirements
SC: Large amount of data, subsequent speed of processing
Evaluation
--
Solution
SO28 – A heuristic measure for detecting undesired influence of lossy
JP2 compression on OCR in the absence of ground truth (workflow
defined)