SCAPE – SCAlable Preservation Environments

The SCAPE Platform Overview
Rainer Schmidt
SCAPE Training Event, September 16th – 17th, 2013, The British Library

Goal of the SCAPE Platform
• A hardware and software platform to support scalable preservation in terms of computation and storage.
• Employs a scale-out architecture to support preservation activities against large amounts of data.
• Integrates existing tools, workflows, and data sources and sinks.
• A data center service providing a scalable execution and storage backend for different object management systems.
• Based on a minimal set of defined services for processing tools and/or queries close to the data.

Underlying Technologies
• The SCAPE Platform is built on top of existing data-intensive computing technologies.
• The reference implementation leverages the Hadoop software stack (HDFS, MapReduce, Hive, …).
• Virtualization and packaging model for dynamic deployment of tools and environments
  • Debian packages and IaaS support.
• Repository integration and services
  • Data/Storage Connector API (Fedora and Lily)
  • Object Exchange Format (METS/PREMIS representation)
• Workflow modeling, translation, and provisioning
  • Taverna Workbench and Component Catalogue
  • Workflow Compiler and Job Submission Service

Architectural Overview (Core)
[Architecture diagram: Workflow Modeling Environment, Component Catalogue, Component Lookup API, Component Registration API]
[The same diagram is shown again, highlighting the focus of this talk.]

Hadoop Overview

The Framework
• Open-source software framework for large-scale, data-intensive computations running on large clusters of commodity hardware.
• Derived from Google's publications on the Google File System and MapReduce.
• Hadoop = MapReduce + HDFS
  • MapReduce: programming model (Map, Shuffle/Sort, Reduce) and execution environment.
  • HDFS: virtual distributed file system overlaid on top of the local file systems.

Programming Model
• Designed for a write-once, read-many-times access model.
• Data I/O is handled via HDFS.
• Data is divided into blocks (typically 64 MB) and distributed and replicated over the data nodes.
• Parallelization logic is strictly separated from the user program.
• Automated data decomposition and communication between processing steps.
• Applications benefit from built-in support for data locality and fail-safety.
• Applications scale out on big clusters to process very large data volumes.

Cluster Set-up

Platform Deployment
• There is no prescribed deployment model
  • Private, institutionally shared, or external data center
  • Possible to deploy on "bare metal" or using virtualization and cloud middleware.
• Platform environment packaged as a VM image
  • Automated and scalable deployment.
  • Presently supporting Eucalyptus (and AWS) clouds.
• SCAPE provides two shared Platform instances
  • A stable, non-virtualized data-center cluster
  • A private-cloud based development cluster
  • Partitioning and dynamic reconfiguration

Deploying Environments
• IaaS enables packaging and dynamic deployment of (complex) software environments
  • But requires a complex virtualization infrastructure
• Data-intensive technology is able to deal with a constantly varying number of cluster nodes.
  • Node failures are expected and handled automatically
  • The system can grow/shrink on demand
• A Network Attached Storage solution can be used as a data source
  • But does not meet the scalability and performance needs for computation
• SCAPE Hadoop clusters
  • Linux + preservation tools + SCAPE Hadoop libraries
  • Optionally higher-level services (repository, workflow, …)

Using the Cluster
• Wrapping sequential tools
  • Using a wrapper script (Hadoop Streaming API)
  • PT's generic Java wrapper allows one to use pre-defined patterns (based on the toolspec language)
  • Works well for processing a moderate number of files, e.g. applying migration tools or FITS.
• Writing a custom MapReduce application
  • Much more powerful and usually performs better.
  • Suitable for more complex problems and file formats, such as Web archives.
• Using a high-level language like Hive or Pig
  • Very useful for analysis of (semi-)structured data, e.g. characterization output.

Available Tools
• Preservation tools and libraries are pre-packaged so they can be automatically deployed on cluster nodes
  • SCAPE Debian packages
  • Supporting the SCAPE Tool Specification Language
• MapReduce libraries for processing large container files
  • For example, METS and (W)ARC RecordReaders
• Application scripts
  • Based on Apache Hive, Pig, Mahout
• Software components to assemble complex data-parallel workflows
  • Taverna and Oozie workflows

Sequential Workflows
• In order to run a workflow (or activity) on the cluster, it has to be parallelized first!
• A number of different parallelization strategies exist
  • The approach is typically determined on a case-by-case basis
  • May lead to changes of activities, the workflow structure, or the entire application.
• Automated parallelization will only work to a certain degree
  • Trivial workflows can be deployed/executed without individual parallelization (wrapper approach).
• SCAPE driver program for parallelizing Taverna workflows.
• SCAPE template workflows developed for different institutional scenarios.

Parallel Workflows
• Are typically derived from sequential (conceptual) workflows created for a desktop environment (but may differ substantially!).
• Rely on MapReduce as the parallel programming model and Apache Hadoop as the execution environment.
• Data decomposition is handled by the Hadoop framework based on input format handlers (e.g. text, WARC, METS XML, etc.).
• Can make use of a workflow engine (like Taverna or Oozie) for orchestrating complex (composite) processes.
• May include interactions with data management systems (repositories) and sequential (concurrently executed) tools.
• Tool invocations are based on an API or command-line interface and performed as part of a MapReduce application.
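To illustrate that last point, the following is a minimal sketch (not the SCAPE implementation) of a custom mapper that invokes a command-line tool from within a MapReduce application. It assumes the job input is a text file listing one HDFS path per line; the class name, the use of the Unix `file` command, and the staging to a local temporary file are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Illustrative mapper: each input record is one HDFS path; the referenced
 * file is staged to the local node and a command-line tool (here: `file`)
 * is invoked on it. The tool's first output line is emitted with the path as key.
 */
public class ToolInvocationMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Path hdfsPath = new Path(value.toString().trim());

        // Stage the payload from HDFS to the local file system of the worker node.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        File local = File.createTempFile("payload-", "-" + hdfsPath.getName());
        local.delete(); // keep only the unique name; the copy below recreates the file
        fs.copyToLocalFile(hdfsPath, new Path(local.getAbsolutePath()));

        // Invoke the sequential tool via its command-line interface.
        Process p = new ProcessBuilder("file", "--brief", local.getAbsolutePath())
                .redirectErrorStream(true)
                .start();
        String result;
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            result = out.readLine();
        }
        p.waitFor();
        local.delete();

        // Emit (input path, tool output) for downstream aggregation or QA.
        context.write(new Text(hdfsPath.toString()), new Text(result == null ? "" : result));
    }
}
```

The pt-mapred toolwrapper described below generalizes this pattern through the toolspec language, so simple tool invocations do not require any Java code to be written.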
MapRed Tool Wrapper

Tool Specification Language
• The SCAPE Tool Specification Language (toolspec) provides a schema to formalize command-line tool invocations.
• Can be used to automate a complex tool invocation (many arguments) based on a keyword (e.g. ps2pdfs).
• Provides a simple and flexible mechanism to define tool dependencies, for example of a workflow.
  • These can be resolved by the execution system using Linux packages.
• The toolspec is minimalistic and can easily be created for individual tools and scripts.
• Tools provided as SCAPE Debian packages come with a toolspec document by default.

MapRed Toolwrapper
• Hadoop provides scalability, reliability, and robustness, and supports processing data that does not fit on a single machine.
• Applications must, however, be made compliant with the execution environment.
• Our intention was to provide a wrapper that allows one to execute a command-line tool on the cluster in much the same way as on a desktop environment.
• The user simply specifies the toolspec file, a command name, and the payload data.
• Supports HDFS references and (optionally) standard I/O streams.
• Supports the SCAPE toolspec to execute preinstalled tools or other applications available via the OS command-line interface.

Hadoop Streaming API
• The Hadoop Streaming API supports the execution of scripts (e.g. bash or python), which are automatically translated and executed as MapReduce applications.
• Can be used to process data with common UNIX filters using commands like echo, awk, tr.
• Hadoop is designed to process its input based on key/value pairs, i.e. the input data is interpreted and split by the framework.
  • Perfect for processing text, but difficult for binary data.
• The Streaming API uses streams to read/write from/to HDFS.
  • Preservation tools typically do not support HDFS file pointers and/or I/O streaming through stdin/stdout.
  • Hence, digital preservation tools are difficult to use with the Streaming API.

Suitable Use-Cases
• Use the MapRed Toolwrapper when dealing with (a large number of) single files.
• Be aware that this may not be an ideal strategy; there are more efficient ways to deal with many files on Hadoop (SequenceFiles, HBase, etc.).
• However, it is practical and sufficient in many cases, as no additional application development is required.
• A typical example is file format migration on a moderate number of files (e.g. 100,000s), which can be included in a workflow with additional QA components.
• Very helpful when the payload is simply too big to be computed on a single machine.

Example – Exploring an Uncompressed WARC
• Unpacked a 1 GB WARC.GZ on a local computer
  • 2.2 GB unpacked => 343,288 files
  • `ls` took ~40 s
  • Counting HTML files with `file` took ~4 hrs => 60,000 HTML files
• Provided the corresponding bash command as a toolspec:
  <command>if [ "$(file ${input} | awk "{print \$2}" )" == HTML ]; then echo "HTML" ; fi</command>
• Moved the data to HDFS and executed pt-mapred with the toolspec:
  • 236 min on the local file system
  • 160 min with 1 mapper on HDFS (this was a surprise!)
  • 85 min (2 mappers), 52 min (4), 27 min (8)
  • 26 min with 8 mappers and I/O streaming (also a surprise)
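As a sketch of how the per-file outputs of such a job could be aggregated on the cluster (rather than counted afterwards), a small reducer along the following lines would sum occurrences per detected format. It assumes a mapper that emits the detected format (e.g. "HTML") as key and a count of 1 as value; the class name and key/value layout are illustrative, not part of pt-mapred.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Illustrative reducer: sums the per-file counts emitted as (format, 1) pairs,
 * yielding one total per detected format (e.g. "HTML" -> 60000).
 */
public class FormatCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text format, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable c : counts) {
            total += c.get();
        }
        context.write(format, new LongWritable(total));
    }
}
```

The same aggregation could also be expressed in a few lines of Hive or Pig, as mentioned earlier for analyzing characterization output.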
Ongoing Work
• Source project and README on GitHub, presently under openplanets/scape/pt-mapred*
  • Will be migrated to its own repository soon.
• Presently it is required to generate an input file that specifies the input file paths (along with optional output file names).
  • TODO: read binary input directly based on an input directory path, allowing Hadoop to take advantage of data locality.
• Input/output streaming and piping between toolspec commands have already been implemented.
• TODO: add support for Hadoop SequenceFiles (see the sketch below).
• Look into a possible integration with the Hadoop Streaming API.

* https://github.com/openplanets/scape/tree/master/pt-mapred
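For context on the SequenceFile TODO, here is a minimal sketch of how many small files could be packed into a single Hadoop SequenceFile of (file name, file bytes) records, so that map tasks read large, splittable input with data locality instead of individual small files. The class name, paths, and key/value layout are assumptions for illustration, not the planned pt-mapred design.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Packs every file under an input directory into one SequenceFile of (file name, raw bytes) records. */
public class SequenceFilePacker {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]); // e.g. an unpacked WARC directory in HDFS
        Path seqFile = new Path(args[1]);  // output SequenceFile

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, seqFile, Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) continue;
                // Read the whole payload; suitable for small files only,
                // larger payloads would need chunking.
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    in.readFully(0, content);
                }
                // One record per file: key = file name, value = raw bytes.
                writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
            }
        } finally {
            writer.close();
        }
    }
}
```

A MapReduce job can then consume these records via SequenceFileInputFormat instead of a generated list of input paths.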