
SCAPE
The SCAPE Platform
Overview
Rainer Schmidt
SCAPE Training Event
September 16th – 17th, 2013
The British Library
SCAPE
SCAlable Preservation Environments
Goal of the SCAPE Platform
• Hardware and software platform to support scalable
preservation in terms of computation and storage.
• Employing a scale-out architecture to support
preservation activities against large amounts of data.
• Integration of existing tools, workflows, and
data sources and sinks.
• A data center service providing a scalable execution
and storage backend for different object management
systems.
• Based on a minimal set of defined services for
processing tools and/or queries close to the data.
SCAPE
SCAlable Preservation Environments
Underlying Technologies
• The SCAPE Platform is built on top of existing data-intensive
computing technologies.
• Reference Implementation leverages Hadoop Software Stack (HDFS,
MapReduce, Hive, …)
• Virtualization and packaging model for dynamic deployments of
tools and environments
• Debian packages and IaaS support.
• Repository Integration and Services
• Data/Storage Connector API (Fedora and Lily)
• Object Exchange Format (METS/PREMIS representation)
• Workflow modeling, translation, and provisioning.
• Taverna Workbench and Component Catalogue
• Workflow Compiler and Job Submission Service
SCAPE
SCAlable Preservation Environments
Architectural Overview (Core)
[Architecture diagram: Workflow Modeling Environment, Component Catalogue, Component Lookup API, Component Registration API]
SCAPE
SCAlable Preservation Environments
Architectural Overview (Core)
Focus of this talk
[Architecture diagram repeated, highlighting the components covered in this talk]
SCAPE
SCAlable Preservation Environments
Hadoop Overview
SCAPE
SCAlable Preservation Environments
The Framework
• Open-source software framework for large-scale data-intensive
computations running on large clusters of commodity hardware.
• Derived from the Google File System and MapReduce publications.
• Hadoop = MapReduce + HDFS
• MapReduce: Programming Model (Map, Shuffle/Sort,
Reduce) and Execution Environment.
• HDFS: Virtual distributed file system overlay on top of local
file systems.
SCAPE
SCAlable Preservation Environments
Programming Model
• Designed for a write-once, read-many-times access model.
• Data IO is handled via HDFS.
• Data is divided into blocks (typically 64 MB) and distributed and
replicated over data nodes (see the example below).
• Parallelization logic is strictly separated from user
program.
• Automated data decomposition and communication between
processing steps.
• Applications benefit from built-in support for data locality and
fail-safety.
• Applications scale-out on big clusters processing very large data
volumes.
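
As a minimal illustration of this storage model (paths and file names are illustrative), a file can be copied into HDFS and its block placement inspected from the command line:

  # copy a local file into HDFS; the framework splits it into fixed-size blocks
  hadoop fs -put large-input.warc /user/scape/input/
  # list the blocks of the file and the data nodes holding each replica
  hadoop fsck /user/scape/input/large-input.warc -files -blocks -locations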
SCAPE
SCAlable Preservation Environments
Cluster Set-up
SCAPE
SCAlable Preservation Environments
Platform Deployment
• There is no prescribed deployment model
• Private, institutionally-shared, external data center
• Possible to deploy on “bare-metal” or using
virtualization and cloud middleware.
• Platform Environment packaged as VM image
• Automated and scalable deployment.
• Presently supporting Eucalyptus (and AWS) clouds.
• SCAPE provides two shared Platform instances
• Stable non-virtualized data-center cluster
• Private-cloud based development cluster
• Partitioning and dynamic reconfiguration
SCAPE
SCAlable Preservation Environments
Deploying Environments
• IaaS enables packaging and dynamic deployment of (complex)
software environments
• But requires complex virtualization infrastructure
• Data-intensive technology is able to deal with a constantly
varying number of cluster nodes.
• Node failures are expected and automatically handled
• System can grow/shrink on demand
• A Network Attached Storage solution can be used as a data source
• But it does not meet the scalability and performance needs of the computation
• SCAPE Hadoop Clusters
• Linux + Preservation tools + SCAPE Hadoop libraries
• Optionally Higher-level services (repository, workflow, …)
SCAPE
SCAlable Preservation Environments
Using the Cluster
SCAPE
SCAlable Preservation Environments
• Wrapping Sequential Tools
• Using a wrapper script (Hadoop Streaming API)
• PT’s generic Java wrapper allows one to use pre-defined
patterns (based on toolspec language)
• Works well for processing a moderate number of files
• e.g. applying migration tools or FITS.
• Writing a custom MapReduce application
• Much more powerful and usually performs better.
• Suitable for more complex problems and file formats, such
as Web archives.
• Using a High-level Language like Hive and Pig
• Very useful to perform analysis of (semi-)structured data,
e.g. characterization output.
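
For instance, a format distribution over characterization output can be computed with a few lines of Hive. The table name and layout below are purely illustrative, assuming tab-separated records with an object identifier and a detected MIME type stored in HDFS:

  hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS characterization (id STRING, mime STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/user/scape/characterization/';
    SELECT mime, COUNT(*) AS cnt FROM characterization GROUP BY mime;"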
SCAPE
SCAlable Preservation Environments
Available Tools
• Preservation tools and libraries are pre-packaged so they
can be automatically deployed on cluster nodes
• SCAPE Debian Packages
• Supporting SCAPE Tool Specification Language
• MapReduce libs for processing large container files
• For example METS and (W)ARC RecordReaders
• Application Scripts
• Based on Apache Hive, Pig, Mahout
• Software components to assemble complex data-parallel
workflows
• Taverna and Oozie Workflows
SCAPE
SCAlable Preservation Environments
Sequential Workflows
• In order to run a workflow (or activity) on the cluster it will
have to be parallelized first!
• A number of different parallelization strategies exist
• Approach typically determined on a case-by-case basis
• May lead to changes of activities, workflow structure, or
the entire application.
• Automated parallelization will only work to a certain degree
• Trivial workflows can be deployed/executed without
requiring individual parallelization (wrapper approach).
• SCAPE driver program for parallelizing Taverna workflows.
• SCAPE template workflows have been developed for different
institutional scenarios.
SCAPE
SCAlable Preservation Environments
Parallel Workflows
• Are typically derived from sequential (conceptual) workflows
created for a desktop environment (but may differ substantially!).
• Rely on MapReduce as the parallel programming model and
Apache Hadoop as execution environment
• Data decomposition is handled by the Hadoop framework based
on input format handlers (e.g. text, WARC, METS XML)
• Can make use of a workflow engine (like Taverna or Oozie)
for orchestrating complex (composite) processes (see the sketch below).
• May include interactions with data management systems (repositories)
and sequential (concurrently executed) tools.
• Tool invocations are based on an API or command-line interface and
performed as part of a MapReduce application.
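
Where such orchestration is needed, a workflow definition can be deployed to HDFS and submitted through the Oozie client. The paths below are illustrative only:

  # workflow.xml describes the sequence of MapReduce (and other) actions
  hadoop fs -put workflow.xml /user/scape/workflows/migration/
  # job.properties points oozie.wf.application.path at that HDFS directory
  oozie job -oozie http://localhost:11000/oozie -config job.properties -run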
SCAPE
SCAlable Preservation Environments
MapRed Tool Wrapper
SCAPE
SCAlable Preservation Environments
Tool Specification Language
• The SCAPE Tool Specification Language (toolspec) provides a
schema to formalize command line tool invocations.
• Can be used to automate a complex tool invocation (many
arguments) based on a keyword (e.g. ps2pdfs; see the sketch below)
• Provides a simple and flexible mechanism to define tool
dependencies, for example those of a workflow.
• These can be resolved by the execution system using Linux
packages.
• The toolspec is minimalistic and can be easily created for
individual tools and scripts.
• Tools provided as SCAPE Debian packages come with a
toolspec document by default.
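
As an illustration, a toolspec entry that maps the keyword ps2pdfs to a Ghostscript invocation might look roughly as follows. Only the <command> element and the ${input} placeholder convention mirror the example later in this deck; the surrounding element names and the ${output} placeholder are illustrative, the authoritative schema ships with the SCAPE toolspec:

  <tool name="ghostscript">
    <operation name="ps2pdfs">
      <description>Convert PostScript to PDF</description>
      <command>gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=${output} ${input}</command>
    </operation>
  </tool>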
SCAPE
SCAlable Preservation Environments
MapRed Toolwrapper
• Hadoop provides scalability, reliability, and robustness,
supporting the processing of data that does not fit on a single
machine.
• Application must however be made compliant with the
execution environment.
• Our intention was to provide a wrapper allowing one to
execute a command-line tool on the cluster much like on a
desktop environment.
• User simply specifies toolspec file, command name, and payload
data.
• Supports HDFS references and (optionally) standard IO streams.
• Supports the SCAPE toolspec to execute preinstalled tools or
other applications available via OS command-line interface.
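
A typical run then boils down to a single job submission. The commands below are only a sketch; the jar name and option names are illustrative, the authoritative usage is described in the project README:

  # list of HDFS input paths (one per line), as currently required by the wrapper
  hadoop fs -put filelist.txt /user/scape/filelist.txt
  # submit the wrapper as an ordinary MapReduce job, referencing a toolspec
  # file and the command keyword to apply to every input file
  hadoop jar pt-mapred.jar -i /user/scape/filelist.txt -t ps2pdfs.xml -a ps2pdfs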
SCAPE
SCAlable Preservation Environments
Hadoop Streaming API
• The Hadoop streaming API supports the execution of scripts (e.g.
bash or python), which are automatically translated and
executed as MapReduce applications (see the example below).
• Can be used to process data with common UNIX filters using
commands like echo, awk, tr.
• Hadoop is designed to process its input based on key/value
pairs. This means the input data is interpreted and split by the
framework.
• Perfect for processing text but difficult to process binary data.
• The streaming API uses streams to read/write from/to HDFS.
• Preservation tools typically do not support HDFS file pointers
and/or IO streaming through stdin/stdout.
• Hence, DP tools are difficult to use with the streaming API
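
The canonical usage pattern, adapted from the Hadoop documentation (paths are illustrative and the location of the streaming jar depends on the Hadoop distribution), pipes HDFS input through ordinary UNIX executables acting as mapper and reducer:

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/scape/text-input \
    -output /user/scape/text-output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc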
SCAPE
SCAlable Preservation Environments
Suitable Use-Cases
• Use MapRed Toolwrapper when dealing with (a large number
of) single files.
• Be aware that this may not be an ideal strategy and there
are more efficient ways to deal with many files on Hadoop
(SequenceFiles, HBase, etc.).
• However, practical and sufficient in many cases, as there is
no additional application development required.
• A typical example is file format migration on a moderate
number of files (e.g. 100,000s), which can be included in a
workflow with additional QA components.
• Very helpful when payload is simply too big to be computed
on a single machine.
SCAPE
SCAlable Preservation Environments
Example – Exploring an uncompressed WARC
• Unpacked a 1GB WARC.GZ on local computer
• 2.2 GB unpacked => 343,288 files
• `ls` took ~40s
• counting *.html files with `file` took ~4 hrs => 60,000 HTML files
• Provided corresponding bash command as toolspec:
• <command>if [ "$(file ${input} | awk "{print \$2}" )" == HTML ]; then echo
"HTML" ; fi</command>
• Moved data to HDFS and executed pt-mapred with toolspec.
• 236min on local file system
• 160min with 1 mapper on HDFS (this was a surprise!)
• 85min (2), 52min (4), 27min (8)
• 26min with 8 mappers and IO streaming (also a surprise)
SCAPE
SCAlable Preservation Environments
Ongoing Work
• Source project and README on Github presently under
openplanets/scape/pt-mapred*
• Will be migrated to its own repository soon.
• Presently required to generate an input file that specifies input
file paths (along with optional output file names).
• TODO: Take binary input directly based on an input directory path,
allowing Hadoop to take advantage of data locality.
• Input/output streaming and piping between toolspec
commands have already been implemented.
• TODO: Add support for Hadoop Sequence Files.
• Look into possible integration with Hadoop Streaming API.
* https://github.com/openplanets/scape/tree/master/pt-mapred
SCAPE
SCAlable Preservation Environments