Satellite Image Processing And
Production With Apache Hadoop
U.S. Department of the Interior
U.S. Geological Survey
David V. Hill, Information Dynamics,
Contractor to USGS/EROS
 Apache Hadoop
 Applications, Environment and Use Case
 Log Processing Example
 EROS Science Processing Architecture (ESPA) and Hadoop
ESPA Processing Example
ESPA Implementation Strategy
Performance Results
Thoughts, Notes and Takeaway
Apache Hadoop – What is it?
 Open source distributed processing system
 Designed to run on commodity hardware
 Widely used for solving “Big Data” challenges
 Has been deployed in clusters with thousands of
machines and petabytes of storage
Two primary subsystems: Hadoop Distributed File
System (HDFS) and the MapReduce engine
Hadoop’s Applications
 Web content indexing
 Data mining
 Machine learning
 Statistical analysis and modeling
 Trend analysis
 Search optimization
 … and of course, satellite image processing!
Hadoop’s Environment
 Linux and Unix
 Java-based, but relies on SSH for job distribution
 Jobs may be written in any language executable from the shell
Java, C/C++, Perl, Python, Ruby, R, Bash, et al.
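As an illustration of this language flexibility, a Hadoop Streaming mapper can be an ordinary shell script. The sketch below is hypothetical (the script name and the assumption that an HTTP status code sits in field 9 of an access-log line are not from the original material):

  #!/bin/bash
  # mapper.sh -- read log lines on stdin, emit "<status code><TAB>1" for each
  awk '{ print $9 "\t" 1 }'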
Hadoop’s Use Case
A group of machines is configured into a Hadoop cluster
Each machine contributes:
 Local compute resources to MapReduce
 Local storage resources to HDFS
Files are stored in HDFS
 File size is typically measured in gigabytes and terabytes
Job is run against an input file in HDFS
 Target input file is specified
 Code to run against input also specified
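A sketch of how such a job might be submitted with Hadoop Streaming; the HDFS paths, script names, and location of the streaming jar are placeholders that vary by installation:

  # copy the input file into HDFS, then run the code against it
  hadoop fs -put access.log /data/access.log
  hadoop jar hadoop-streaming.jar \
      -input /data/access.log \
      -output /data/out \
      -mapper mapper.sh \
      -reducer reducer.sh \
      -file mapper.sh -file reducer.sh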
Hadoop’s Use Case
Unlike traditional systems which move data to the
code, Hadoop flips this and moves code to the data
Two software functions comprise a MapReduce job
 Map operation
 Reduce operation
Upon execution:
 Hadoop identifies input file chunk locations, moves the algorithm to the data, and executes the code (the “Map”)
 Sorts the Map results and aggregates the final answer in a single thread (the “Reduce”)
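Continuing the hypothetical log example, a matching reducer sketch: Hadoop sorts the mapper output by key, and the reducer sums the counts into the final answer:

  #!/bin/bash
  # reducer.sh -- sum the per-key counts emitted by the mapper
  awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }'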
Log Processing Example
ESPA and Hadoop
Hadoop map code runs in parallel on the input (log file)
 Processes a single input file as quickly as possible
Reduce code runs on mapper output
ESPA processes satellite images, not text
 Algorithms cannot run in parallel within an image
 Cannot use satellite images as the input
Solution: Use a text file with the image location as
input. Skip the reduce step
Rather than parallelize within an image, ESPA
handles many images at once
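A minimal sketch of an ESPA-style mapper under these assumptions: each input line holds the location of one scene, and do_science stands in for the real processing command (both the fetch step and the command name are hypothetical):

  #!/bin/bash
  # espa_mapper.sh -- one input line = one scene to process
  while read scene_url; do
      scene=$(basename "$scene_url")
      wget -q "$scene_url" -O "/tmp/$scene"   # fetch the scene to local disk
      do_science "/tmp/$scene"                # run the science algorithms
      echo "$scene done"                      # report status back through Hadoop
  done

Skipping the reduce step amounts to running the job with zero reduce tasks (for example, -D mapred.reduce.tasks=0).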
ESPA Processing Example
Implementation Strategy
LSRD is budget-constrained for hardware
Other projects regularly excess (surplus) old hardware when warranties expire
Take ownership of these systems… if they
fail, they fail
Also ‘borrow’ compute and storage from
other projects
 Only network connectivity is necessary
Current cluster is 102 cores, minimal expense
 Cables, switches, etc.
Performance Results
Original throughput requirement was 455
atmospherically corrected Landsat scenes per day
Currently able to process ~4,800 scenes per day!
Biggest bottleneck is local machine storage
 Due to ftp’ing files to local disk instead of using HDFS for storage
Attempted to solve this with a RAM disk; not large enough
Currently evaluating solid-state disks
Thoughts and Notes
Number of splits on the input file can be controlled via the dfs.block.size parameter
 Therefore controls the number of map tasks run against an input file (see the example after this list)
An ESPA-like implementation does not require massive storage, unlike other Hadoop deployments
 Input files are very small
Robust internal job monitoring mechanisms
are usually custom-built
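As an illustration of the split-control note above, one place dfs.block.size can be applied is when the input list is written into HDFS; the 1 MB value and the paths are purely illustrative:

  # store the small input file with a 1 MB block size so it yields more
  # splits, and therefore more parallel map tasks
  hadoop fs -D dfs.block.size=1048576 -put scene_list.txt /data/scene_list.txt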
Thoughts and Notes
Jobs written for Hadoop Streaming may be tested and run without Hadoop
 cat inputfile.txt | ./mapper.sh | sort | ./reducer.sh > out.txt
Projects can share resources
 Hadoop is tunable to restrict resource utilization on a per-machine basis
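One way to restrict per-machine utilization (assuming the classic TaskTracker daemons) is to cap the task slots in each node's mapred-site.xml; the values shown are placeholders:

  <!-- mapred-site.xml: at most one map and one reduce task at a time on this node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>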
Provides instant productivity gains versus
internal development
 LSRD is all about science and science algorithms
 Minimal time and budget for building internal systems
 Hadoop is proven and tested
 Massively scalable out of the box
 Cloud-based instances available from Amazon and other providers
Shortest path to processing massive amounts of data
Extremely tolerant of hardware failures
No specialized hardware or software needed
Flexible job API allows existing software skills to be reused
Industry adoption means support skills are available