Satellite Image Processing and Production with Apache Hadoop
U.S. Department of the Interior
U.S. Geological Survey
David V. Hill, Information Dynamics, Contractor to USGS/EROS
12/08/2011
Overview
• Apache Hadoop
• Applications, Environment and Use Case
• Log Processing Example
• EROS Science Processing Architecture (ESPA) and Hadoop
• ESPA Processing Example
• ESPA Implementation Strategy
• Performance Results
• Thoughts, Notes and Takeaway
• Questions
Apache Hadoop – What is it?
• Open source distributed processing system
• Designed to run on commodity hardware
• Widely used for solving “Big Data” challenges
• Has been deployed in clusters with thousands of machines and petabytes of storage
• Two primary subsystems: the Hadoop Distributed File System (HDFS) and the MapReduce engine
Hadoop’s Applications
• Web content indexing
• Data mining
• Machine learning
• Statistical analysis and modeling
• Trend analysis
• Search optimization
• … and of course, satellite image processing!
Hadoop’s Environment
• Linux and Unix
• Java based, but relies on ssh for job distribution
• Jobs can be written in any language executable from a shell prompt
  - Java, C/C++, Perl, Python, Ruby, R, Bash, et al.
Hadoop’s Use Case
• A cluster of machines is configured into a Hadoop cluster
• Each machine contributes:
  - Local compute resources to MapReduce
  - Local storage resources to HDFS
• Files are stored in HDFS
  - File sizes are typically measured in gigabytes and terabytes
• A job is run against an input file in HDFS
  - The target input file is specified
  - The code to run against the input is also specified
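As a sketch of this flow from the command line (the file names and paths are illustrative, and the streaming jar location varies by Hadoop version):

  hadoop fs -put scenes.txt /user/espa/scenes.txt
  hadoop jar hadoop-streaming.jar \
    -input /user/espa/scenes.txt \
    -output /user/espa/results \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py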
Hadoop’s Use Case
• Unlike traditional systems, which move data to the code, Hadoop flips this and moves the code to the data
• Two software functions comprise a MapReduce job:
  - The Map operation
  - The Reduce operation
• Upon execution:
  - The “Map”: Hadoop identifies the input file chunk locations, moves the algorithms to them, and executes the code
  - The “Reduce”: sorts the Map results and aggregates the final answer (single thread)
Log Processing Example
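To make this concrete, a Hadoop Streaming log job might look like the sketch below. It is an illustration rather than the actual EROS code: it counts requests per HTTP status code, and it assumes an Apache-style access log in which the status code is the ninth whitespace-delimited field.

  #!/usr/bin/env python
  # mapper.py -- emit one count per HTTP status code
  # (assumes an Apache-style access log; field 9 is the status code)
  import sys

  for line in sys.stdin:
      fields = line.split()
      if len(fields) >= 9:
          print('%s\t1' % fields[8])

  #!/usr/bin/env python
  # reducer.py -- sum the counts for each key; Hadoop delivers
  # mapper output to the reducer sorted by key
  import sys

  current_key, total = None, 0
  for line in sys.stdin:
      key, count = line.rstrip('\n').split('\t')
      if key != current_key:
          if current_key is not None:
              print('%s\t%d' % (current_key, total))
          current_key, total = key, 0
      total += int(count)
  if current_key is not None:
      print('%s\t%d' % (current_key, total))

The same pair can be tested locally with a plain shell pipeline: cat access.log | ./mapper.py | sort | ./reducer.py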
ESPA and Hadoop
• Hadoop map code runs in parallel on the input (log file)
  - Processes a single input file as quickly as possible
• Reduce code runs on the mapper output
• ESPA processes satellite images, not text
  - Algorithms cannot run in parallel within an image
  - Cannot use satellite images as the input
• Solution: use a text file containing the image locations as input, and skip the reduce step
• Rather than parallelizing within an image, ESPA handles many images at once
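A map-only job of this shape might look like the following sketch. process_scene() is a hypothetical stand-in for the real science code, and the reduce step is skipped by requesting zero reducers (e.g. -D mapred.reduce.tasks=0 with 0.20-era Hadoop Streaming):

  #!/usr/bin/env python
  # scene_mapper.py -- each input line names one scene to process
  import sys

  def process_scene(scene_url):
      # hypothetical: fetch the scene, run the corrections, stage the output
      return 'ok'

  for line in sys.stdin:
      scene_url = line.strip()
      if not scene_url:
          continue
      # emit one status line per scene so failures are visible in job output
      print('%s\t%s' % (scene_url, process_scene(scene_url)))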
ESPA Processing Example
Implementation Strategy
• LSRD is budget constrained for hardware
• Other projects regularly excess old hardware upon warranty expiration
• Take ownership of these systems… if they fail, they fail
• Also ‘borrow’ compute and storage from other projects
  - Only network connectivity is necessary
• Current cluster is 102 cores, at minimal expense
  - Cables, switches, etc.
Performance Results
• Original throughput requirement was 455 atmospherically corrected Landsat scenes per day
• Currently able to process ~4,800!
• Biggest bottleneck is local machine storage input/output
  - Due to FTPing files instead of using HDFS as intended
• Attempted to solve this with a RAM disk, but there was not enough memory
• Currently evaluating solid state disks
Thoughts and Notes
• The number of splits on an input file can be controlled via the dfs.block.size parameter
  - This in turn controls the number of map tasks run against an input file (see the sketch at the end of this slide)
• An ESPA-like implementation does not require massive storage, unlike other Hadoop instances
  - Input files are very small
• Robust internal job monitoring mechanisms are usually custom-built
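For example, writing the input file into HDFS with a deliberately small block size forces more splits, and therefore more parallel map tasks (a sketch; the value is illustrative, and the block size must be set when the file is written):

  hadoop fs -D dfs.block.size=1048576 -put scenes.txt /user/espa/scenes.txt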
Thoughts and Notes
• Jobs written for Hadoop Streaming may be tested and run without Hadoop
  - cat inputfile.txt | ./mapper.py | sort | ./reducer.py > out.txt
• Projects can share resources
  - Hadoop is tunable to restrict resource utilization on a per-machine basis (see the example below)
• Provides instant productivity gains versus internal development
  - LSRD is all about science and science algorithms
  - Minimal time and budget for building internal systems
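One way to enforce that restriction, assuming Hadoop 0.20/1.x parameter names, is to cap the concurrent task slots in mapred-site.xml on each shared node (the value is illustrative):

  <!-- mapred-site.xml: allow at most 2 concurrent map tasks on this node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>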
Takeaways
• Hadoop is proven and tested
• Massively scalable out of the box
• Cloud-based instances available from Amazon and others
• Shortest path to processing massive amounts of data
• Extremely tolerant of hardware failure
• No specialized hardware or software needed
• Flexible job API allows existing software skills to be leveraged
• Industry adoption means support skills are available
Questions