Satellite Image Processing and Production with Apache Hadoop
U.S. Department of the Interior / U.S. Geological Survey
David V. Hill, Information Dynamics, contractor to USGS/EROS
12/08/2011

Overview
- Apache Hadoop
- Applications, Environment and Use Case
- Log Processing Example
- EROS Science Processing Architecture (ESPA) and Hadoop
- ESPA Processing Example
- ESPA Implementation Strategy
- Performance Results
- Thoughts, Notes and Takeaways
- Questions

Apache Hadoop – What is it?
- Open-source distributed processing system
- Designed to run on commodity hardware
- Widely used for solving "Big Data" challenges
- Has been deployed in clusters with thousands of machines and petabytes of storage
- Two primary subsystems: the Hadoop Distributed File System (HDFS) and the MapReduce engine

Hadoop's Applications
- Web content indexing
- Data mining
- Machine learning
- Statistical analysis and modeling
- Trend analysis
- Search optimization
- ... and of course, satellite image processing!

Hadoop's Environment
- Linux and Unix
- Java based, but relies on ssh for job distribution
- Jobs may be written in any language executable from a shell prompt: Java, C/C++, Perl, Python, Ruby, R, Bash, et al.

Hadoop's Use Case
- A group of machines is configured into a Hadoop cluster
- Each machine contributes:
  - Local compute resources to MapReduce
  - Local storage resources to HDFS
- Files are stored in HDFS
  - File size is typically measured in gigabytes and terabytes
- A job is run against an input file in HDFS
  - The target input file is specified
  - The code to run against that input is also specified

Hadoop's Use Case
- Unlike traditional systems, which move data to the code, Hadoop flips this and moves code to the data
- Two software functions comprise a MapReduce job:
  - The Map operation
  - The Reduce operation
- Upon execution:
  - Hadoop identifies the input file chunk locations, moves the algorithms to those nodes, and executes the code: the "Map"
  - It then sorts the Map results and aggregates the final answer in a single thread: the "Reduce"

Log Processing Example

ESPA and Hadoop
- Hadoop map code runs in parallel on the input (the log file above)
  - Processes a single input file as quickly as possible
  - Reduce code runs on the mapper output
- ESPA processes satellite images, not text
  - The algorithms cannot run in parallel within an image
  - So satellite images cannot be used as the input
- Solution: use a text file with the image locations as the input, and skip the reduce step (a sketch of such a mapper follows the Performance Results slide)
  - Rather than parallelizing within an image, ESPA handles many images at once

ESPA Processing Example

Implementation Strategy
- LSRD is budget constrained for hardware
- Other projects regularly excess old hardware upon warranty expiration
  - Take ownership of these systems... if they fail, they fail
- Also "borrow" compute and storage from other projects
  - Only network connectivity is necessary
- Current cluster is 102 cores at minimal expense
  - Cables, switches, etc.

Performance Results
- Original throughput requirement was 455 atmospherically corrected Landsat scenes per day
- Currently able to process ~4,800!
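The mapper idea described in the ESPA and Hadoop slide can be made concrete with a short Hadoop Streaming sketch. This is not ESPA's actual code: the process_scene helper, the echo placeholder command, and the scene/status output format are illustrative assumptions. The only points it demonstrates are that the HDFS input is a small text file of scene locations and that no reducer is needed.

    #!/usr/bin/env python
    """Hypothetical ESPA-style Hadoop Streaming mapper (a sketch, not ESPA code).

    The HDFS input is a small text file listing one scene location per line.
    Hadoop splits that list across the cluster, each map task processes the
    scenes on its lines, and the reduce step is skipped entirely.
    """
    import subprocess
    import sys


    def process_scene(scene_location):
        # Placeholder for the real work: fetch the scene (the deck notes that
        # files are ftp'd rather than read from HDFS), run the correction
        # algorithms, and stage the output. Faked here with a harmless command.
        return subprocess.call(['echo', 'processing', scene_location]) == 0


    if __name__ == '__main__':
        for line in sys.stdin:
            scene = line.strip()
            if not scene:
                continue
            ok = process_scene(scene)
            # Whatever the mapper prints becomes the map output; with zero
            # reducers it is simply written out as a per-scene status record.
            print('%s\t%s' % (scene, 'ok' if ok else 'failed'))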
Performance Results
- The biggest bottleneck is local machine storage input/output
  - Due to our implementation ftp'ing files instead of using HDFS as intended
- Attempted to solve this with a RAM disk, but there was not enough memory
- Currently evaluating solid-state disks

Thoughts and Notes
- The number of splits on an input file can be controlled via the dfs.block.size parameter
  - This in turn controls the number of map tasks run against an input file
- An ESPA-like implementation does not require massive storage, unlike other Hadoop instances
  - The input files are very small
- Robust internal job monitoring mechanisms are usually custom-built

Thoughts and Notes
- Jobs written for Hadoop Streaming may be tested and run without Hadoop:
  cat inputfile.txt | mapper.py | sort | reducer.py > out.txt
  (a runnable sketch of such a mapper/reducer pair appears after the Questions slide)
- Projects can share resources
  - Hadoop is tunable to restrict resource utilization on a per-machine basis
- Provides instant productivity gains versus internal development
  - LSRD is all about science and science algorithms
  - Minimal time and budget for building internal systems

Takeaways
- Hadoop is proven and tested
  - Massively scalable out of the box
  - Cloud-based instances available from Amazon and others
- Shortest path to processing massive amounts of data
  - Extremely tolerant of hardware failure
  - No specialized hardware or software needed
- Flexible job API allows existing software skills to be leveraged
- Industry adoption means support skills are available

Questions
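As a companion to the local-testing note in the Thoughts and Notes slide, below is a minimal, hypothetical mapper/reducer pair in the spirit of the log processing example. It is not the deck's actual example: the file name logcount.py, the mode argument, and the choice to key on the first whitespace-delimited field of each log line are assumptions made purely for illustration. It can be exercised end to end with the pipe shown in the docstring, without a Hadoop cluster.

    #!/usr/bin/env python
    """Hypothetical log-counting job for Hadoop Streaming (a sketch).

    The same file serves as mapper and reducer, selected by the first
    argument, so it can be tested locally exactly as the deck suggests:

        cat access.log | ./logcount.py map | sort | ./logcount.py reduce > out.txt
    """
    import sys


    def mapper():
        # Emit one (key, 1) pair per log line, keyed on the first field.
        # What the key means (IP, status code, ...) depends on the log format.
        for line in sys.stdin:
            fields = line.split()
            if fields:
                print('%s\t1' % fields[0])


    def reducer():
        # Input arrives sorted by key, so counts accumulate over each run of
        # identical keys and are emitted when the key changes.
        current_key, count = None, 0
        for line in sys.stdin:
            if '\t' not in line:
                continue
            key, value = line.rstrip('\n').split('\t', 1)
            if key == current_key:
                count += int(value)
            else:
                if current_key is not None:
                    print('%s\t%d' % (current_key, count))
                current_key, count = key, int(value)
        if current_key is not None:
            print('%s\t%d' % (current_key, count))


    if __name__ == '__main__':
        if len(sys.argv) > 1 and sys.argv[1] == 'reduce':
            reducer()
        else:
            mapper()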