Hadoop and Amazon Web Services
Ken Krugler

Hadoop and AWS: Overview

Welcome
- I'm Ken Krugler
- Using Hadoop since The Dark Ages (2006)
- Apache Tika committer
- Active developer and trainer

Using Hadoop with AWS for…
- Large scale web crawling
- Machine learning/NLP
- ETL/Solr indexing

Course Overview
- Assumes you know basics of Hadoop
- Focus is on how to use Elastic MapReduce
- From n00b to knowledgeable in 10 modules…
  - Getting Started
  - Running Jobs
  - Clusters of Servers
  - Dealing with Data
  - Wikipedia Lab
  - Command Line Tools
  - Debugging Tips
  - Hive and Pig
  - Hive Lab
  - Advanced Topics

Why Use Elastic MapReduce?
- Reduce hardware & OPS/IT personnel costs
  - Pay for what you actually use
  - Don't pay for people you don't need
  - Don't pay for capacity you don't need
- More agility, less wait time for hardware
  - Don't waste time buying/racking/configuring servers
  - Many server classes to choose from (micro to massive)
- Less time doing Hadoop deployment & version mgmt
  - Optimized Hadoop is pre-installed

Hadoop and AWS: Getting Started

30 Seconds of Terminology
- AWS – Amazon Web Services
- S3 – Simple Storage Service
- EC2 – Elastic Compute Cloud
- EMR – Elastic MapReduce

The Three Faces of AWS
- Three ways to interact with AWS
  - Via web browser – the AWS Console
  - Via command line tools – e.g. the "elastic-mapreduce" CLI
  - Via the AWS API – Java, Python, Ruby, etc.
- We're using the AWS Console for the intro
  - The "Command Line Tools" module comes later
- Details of the CLI & API are in the online documentation
  - http://aws.amazon.com/documentation/elasticmapreduce/

Getting an Amazon Account
- All AWS services require an account
- Signing up is simple
  - Email address/password
  - Requires a credit card to pay for services
  - Uses a phone number to validate the account
- End result is an Amazon account
  - Has an account ID (looks like xxxx-yyyy-zzzz)
- Let's go get us an account
  - Go to http://aws.amazon.com
  - Click the "Sign Up Now" button

Credentials
- You have an account with a password
- This account has:
  - An account name (AWS Test)
  - An account id (8310-5790-6469)
  - An access key id (AKIAID4SOXLXJSFNG6SA)
  - A secret access key (jXw5qhiBrF…)
  - A canonical user id (10d8c2962138…)
- Let's go look at our account settings…
  - http://console.aws.amazon.com
  - Select "Security Credentials" from the account menu

Getting an EC2 Key Pair
- Go to https://console.aws.amazon.com/ec2
- Click on the "Key Pairs" link at the bottom-left
- Click on the "Create Key Pair" button
- Enter a simple, short name for the key pair
- Click the "Create" button
- Let's go make us a key pair…

Amazon S3 Bucket
- EMR saves data to S3
  - Hadoop job results
  - Hadoop job log files
- S3 data is organized as paths to files in a "bucket"
- You need to create a bucket before running a job
- Let's go do that now…

Summary
- At this point we are ready to run Hadoop jobs
- We have an AWS account – 8310-5790-6469
- We created a key pair – aws-test
- We created an S3 bucket – aws-test-kk
- In the next module we'll run a custom Hadoop job

Hadoop and AWS: Running a Hadoop Job

Overview of Running a Job
- ① Upload job jar & input data to S3
- ② Create a new Job Flow
- ③ Wait for completion, examine results

Setting Up the S3 Bucket
- One bucket can hold all elements for the job
  - Hadoop job jar – aws-test-kk/job/wikipedia-ngrams.jar
  - Input data – aws-test-kk/data/enwiki-split.xml
  - Results – aws-test-kk/results/
  - Logs – aws-test-kk/logs/
- We can use the AWS Console to create directories
  - And upload files too
- Let's go set up the bucket now…
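As a rough sketch of what that bucket layout looks like from the command line (the s3cmd tool is introduced in the Command Line Tools module; the bucket and file names are just the course's examples, so substitute your own):

    # Create the bucket, then upload the job jar and the input data
    s3cmd mb s3://aws-test-kk
    s3cmd put wikipedia-ngrams.jar s3://aws-test-kk/job/wikipedia-ngrams.jar
    s3cmd put enwiki-split.xml s3://aws-test-kk/data/enwiki-split.xml

    # EMR will write job output and log files under these prefixes:
    #   aws-test-kk/results/   and   aws-test-kk/logs/
    s3cmd ls s3://aws-test-kk/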
Creating the Job Flow
- A Job Flow has many settings:
  - A user-friendly name
  - The type of the job (custom jar, streaming, Hive, Pig)
  - The type and number of servers
  - The key pair to use
  - Where to put log files
  - And a few other less common settings
- Let's go create a job flow…

Monitoring a Job
- The AWS Console displays information about the job
  - State – starting, running, shutting down
  - Elapsed time – duration
  - Normalized Instance Hours – cost
- You can also terminate a job
- Let's go watch our job run…

Viewing Job Results
- My job puts its results into S3 (-outputdir s3n://xxx)
- The Hadoop cluster "goes away" at the end of the job
  - So anything in HDFS will be tossed
  - A Persistent Job Flow doesn't have this issue
- Hadoop writes job log files to S3
  - Using the location specified for the job (aws-test-kk/logs/)
- Let's go look at the job results…

Summary
- Jobs can be defined using the AWS Console
- Code and input data are loaded from S3
- Results and log files are saved back to S3
- In the next module we'll explore server options

Hadoop and AWS: Clusters of Servers

Servers for Clusters in EMR
- Based on EC2 instance type options
  - Currently eleven to choose from
  - See http://aws.amazon.com/ec2/instance-types/
- Each instance type has a regular and an API name
  - E.g. "Small (m1.small)"
- Each instance type has five attributes, including…
  - Memory
  - CPUs
  - Local storage

Server Details
- Uses Xen virtualization
  - So sometimes a server "slows down"
- Currently m1.large uses:
  - Linux version 2.6.21.7-2.fc8xen
  - Debian 5.0.8
- CPU has X virtual cores and Y "EC2 Compute Units"
  - 1 compute unit ≈ 1GHz Xeon processor (circa 2007)
  - E.g. 6.5 EC2 Compute Units (2 virtual cores with 3.25 EC2 Compute Units each)

Pricing
- Instance types have a per-hour cost
  - Price is a combination of EC2 base cost + EMR extra
  - http://aws.amazon.com/elasticmapreduce/pricing/
- Some typical combined prices
  - Small: $0.10/hour
  - Large: $0.40/hour
  - Extra Large: $0.80/hour
- Spot pricing is based on demand

The Large (m1.large) Instance Type
- Key attributes
  - 7.5GB memory
  - 2 virtual cores
  - 850GB local disk (2 drives)
  - 64-bit platform
- Default Hadoop configuration
  - 4 mappers, 2 reducers
  - 1600MB child JVM size
  - 200MB sort buffer (io.sort.mb)
- Let's go look at the server…

Typical Configurations
- Use m1.small for the master
  - NameNode & JobTracker don't need lots of horsepower
  - Up to 50 slaves; otherwise bump to m1.large
- Use m1.large for slaves – 'balanced' jobs
  - Reasonable CPU, disk space, I/O performance
- Use m1.small for slaves – external bottlenecks
  - E.g. web crawling, since most time is spent waiting
  - Slow disk I/O performance, slow CPU
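A hedged sketch of what picking these instance types looks like with the elastic-mapreduce CLI (covered in the Command Line Tools module). The --master-instance-type, --slave-instance-type and --num-instances flag names are from memory of the Ruby CLI, so confirm them against elastic-mapreduce --help:

    # One m1.small master plus m1.large slaves for a 'balanced' job
    # (flag names believed correct for the Ruby elastic-mapreduce CLI; verify locally)
    elastic-mapreduce --create --name "Balanced job" \
      --master-instance-type m1.small \
      --slave-instance-type m1.large \
      --num-instances 5 \
      --jar s3n://aws-test-kk/job/wikipedia-ngrams.jar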
Cluster Compute Instances
- Lots of cores, faster network
  - 10 Gigabit Ethernet
- Good for jobs with…
  - Lots of CPU cycles – parsing, NLP, machine learning
  - Lots of map-to-reduce data – many groupings

Cluster Compute Eight Extra Large Instance
- 60GB memory
- 8 real cores (88 EC2 Compute Units)
- 3.3TB disk

Hadoop and AWS: Dealing with Data

Data Sources & Sinks
- S3 – Simple Storage Service
  - Primary source of data
- Other AWS services
  - SimpleDB, DynamoDB
  - Relational Database Service (RDS)
  - Elastic Block Store (EBS)
- External via APIs
  - HTTP (web crawling) is most common

S3 Basics
- Data stored as objects (files) in buckets
  - <bucket>/<path>
  - The "key" to a file is its path
  - No real directories, just path segments
- Great as persistent storage for data
  - Reliable – up to 99.999999999%
  - Scalable – up to petabytes of data
  - Fast – highly parallel requests

S3 Access
- Via HTTP REST interface
  - Create (PUT/POST), Read (GET), Delete (DELETE)
  - Java API/tools use this same API
- Various command line tools
  - s3cmd – two different versions
- Or via your web browser

S3 Access via Browser
- Browser-based AWS Management Console
- S3Fox Organizer – Firefox plug-in
- Let's try out the browser-based solutions…

S3 Buckets
- Name of the bucket…
  - Must be unique across ALL users
  - Should be DNS-compliant
- General limitations
  - 100 buckets per account
  - Can't be nested – no buckets in buckets
- Not limited by
  - Number of files per bucket
  - Total data stored in the bucket's files

S3 Files
- Every file (aka object)
  - Lives in a bucket
  - Has a path which acts as the file's "key"
  - Is identified via bucket + path
- General limitations
  - Can't be modified (no random write or append)
  - Max size of 5TB (5GB per upload request)

Fun with S3 Paths
- AWS Console uses <bucket>/<path>
  - For specifying the location of the job jar
- AWS Console uses s3n://<bucket>/<path>
  - For specifying the location of log files
- The s3cmd tool uses s3://<bucket>/<path>

S3 Pricing
- Varies by region – numbers below are "US Standard"
- Data in is (currently) free
- Data out is also free within the same region
  - Otherwise starts at $0.12/GB, drops with volume
- Per-request cost varies, based on type of request
  - E.g. $0.01 per 10K GET requests
- Storage cost is per GB-month
  - Starts at $0.140/GB, drops with volume

S3 Access Control List (ACL)
- Read/Write permissions on a per-bucket basis
  - Read == listing objects in the bucket
  - Write == create/overwrite/delete objects in the bucket
- Read permissions on a per-object (file) basis
  - Read == read object data & metadata
- Also read/write ACP permissions on bucket/object
  - Reading & writing the ACL for a bucket or object
- FULL_CONTROL means all valid permissions

S3 ACL Grantee
- Who has what ACLs for each bucket/object?
- Can be an individual user
  - Based on canonical user ID
  - Can be "looked up" via the account's email address
- Can be a pre-defined group
  - Authenticated Users – any AWS user
  - All Users – anybody, with or without authentication
- Let's go look at some bucket & file ACLs…

S3 ACL Problems
- Permissions set on a bucket don't propagate
  - Objects created in the bucket have ACLs set by the creator
  - Read permission on a bucket ≠ able to read its objects
- So you can "own" a bucket (have FULL_CONTROL)
  - But you can't read the objects in the bucket
  - Though you can delete the objects in your bucket

S3 and Hadoop
- Just another file system
  - s3n://<bucket>/<path>
  - But the bucket name must be a valid hostname
- Works with DistCp as source and/or destination
  - E.g. hadoop distcp s3n://bucket1/ s3n://bucket2/
- Tweaks for Elastic MapReduce
  - Multi-part upload – files bigger than 5GB
  - S3DistCp – file patterns, compression, grouping, etc.
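To keep the three path styles straight, here is a small sketch using the course's example bucket (paths are illustrative):

    # The same data addressed three ways:
    s3cmd ls s3://aws-test-kk/data/          # s3cmd uses s3://
    hadoop fs -ls s3n://aws-test-kk/data/    # Hadoop uses s3n://
    #   aws-test-kk/data/enwiki-split.xml    # AWS Console job-jar field uses <bucket>/<path>

    # Copy from S3 into the cluster's HDFS, and results back out, with DistCp:
    hadoop distcp s3n://aws-test-kk/data/ hdfs:///data/
    hadoop distcp hdfs:///results/ s3n://aws-test-kk/results/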
Hadoop and AWS: Map-Reduce Lab

Wikipedia Processing Lab
- Lab covers running a typical Hadoop job using EMR
- Code parses a Wikipedia dump (available in S3)
  - <page><title>John Brisco</title>…</page>
  - One page per line of text, thus no splitting issues
- Output is top bigrams (character pairs) and counts
  - e.g. 'th' occurred 2,578,322 times
  - Format is a tab-separated value (TSV) text file

Wikipedia Processing Lab – Requirements
- You should already have your AWS account
- Download & expand the Wikipedia Lab
  - http://elasticmapreduce.s3.amazonaws.com/training/wikipedia-lab.tgz
- Follow the instructions in the README file
  - Located inside the expanded lab directory
- Let's go do that now…

Hadoop and AWS: Command Line Tools

Why Use Command Line Tools?
- Faster in some cases than the AWS Console
- Able to automate via shell scripts
- More functionality
  - E.g. dynamically expanding/shrinking the cluster
  - And a job flow with more than one step
- Easier to do interactive development
  - Launching a cluster without a step
  - Hive interactive mode

Why Not Use Command Line Tools?
- Often requires Python or Ruby
- Extra local configuration
- Windows users have additional pain
  - PuTTY & setting up the private key for ssh access

EMR Command Line Client
- Ruby script for command line interface (CLI)
  - elastic-mapreduce <command>
  - See http://aws.amazon.com/developertools/2264
- Steps to install & configure
  - Make sure you have Ruby 1.8 installed
  - Download the CLI tool from the page above
  - Edit your credentials.json file

Using the elastic-mapreduce CLI
- Editing the credentials.json file
  - Located inside the elastic-mapreduce directory
  - Enter your credentials (access id, private key, etc.)
  - Set the target AWS region
- Add the elastic-mapreduce directory to your path
  - E.g. in .bashrc, add export PATH=$PATH:xxx
- Let's give it a try…

s3cmd Command Line Client
- Python script for interacting with S3
- Supports all standard file operations
  - List files or buckets – s3cmd ls s3://<bucket>
  - Delete bucket – s3cmd rb s3://<bucket>
  - Delete file – s3cmd del s3://<bucket>/<path>
  - Put file – s3cmd put <local file> s3://<bucket>
  - Get file – s3cmd get s3://<bucket>/<path> <local path>
  - Etc…

Using s3cmd
- Download it from:
  - http://sourceforge.net/projects/s3tools/files/latest/download?source=files
- Expand/install it
  - Add it to your shell path
- Run `s3cmd --configure`
  - Enter your credentials
- Let's go try that…

Hadoop and AWS: Debugging Tips

Launching an 'Alive' Cluster with No Steps
- Lets you iteratively run Hadoop jobs
  - Same thing for Hive sessions
- Avoids the dreaded 10 second failure
- Requires the command line tool and/or ssh
  - ssh onto the master for interactive Hive
  - Use elastic-mapreduce to add steps for jobs

Interactively Adding Job Steps
- Launch the cluster
  - elastic-mapreduce --create --alive
- Wait for the cluster to start
  - elastic-mapreduce --list --active
- Add a step
  - elastic-mapreduce -j <job flow id> --jar <path to jar> …
- Don't forget to terminate the cluster!
- Let's try that now...
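Putting those steps together, a minimal interactive session might look like the following sketch. The --create/--alive/--list/--jar flags are the ones shown above; --arg and --terminate are standard CLI options as I recall them, and the j-XXXXXXXXXXXX job flow id is a placeholder:

    # 1. Launch a cluster with no steps, so it stays alive between jobs
    elastic-mapreduce --create --alive --name "Dev cluster"

    # 2. Wait until the cluster shows up as running/waiting
    elastic-mapreduce --list --active

    # 3. Add a job step (repeat as often as needed while iterating)
    elastic-mapreduce -j j-XXXXXXXXXXXX --jar s3n://aws-test-kk/job/wikipedia-ngrams.jar \
      --arg s3n://aws-test-kk/data/enwiki-split.xml \
      --arg s3n://aws-test-kk/results/

    # 4. Don't forget this, or the cluster keeps billing by the hour!
    elastic-mapreduce -j j-XXXXXXXXXXXX --terminate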
Enabling Debugging
- Via the AWS Console, during Job Flow creation
  - Set the "Enable Debugging" option to Yes (Advanced Options)
- Via the elastic-mapreduce tools
  - --enable-debugging parameter
- Stores extra information in SimpleDB
  - Persistent access to some job/task data
  - Accessible via the [Debug] button in the EMR console
- Let's take a look…

SSH Fun
- SSHing onto the master server in your cluster
  - Needs the private key (PEM) file you downloaded
  - Key file privileges must be restricted
    - chmod 600 <xxx.pem>
  - Use an ssh client in a terminal, or PuTTY on Windows
- Lets you immediately see log files
  - And there's that sexy Lynx browser
- Time to hop onto the master…

SSHing to Slaves
- Handy way to look at slave log files
  - And monitor load, active tasks, etc.
- But the master doesn't have your PEM file
  - Copy it to the master first
  - scp -i <pem file> <pem file> hadoop@<xxx>:~/
- Then log into the master, get the slave name(s), and ssh to them
  - ssh -i <pem file> hadoop@<xxx>
  - hadoop dfsadmin -report

Inspecting the Job Flow Description
- Some doh! errors don't generate log output
  - E.g. wrong location of the job jar
- Inspecting the job flow shows the problem
  - Via the AWS Console
  - Via the CLI
    - elastic-mapreduce --describe -j j-3MXSD6Q88CCDJ
    - "LastStateChangeReason": "Jar doesn't exist: s3n:\/\/aws-test-kk\/job\/sensor-data.job"

Hadoop GUI
- Standard Hadoop GUI
  - But ports are blocked by the security group
  - And slaves use IP addresses or internal DNS names
- Requires a proxy server
  - ssh -i <pem file> -ND <port> hadoop@<public DNS>
  - And FoxyProxy (for the Firefox browser)
- Configuration details on the AWS web site
  - http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingtheHadoopUserInterface.html

Hadoop GUI Details
- JobTracker is on the public DNS, port 9100
- NameNode is on the public DNS, port 9101
- You can also edit your security group
  - Always called ElasticMapReduce-master
  - Open up all ports for access from your computer's IP
- But it's hard(er) to use the slave daemon GUI
  - Often it's an IP address, so FoxyProxy doesn't work
  - External access can't resolve the IP or internal DNS
- Let's take a look at a job…
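Pulling the ssh pieces from this module into one place, a typical debugging session looks roughly like this (host names and the local proxy port are placeholders):

    # Restrict key permissions, then ssh to the master node
    chmod 600 aws-test.pem
    ssh -i aws-test.pem hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com

    # Copy the key up first if you also want to reach the slaves
    scp -i aws-test.pem aws-test.pem hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:~/

    # On the master: list the slaves, then ssh to one of them
    hadoop dfsadmin -report
    ssh -i aws-test.pem hadoop@<slave internal DNS name>

    # From your own machine: SOCKS proxy for the JobTracker/NameNode GUI
    # (pair this with FoxyProxy; ports 9100/9101 as described above)
    ssh -i aws-test.pem -ND 8157 hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com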
Hadoop and AWS: Hive & Pig

Using Hive & Pig in EMR
- Familiarity with Hive & Pig is assumed
  - Module instead covers using these tools in the EMR context
- Advantages of EMR for Hive & Pig jobs
  - Instances have the tools pre-installed & configured
  - Simplified job submission & control
  - Amazon-developed extensions like the JSON SerDe

Running a Hive Job Flow
- ① Upload Hive script & input data to S3
- ② Create a new Hive Job Flow
- ③ Wait for completion, examine results

Hive Job Flow vs. Custom Jar
- Both via the AWS Management Console
  - The elastic-mapreduce CLI also works
- "Code" (the Hive script) is pulled from S3
- Source data is loaded from S3
- Results are saved in S3

Setting Up the S3 Bucket
- One bucket can hold all elements for the job flow
  - Hive script – aws-test-kk/script/wikipedia-authors.hql
  - Input data – aws-test-kk/data/enwiki-split.json
  - Results – aws-test-kk/hive-results/
  - Logs – aws-test-kk/logs/
- We can use the AWS Console to create directories
  - And upload files too
- Let's go set up the bucket now…

Creating the Job Flow
- A Job Flow has many settings:
  - A user-friendly name (Wikipedia Authors)
  - The type of the job (Hive)
  - The type and number of servers (m1.small, 2 slaves)
  - The key pair to use (aws-test)
  - Where to put log files
  - And a few other less common settings
- Let's go create a job flow…

Monitoring a Job
- The AWS Console displays information about the job
  - State – starting, running, shutting down
  - Elapsed time – duration
  - Normalized Instance Hours – cost
- You can also terminate a job
- Let's go watch our job run…

Viewing Job Results
- My job puts its results into S3 (-outputdir s3n://xxx)
- The Hadoop cluster "goes away" at the end of the job
  - So anything in HDFS will be tossed
  - A Persistent Job Flow doesn't have this issue
- Hadoop writes job log files to S3
  - Using the location specified for the job (aws-test-kk/logs/)
- Let's go look at the job results…

Interactive vs. Batch Job Flow
- Batch works well for production
- But developing Hive scripts is often trial & error
  - And you don't want to pay the 10 second penalty
  - Cluster launches, script fails, cluster terminates
  - You pay for 1 hour * size of your cluster
  - And you spend several minutes waiting…

Interacting with Hive via the CLI
- Create an EMR cluster that stays "alive"
- SSH into the master node
- Use the Hive interpreter
  - Set up your environment
  - Interactively execute Hive queries
- Terminate the job flow
- Let's give that a try…
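For reference, a hedged sketch of both modes from the command line. The --hive-script, --args and --hive-interactive options are the Hive-related CLI flags as I recall them, so confirm the exact form with elastic-mapreduce --help before relying on it:

    # Batch: run the Hive script from the bucket above as a job flow
    elastic-mapreduce --create --name "Wikipedia Authors" \
      --hive-script s3n://aws-test-kk/script/wikipedia-authors.hql \
      --args -d,INPUT=s3n://aws-test-kk/data/,-d,OUTPUT=s3n://aws-test-kk/hive-results/

    # Interactive: launch an alive cluster with Hive installed,
    # then ssh to the master and run the `hive` interpreter
    elastic-mapreduce --create --alive --name "Hive dev" --hive-interactive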
Pig Job Flows
- Almost identical to a Hive Job Flow:
  - Interactive mode is used to develop the script
  - Batch mode executes the script, loaded from S3
- Differences are:
  - It's a Pig Job Flow, not a Hive Job Flow
  - The script file contains Pig Latin, not Hive QL

Hadoop and AWS: Hive Lab

Clicked Impressions Lab
- Lab covers running a typical Hive job using EMR
- Read two JSON-format log files from S3
  - Impressions (impressionId, requestBeginTime, etc.)
  - Clicks (impressionId, etc.)
- Join the input tables on impressionId
- Output table (Impressions fields plus a "clicked" boolean)
  - Date format conversion
  - Partitioned by date & hour

Clicked Impressions Lab – Requirements
- You should already have your AWS account
- Download & expand the Clicked Impressions Lab
  - http://xxx
- Follow the instructions in the README file
  - Located inside the expanded lab directory
- Let's go do that now…

Hadoop and AWS: Advanced Elastic MapReduce

Bootstrap Actions
- Scripts that are run before starting Hadoop
  - Altering the Hadoop configuration
  - Installing additional software
- Scripts are loaded from S3
  - Using s3n://<bucket>/<path> syntax
- Several built-in scripts
  - Configure Daemons
  - Configure Hadoop
  - Install Ganglia
  - Add swap file

Specifying Bootstrap Actions
- Via the AWS Console
  - Part of defining the Job Flow
  - Pick built-in or custom
- Via elastic-mapreduce
  - --bootstrap-action <path to script in s3> --args <args>
- Multiple bootstrap actions are possible

configure-hadoop Bootstrap Action
- Most common action to use
  - Tweak the default settings of the cluster
  - E.g. increase io.sort.mb to reduce map task spills
- Can merge in an xxx-site.xml file from S3
  - -C <path to core-site.xml file>
  - -H <path to hdfs-site.xml file>
  - -M <path to mapred-site.xml file>
  - The file to be merged must contain the appropriate params

Setting Params with configure-hadoop
- Specify individual Hadoop parameters to change
  - Update to core-site.xml: -c <key>=<value>
  - Update to hdfs-site.xml: -h <key>=<value>
  - Update to mapred-site.xml: -m <key>=<value>
  - E.g. -m io.sort.mb=600

Spot Pricing
- You bid for servers
  - Specify your max rate per hour
  - Might not get servers if your rate is too low
- You pay the current spot rate, not your bid
  - Servers "go away" if the spot rate > bid
- Typical spot price is 1/3 of the on-demand price
  - But prices can spike to > on-demand

When to Use Spot Pricing
- If you don't care when the cluster dies
  - Then use spot pricing for all slaves
  - Best to use on-demand for the master
  - Save data processing checkpoints
- If you can't have the cluster die
  - Then use spot pricing for "task-only" slaves
  - The "core" slaves run HDFS using on-demand
  - More details on that in a bit

How to Use Spot Pricing
- Via the AWS Console
- Via elastic-mapreduce
  - --bid-price <hourly rate>
  - Can bid separately on master, core, task groups

The Task Group
- Optional third group, beyond "master" and "core"
- Servers in the cluster that only run the TaskTracker
  - Thus no HDFS data is stored
- Useful with spot pricing
  - No data lost if they go away
  - Some impact on efficiency of task-only slaves
- Also useful for dynamic cluster sizing

Specifying Task Groups
- Via the AWS Console
- Via elastic-mapreduce
  - --instance-group task

Resizing Your Cluster
- Can't be done via the AWS Console
- You can add a task group
  - --add-instance-group task <specify type, count, bid>
- You can change the # of servers
  - --set-num-core-group-instances <new count>
  - --set-num-task-group-instances <new count>
- But you can't decrease the core group count
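As a closing sketch, here is how the spot-pricing, task-group and resizing options above might be combined. The --bid-price, --instance-group, --add-instance-group and --set-num-*-group-instances flags are the ones listed in this module; the --instance-type/--instance-count spellings, the bid values, and the job flow id are assumptions, so verify them against elastic-mapreduce --help:

    # Launch with an on-demand master/core and a spot-priced task group
    elastic-mapreduce --create --alive --name "Spot cluster" \
      --instance-group master --instance-type m1.small --instance-count 1 \
      --instance-group core   --instance-type m1.large --instance-count 4 \
      --instance-group task   --instance-type m1.large --instance-count 10 --bid-price 0.08

    # Later: grow or shrink the task group (the core group can only grow)
    elastic-mapreduce -j j-XXXXXXXXXXXX --set-num-task-group-instances 20
    elastic-mapreduce -j j-XXXXXXXXXXXX --set-num-core-group-instances 8

    # Or add a task group to a cluster that started without one
    elastic-mapreduce -j j-XXXXXXXXXXXX --add-instance-group task \
      --instance-type m1.large --instance-count 10 --bid-price 0.08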