EMR Training - Amazon Web Services

Hadoop and Amazon Web Services
Ken Krugler
Hadoop and AWS
 I’m Ken Krugler
 Using Hadoop since The Dark Ages (2006)
 Apache Tika committer
 Active developer and trainer
 Using Hadoop with AWS for…
 Large scale web crawling
 Machine learning/NLP
 ETL/Solr indexing
Course Overview
 Assumes you know basics of Hadoop
 Focus is on how to use Elastic MapReduce
 From n00b to knowledgeable in 10 modules…
Getting Started
Running Jobs
Clusters of Servers
Dealing with Data
Wikipedia Lab
Command Line Tools
Debugging Tips
Hive and Pig
Hive Lab
Advanced Topics
Why Use Elastic MapReduce?
 Reduce hardware & OPS/IT personnel costs
 Pay for what you actually use
 Don’t pay for people you don’t need
 Don’t pay for capacity you don’t need
 More agility, less wait time for hardware
 Don’t waste time buying/racking/configuring servers
 Many server classes to choose from (micro to massive)
 Less time doing Hadoop deployment & version mgmt
 Optimized Hadoop is pre-installed
Getting Started
30 Seconds of Terminology
 AWS – Amazon Web Services
 S3 – Simple Storage Service
 EC2 – Elastic Compute Cloud
 EMR – Elastic MapReduce
The Three Faces of AWS
 Three ways to interact with AWS
 Via web browser – the AWS Console
 Via command line tools – e.g. “elastic-mapreduce” CLI
 Via the AWS API – Java, Python, Ruby, etc.
 We’re using the AWS Console for the intro
 The “Command Line Tools” module is later
 Details of CLI & API found in online documentation
 http://aws.amazon.com/documentation/elasticmapreduce/
Getting an Amazon Account
 All AWS services require an account
 Signing up is simple
 Email address/password
 Requires credit card, to pay for services
 Uses phone number to validate account
 End result is an Amazon account
 Has an account ID (looks like xxxx-yyyy-zzzz)
 Let’s go get us an account
 Go to http://aws.amazon.com
 Click the “Sign Up Now” button
 You have an account with a password
 This account has:
An account name (AWS Test)
An account id (8310-5790-6469)
An access key id (AKIAID4SOXLXJSFNG6SA)
A secret access key (jXw5qhiBrF…)
A canonical user id (10d8c2962138…)
 Let’s go look at our account settings…
 http://console.aws.amazon.com
 Select “Security Credentials” from account menu
Getting an EC2 Key Pair
Go to https://console.aws.amazon.com/ec2
Click on the “Key Pairs” link at the bottom-left
Click on the “Create Key Pair” button
Enter a simple, short name for the key pair
Click the “Create” button
Let’s go make us a key pair…
Amazon S3 Bucket
 EMR saves data to S3
 Hadoop job results
 Hadoop job log files
 S3 data is organized as paths to files in a “bucket”
 You need to create a bucket before running a job
 Let’s go do that now…
 At this point we are ready to run Hadoop jobs
 We have an AWS account - 8310-5790-6469
 We created a key pair – aws-test
 We created an S3 bucket – aws-test-kk
 In the next module we’ll run a custom Hadoop job
Running a Hadoop Job
Overview of Running a Job
① Upload job jar & input data to S3
② Create a new Job Flow
③ Wait for completion, examine results
Setting Up the S3 Bucket
 One bucket can hold all elements for job
Hadoop job jar – aws-test-kk/job/wikipedia-ngrams.jar
Input data – aws-test-kk/data/enwiki-split.xml
Results – aws-test-kk/results/
Logs – aws-test-kk/logs/
 We can use AWS Console to create directories
 And upload files too
 Let’s go set up the bucket now…
Creating the Job Flow
 A Job Flow has many settings:
A user-friendly name
The type of the job (custom jar, streaming, Hive, Pig)
The type and number of servers
The key pair to use
Where to put log files
And a few other less common settings
 Let’s go create a job flow…
Monitoring a Job
 AWS Console displays information about the job
 State – starting, running, shutting down
 Elapsed time – duration
 Normalized Instance Hours – cost
 You can also terminate a job
 Let’s go watch our job run…
Viewing Job Results
 My job puts its results into S3 (-outputdir s3n://xxx)
 The Hadoop cluster “goes away” at end of job
 So anything in HDFS will be tossed
 Persistent Job Flow doesn’t have this issue
 Hadoop writes job log files to S3
 Using location specified for job (aws-test-kk/logs/)
 Let’s go look at the job results…
 Jobs can be defined using the AWS Console
 Code and input data are loaded from S3
 Results and log files are saved back to S3
 In the next module we’ll explore server options
Clusters of Servers
Servers for Clusters in EMR
 Based on EC2 instance type options
 Currently eleven to choose from
 See http://aws.amazon.com/ec2/instance-types/
 Each instance type has a regular name and an API name
 E.g. “Small (m1.small)”
 Each instance type has five attributes, including…
 Memory
 CPUs
 Local storage
Server Details
 Uses Xen virtualization
 So sometimes a server “slows down”
 Currently m1.large uses:
 Linux version
 Debian 5.0.8
 CPU has X virtual cores and Y “EC2 Compute Units”
 1 compute unit ≈ 1GHz Xeon processor (circa 2007)
 E.g. 6.5 EC2 Compute Units
• (2 virtual cores with 3.25 EC2 Compute Units each)
 Instance types have per-hour cost
 Price is combination of EC2 base cost + EMR extra
 http://aws.amazon.com/elasticmapreduce/pricing/
 Some typical combined prices
 Small
 Large
 Extra Large $0.80/hour
 Spot pricing is based on demand
The Large (m1.large) Instance Type
 Key attributes
7.5GB memory
2 virtual cores
850GB local disk (2 drives)
64-bit platform
 Default Hadoop configuration
 4 mappers, 2 reducers
 1600MB child JVM size
 200MB sort buffer (io.sort.mb)
 Let’s go look at the server…
Typical Configurations
 Use m1.small for the master
 NameNode & JobTracker don’t need lots of horsepower
 Up to 50 slaves, otherwise bump to m1.large
 Use m1.large for slaves - ‘balanced’ jobs
 Reasonable CPU, disk space, I/O performance
 Use m1.small for slaves – external bottlenecks
 E.g. web crawling, since most time spent waiting
 Slow disk I/O performance, slow CPU
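The rules of thumb on this slide are simple enough to write down as code. This is only a sketch of the guidance above; the function names are made up for illustration, and real sizing depends on your job.

```python
def pick_master_type(num_slaves: int) -> str:
    """m1.small is enough for up to 50 slaves; beyond that, bump to m1.large."""
    return "m1.small" if num_slaves <= 50 else "m1.large"


def pick_slave_type(externally_bottlenecked: bool) -> str:
    """m1.small when tasks mostly wait on something external (e.g. web
    crawling), m1.large for 'balanced' jobs."""
    return "m1.small" if externally_bottlenecked else "m1.large"


if __name__ == "__main__":
    # A balanced job on 20 slaves.
    print(pick_master_type(20), pick_slave_type(False))
```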
Cluster Compute Instances
 Lots of cores, faster network
 10 Gigabit Ethernet
 Good for jobs with…
 Lots of CPU cycles – parsing, NLP, machine learning
 Lots of map-to-reduce data – many groupings
 Cluster Compute Eight Extra Large Instance
 60GB memory
 16 real cores (2 x 8-core CPUs, 88 EC2 Compute Units)
 3.3TB disk
Dealing with Data
Data Sources & Sinks
 S3 – Simple Storage Service
 Primary source of data
 Other AWS Services
 SimpleDB, DynamoDB
 Relational Database Service (RDS)
 Elastic Block Store (EBS)
 External via APIs
 HTTP (web crawling) is most common
S3 Basics
 Data stored as objects (files) in buckets
 <bucket>/<path>
 “key” to file is path
 No real directories, just path segments
 Great as persistent storage for data
 Durable – designed for 99.999999999% durability
 Scalable – up to petabytes of data
 Fast – highly parallel requests
S3 Access
 Via HTTP REST interface
 Create (PUT/POST), Read (GET), Delete (DELETE)
 Java API/tools use this same API
 Various command line tools
 s3cmd – two different versions 
 Or via your web browser
S3 Access via Browser
 Browser-based
 AWS Management Console
 S3Fox Organizer – Firefox plug-in
 Let’s try out the browser-based solutions…
S3 Buckets
 Name of the bucket…
 Must be unique across ALL users
 Should be DNS-compliant
 General limitations
 100 buckets per account
 Can’t be nested – no buckets in buckets
 Not limited by
 Number of files/bucket
 Total data stored in bucket’s files
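A quick local check can catch bucket names that are not DNS-compliant before you try to create them. The sketch below is a simplified approximation of the DNS-style rules (3 to 63 characters, lowercase letters, digits, hyphens and dots, no IP-address look-alikes). The function name is hypothetical, and the S3 documentation remains the authority on the exact rules.

```python
import re

# One DNS label: lowercase letters, digits, hyphens; no leading/trailing hyphen.
_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")
# Names formatted like an IPv4 address are rejected.
_IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def is_dns_compliant_bucket_name(name: str) -> bool:
    """Approximate check; it cannot tell whether the name is already taken."""
    if not 3 <= len(name) <= 63:
        return False
    if _IPV4.match(name):
        return False
    # Every dot-separated label must be valid (this also rejects "a..b").
    return all(_LABEL.match(label) for label in name.split("."))


if __name__ == "__main__":
    for candidate in ("aws-test-kk", "AWS_Test", "192.168.0.1"):
        print(candidate, is_dns_compliant_bucket_name(candidate))
```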
S3 Files
 Every file (aka object)
 Lives in a bucket
 Has a path which acts as the file’s “key”
 Is identified via bucket + path
 General limitations
 Can’t be modified (no random write or append)
 Max size of 5TB (5GB per upload request)
Fun with S3 Paths
 AWS Console uses <bucket>/<path>
 For specifying location of job jar
 AWS Console uses s3n://<bucket>/<path>
 For specifying location of log files
 s3cmd tool uses s3://<bucket>/<path>
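All three spellings on this slide point at the same bucket and path, so a small helper can translate between them. This is a minimal sketch: the function names are made up for illustration, and it only handles the s3:// and s3n:// prefixes that appear in this deck.

```python
from typing import Tuple


def split_s3_location(location: str) -> Tuple[str, str]:
    """Split '<bucket>/<path>', 's3://...' or 's3n://...' into (bucket, path)."""
    for scheme in ("s3n://", "s3://"):
        if location.startswith(scheme):
            location = location[len(scheme):]
            break
    bucket, _, path = location.partition("/")
    if not bucket:
        raise ValueError("no bucket name in location")
    return bucket, path


def as_console_path(location: str) -> str:
    """Form used by the AWS Console for the job jar: <bucket>/<path>."""
    bucket, path = split_s3_location(location)
    return f"{bucket}/{path}"


def as_s3n_url(location: str) -> str:
    """Form used for log locations and by Hadoop: s3n://<bucket>/<path>."""
    bucket, path = split_s3_location(location)
    return f"s3n://{bucket}/{path}"


def as_s3cmd_url(location: str) -> str:
    """Form used by the s3cmd tool: s3://<bucket>/<path>."""
    bucket, path = split_s3_location(location)
    return f"s3://{bucket}/{path}"


if __name__ == "__main__":
    print(as_s3n_url("aws-test-kk/logs/"))
    print(as_s3cmd_url("s3n://aws-test-kk/data/enwiki-split.xml"))
```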
S3 Pricing
 Varies by region – numbers below are “US Standard”
 Data in is (currently) free
 Data out is also free within same region
 Otherwise starts at $0.12/GB, drops w/volume
 Per-request cost varies, based on type of request
 E.g. $0.01 per 10K GET requests
 Storage cost is per GB-month
 Starts at $0.140/GB, drops w/volume
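To get a feel for these numbers, here is a back-of-the-envelope monthly estimate. It is a sketch only: it uses the starting "US Standard" rates quoted on this slide, ignores the volume discounts, counts only GET requests, and the function name and usage figures are made up.

```python
from decimal import Decimal

# Starting rates from the slide (first pricing tier only).
STORAGE_PER_GB_MONTH = Decimal("0.140")
TRANSFER_OUT_PER_GB = Decimal("0.12")
GET_PER_10K_REQUESTS = Decimal("0.01")


def estimate_monthly_s3_cost(stored_gb: int, get_requests: int,
                             gb_out_of_region: int) -> Decimal:
    """Rough monthly cost in dollars; data in and same-region data out are free."""
    storage = STORAGE_PER_GB_MONTH * Decimal(stored_gb)
    requests = GET_PER_10K_REQUESTS * Decimal(get_requests) / Decimal(10_000)
    transfer = TRANSFER_OUT_PER_GB * Decimal(gb_out_of_region)
    return (storage + requests + transfer).quantize(Decimal("0.01"))


if __name__ == "__main__":
    # Hypothetical usage: 100GB stored, 1M GETs, 10GB sent out of the region.
    print(estimate_monthly_s3_cost(100, 1_000_000, 10))
```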
S3 Access Control List (ACL)
 Read/Write permissions on per-bucket basis
 Read == listing objects in bucket
 Write == create/overwrite/delete objects in bucket
 Read permissions on per-object (file) basis
 Read == read object data & metadata
 Also read/write ACP permissions on bucket/object
 Reading & writing ACL for bucket or object
 FULL_CONTROL means all valid permissions
S3 ACL Grantee
 Who has what ACLs for each bucket/object?
 Can be individual user
 Based on Canonical user ID
 Can be “looked up” via account’s email address
 Can be a pre-defined group
 Authenticated Users – any AWS user
 All Users – anybody, with or without authentication
 Let’s go look at some bucket & file ACLs…
S3 ACL Problems
 Permissions set on bucket don’t propagate
 Objects created in bucket have ACLs set by creator
 Read permission on bucket ≠ able to read objects
 So you can “own” a bucket (have FULL_CONTROL)
 But you can’t read the objects in the bucket
 Though you can delete the objects in your bucket
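The "own the bucket but can't read the object" situation is easier to see in a toy model. The sketch below is not how S3 is implemented. It only encodes the rules listed on the ACL slides, and all class, function and user names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class S3Object:
    acl: Dict[str, Set[str]] = field(default_factory=dict)


@dataclass
class Bucket:
    acl: Dict[str, Set[str]] = field(default_factory=dict)
    objects: Dict[str, S3Object] = field(default_factory=dict)


def _has(acl: Dict[str, Set[str]], user: str, permission: str) -> bool:
    granted = acl.get(user, set())
    return permission in granted or "FULL_CONTROL" in granted


def put_object(bucket: Bucket, user: str, key: str) -> None:
    """Creating an object needs bucket WRITE; the object's ACL is set by its creator."""
    if not _has(bucket.acl, user, "WRITE"):
        raise PermissionError(f"{user} cannot write to this bucket")
    bucket.objects[key] = S3Object(acl={user: {"FULL_CONTROL"}})


def can_list(bucket: Bucket, user: str) -> bool:
    """Bucket READ only means listing the objects."""
    return _has(bucket.acl, user, "READ")


def can_read_object(bucket: Bucket, user: str, key: str) -> bool:
    """Reading data is decided by the object's own ACL, not the bucket's."""
    return _has(bucket.objects[key].acl, user, "READ")


def can_delete_object(bucket: Bucket, user: str, key: str) -> bool:
    """Deleting is a bucket WRITE operation."""
    return _has(bucket.acl, user, "WRITE")


if __name__ == "__main__":
    bucket = Bucket(acl={"owner": {"FULL_CONTROL"}, "partner": {"WRITE"}})
    put_object(bucket, "partner", "results/part-00000")
    print(can_list(bucket, "owner"),
          can_read_object(bucket, "owner", "results/part-00000"),
          can_delete_object(bucket, "owner", "results/part-00000"))
```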
S3 and Hadoop
 Just another file system
 s3n://<bucket>/<path>
 But bucket name must be valid hostname
 Works with DistCp as source and/or destination
 E.g. hadoop distcp s3n://bucket1/ s3n://bucket2/
 Tweaks for Elastic MapReduce
 Multi-part upload – files bigger than 5GB
 S3DistCp – file patterns, compression, grouping, etc.
Map-Reduce Lab
Wikipedia Processing Lab
 Lab covers running typical Hadoop job using EMR
 Code parses Wikipedia dump (available in S3)
 <page><title>John Brisco</title>…</page>
 One page per line of text, thus no splitting issues
 Output is top bigrams (character pairs) and counts
 e.g. ‘th’ occurred 2,578,322 times
 Format is tab-separated value (TSV) text file
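For a sense of what the job computes, here is the same idea in plain Python with no Hadoop involved. It is a sketch, not the lab code: it assumes bigrams are adjacent letter pairs inside whitespace-separated words, lowercased, and the function names are made up.

```python
from collections import Counter
from typing import Iterable


def count_bigrams(lines: Iterable[str]) -> Counter:
    """Count adjacent letter pairs within each word of each line."""
    counts: Counter = Counter()
    for line in lines:
        for word in line.lower().split():
            for i in range(len(word) - 1):
                pair = word[i:i + 2]
                if pair.isalpha():  # skip pairs containing digits or punctuation
                    counts[pair] += 1
    return counts


def top_bigrams_tsv(counts: Counter, n: int) -> str:
    """Format the n most frequent bigrams as tab-separated 'bigram<TAB>count' lines."""
    return "\n".join(f"{pair}\t{count}" for pair, count in counts.most_common(n))


if __name__ == "__main__":
    print(top_bigrams_tsv(count_bigrams(["the then that"]), 3))
```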
Wikipedia Processing Lab - Requirements
 You should already have your AWS account
 Download & expand the Wikipedia Lab
 http://elasticmapreduce.s3.amazonaws.com/training/wik
 Follow the instructions in the README file
 Located inside of expanded lab directory
 Let’s go do that now…
Command Line Tools
Why Use Command Line Tools?
 Faster in some cases than AWS Console
 Able to automate via shell scripts
 More functionality
 E.g. dynamically expanding/shrinking cluster
 And have a job flow with more than one step
 Easier to do interactive development
 Launching cluster without a step
 Hive interactive mode
Why Not Use Command Line Tools?
 Often requires Python or Ruby
 Extra local configuration
 Windows users have additional pain
 Putty & setting up private key for ssh access
EMR Command Line Client
 Ruby script for command line interface (CLI)
 elastic-mapreduce <command>
 See http://aws.amazon.com/developertools/2264
 Steps to install & configure
 Make sure you have Ruby 1.8 installed
 Download the CLI tool from the page above
 Edit your credentials.json file
Using the elastic-mapreduce CLI
 Editing the credentials.json file
 Located inside of the elastic-mapreduce directory
 Enter your credentials (access id, private key, etc)
 Set the target AWS region
 Add elastic-mapreduce directory to your path
 E.g. in .bashrc, add export PATH=$PATH:xxx
 Let’s give it a try…
s3cmd Command Line Client
 Python script for interacting with S3
 Supports all standard file operations
List files or buckets – s3cmd ls s3://<bucket>
Delete bucket – s3cmd rb s3://<bucket>
Delete file – s3cmd del s3://<bucket>/<path>
Put file – s3cmd put <local file> s3://<bucket>
Get file – s3cmd get s3://<bucket>/<path> <local path>
Using s3cmd
 Download it from:
 http://sourceforge.net/projects/s3tools/files/latest/downl
 Expand/install it:
 Add to your shell path
 Run `s3cmd --configure`
 Enter your credentials
 Let’s go try that…
Debugging Tips
Launching ‘Alive’ Cluster with no Steps
 Lets you iteratively run Hadoop jobs
 Same thing for Hive sessions
 Avoids the dreaded 10 second failure
 Requires the command line tool and/or ssh
 ssh onto master for interactive Hive
 Use elastic-mapreduce to add steps for jobs
Interactively Adding Job Steps
 Launch the cluster
 elastic-mapreduce --create --alive
 Wait for the cluster to start
 elastic-mapreduce --list --active
 Add a step
 elastic-mapreduce -j <job flow id> --jar <path to jar> …
 Don’t forget to terminate the cluster!
 Let’s try that now...
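If you want to script this sequence, one option is to assemble the command lines in code and hand them to your shell or to subprocess. The sketch below only builds the argument lists and runs nothing. It uses the flags from this slide plus the -j form shown with --describe elsewhere in this deck. The function names and the example jar path are illustrative.

```python
import shlex
from typing import List

CLI = "elastic-mapreduce"


def create_alive_cluster_cmd() -> List[str]:
    """Launch a cluster that stays up with no steps."""
    return [CLI, "--create", "--alive"]


def list_active_cmd() -> List[str]:
    """List active job flows, to see when the cluster has started."""
    return [CLI, "--list", "--active"]


def add_jar_step_cmd(job_flow_id: str, jar_path: str) -> List[str]:
    """Add a custom jar step to a running job flow (job arguments omitted)."""
    return [CLI, "-j", job_flow_id, "--jar", jar_path]


if __name__ == "__main__":
    commands = [
        create_alive_cluster_cmd(),
        list_active_cmd(),
        # Hypothetical job flow id and jar location.
        add_jar_step_cmd("j-3MXSD6Q88CCDJ", "s3n://aws-test-kk/job/wikipedia-ngrams.jar"),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```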
Enabling Debugging
 Via AWS Console, when creating the Job Flow
 Set “Enable Debugging” option to Yes (Advanced Options)
 Via elastic-mapreduce tools
 --enable-debugging parameter
 Stores extra information in SimpleDB
 Persistent access to some job/task data
 Accessible via [Debug] button in EMR console
 Let’s take a look…
 SSHing onto master server in your cluster
 Needs the private key (PEM) file you downloaded
 Key file privileges must be restricted
• chmod 600 <xxx.pem>
 Use ssh client in terminal, or PuTTY on Windows
 Lets you immediately see log files
 And there’s that sexy Lynx browser
 Time to hop onto the master…
SSHing to Slaves
 Handy way to look at slave log files
 And monitor load, active tasks, etc.
 But the master doesn’t have your PEM file
 Copy it to master first
 scp -i <pem file> <pem file> hadoop@<xxx>:~/
 Then log into master, get slave name(s), ssh to them
 ssh -i <pem file> hadoop@<xxx>
 hadoop dfsadmin -report
Inspecting Job Flow Description
 Some doh! errors don’t generate log output
 E.g. wrong location of job jar
 Inspecting the job flow shows the problem
 Via AWS Console
 Via CLI
 elastic-mapreduce --describe -j j-3MXSD6Q88CCDJ
 "LastStateChangeReason": "Jar doesn't exist:
Hadoop GUI
 Standard Hadoop GUI
 But ports are blocked by security group
 And slaves use IP addresses or internal DNS names
 Requires proxy server
 ssh -i <pem file> -ND <port> hadoop@<public DNS>
 And FoxyProxy (for Firefox browser)
 Configuration details on AWS web site
 http://docs.amazonwebservices.com/ElasticMapReduce/lat
Hadoop GUI Details
 JobTracker is on public DNS, port 9100
 NameNode is on public DNS, port 9101
 You can also edit your security group
 Always called ElasticMapReduce-master
 Open up all ports for access from your computer’s IP
 But it’s hard(er) to use the slave daemon GUI
 Often it’s an IP address, so FoxyProxy doesn’t work
 External access can’t resolve IP or internal DNS
 Let’s take a look at a job…
Hive & Pig
Using Hive & Pig in EMR
 Familiarity with Hive & Pig assumed
 Module instead covers using these tools in EMR context
 Advantages of EMR for Hive & Pig jobs
 Instances have tools pre-installed & configured
 Simplified job submission & control
 Amazon-developed extensions like JSON SerDe
Running a Hive Job Flow
① Upload Hive script & input data to S3
② Create a new Hive Job Flow
③ Wait for completion, examine results
Hive Job Flow vs. Custom Jar
 Both via the AWS Management Console
 elastic-mapreduce CLI also works
 “Code” (Hive script) pulled from S3
 Source data loaded from S3
 Results saved in S3
Setting Up the S3 Bucket
 One bucket can hold all elements for job flow
Hive script – aws-test-kk/script/wikipedia-authors.hql
Input data – aws-test-kk/data/enwiki-split.json
Results – aws-test-kk/hive-results/
Logs – aws-test-kk/logs/
 We can use AWS Console to create directories
 And upload files too
 Let’s go set up the bucket now…
Creating the Job Flow
 A Job Flow has many settings:
A user-friendly name (Wikipedia Authors)
The type of the job (Hive)
The type and number of servers (m1.small, 2 slaves)
The key pair to use (aws-test)
Where to put log files
And a few other less common settings
 Let’s go create a job flow…
Monitoring a Job
 AWS Console displays information about the job
 State – starting, running, shutting down
 Elapsed time – duration
 Normalized Instance Hours – cost
 You can also terminate a job
 Let’s go watch our job run…
Viewing Job Results
 My job puts its results into S3 (-outputdir s3n://xxx)
 The Hadoop cluster “goes away” at end of job
 So anything in HDFS will be tossed
 Persistent Job Flow doesn’t have this issue
 Hadoop writes job log files to S3
 Using location specified for job (aws-test-kk/logs/)
 Let’s go look at the job results…
Interactive vs. Batch Job Flow
 Batch works well for production
 But developing Hive scripts is often trial & error
 And you don’t want to pay the 10 second penalty
 Cluster launches, script fails, cluster terminates
 You pay for 1 hour * size of your cluster
 And you spend several minutes waiting…
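The arithmetic behind the "10 second penalty" is worth seeing once. This sketch assumes every started hour is billed in full for every server in the cluster, as the slide describes, and that all servers share one hourly rate. The function names are made up, and the 80 cents per hour is the Extra Large figure quoted earlier in the deck.

```python
import math


def billed_instance_hours(runtime_seconds: float, num_instances: int) -> int:
    """Each instance is charged for every started hour, with a minimum of one."""
    hours = max(1, math.ceil(runtime_seconds / 3600))
    return hours * num_instances


def run_cost_cents(runtime_seconds: float, num_instances: int,
                   cents_per_instance_hour: int) -> int:
    """Total cost in cents, assuming one rate for all instances."""
    return billed_instance_hours(runtime_seconds, num_instances) * cents_per_instance_hour


if __name__ == "__main__":
    # A script that fails after 10 seconds on a master plus 10 slaves.
    print(run_cost_cents(10, 11, 80), "cents")
```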
Interacting with Hive via CLI
 Create an EMR cluster that stays “alive”
 SSH into master node
 Use the Hive interpreter
 Set up your environment
 Interactively execute Hive queries
 Terminate the job flow
 Let’s give that a try…
Pig Job Flows
 Almost identical to Hive Job Flow:
 Interactive mode is used to develop the script
 Batch mode executes the script, loaded from S3
 Differences are:
 It’s a Pig Job Flow, not a Hive Job Flow
 The script file contains Pig Latin, not Hive QL
Hive Lab
Clicked Impressions Lab
 Lab covers running typical Hive job using EMR
 Read two JSON-format log files from S3
 Impressions (impressionId, requestBeginTime, etc.)
 Clicks (impressionId, etc.)
 Join input tables on impressionId
 Output table (Impressions fields plus “clicked” boolean)
 Date format conversion
 Partitioned by date & hour
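Before writing the Hive script, it can help to see the target logic in a few lines of ordinary code. The sketch below is plain Python, not Hive, and makes several assumptions not stated in the slides: each log line is one JSON object, requestBeginTime is epoch milliseconds, and the output partition columns are called dt and hour.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def clicked_impressions(impression_lines: Iterable[str],
                        click_lines: Iterable[str]) -> List[Dict]:
    """Keep every impression and add a 'clicked' flag (a left outer join on impressionId)."""
    clicked_ids = {json.loads(line)["impressionId"]
                   for line in click_lines if line.strip()}
    rows = []
    for line in impression_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        # ASSUMPTION: requestBeginTime is epoch milliseconds.
        begin = datetime.fromtimestamp(int(row["requestBeginTime"]) / 1000,
                                       tz=timezone.utc)
        row["clicked"] = row["impressionId"] in clicked_ids
        row["dt"] = begin.strftime("%Y-%m-%d")   # hypothetical partition column
        row["hour"] = begin.strftime("%H")       # hypothetical partition column
        rows.append(row)
    return rows


if __name__ == "__main__":
    impressions = ['{"impressionId": "a1", "requestBeginTime": "0"}',
                   '{"impressionId": "b2", "requestBeginTime": "3600000"}']
    clicks = ['{"impressionId": "b2"}']
    for row in clicked_impressions(impressions, clicks):
        print(row)
```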
Clicked Impressions Lab - Requirements
 You should already have your AWS account
 Download & expand the Clicked Impressions Lab
 http://xxx:
 Follow the instructions in the README file
 Located inside of expanded lab directory
 Let’s go do that now…
Advanced Elastic MapReduce
Bootstrap Actions
 Scripts that are run before starting Hadoop
 Altering the Hadoop configuration
 Installing additional software
 Scripts are loaded from S3
 Using s3n://<bucket>/<path> syntax
 Several built-in scripts
Configure Daemons
Configure Hadoop
Install Ganglia
Add swap file
Specifying Bootstrap Actions
 Via AWS Console
 Part of defining Job Flow
 Pick built-in or custom
 Via elastic-mapreduce
 --bootstrap-action <path to script in s3> --args <args>
 Multiple bootstrap actions are possible
configure-hadoop Bootstrap Action
 Most common action to use
 Tweak default settings of cluster
 E.g. increase io.sort.mb to reduce map task spills
 Can merge in xxx-site.xml file in S3
 -C <path to core-site.xml file>
 -H <path to hdfs-site.xml file>
 -M <path to mapred-site.xml file>
 File to be merged must contain appropriate params
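A file to be merged is just a standard Hadoop site file: a configuration element holding property entries with a name and a value. Here is a small sketch that generates one. The function name is made up, and uploading the result to S3 is left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_site_xml(params: Dict[str, str]) -> str:
    """Render parameters as a Hadoop *-site.xml document (no XML declaration)."""
    root = ET.Element("configuration")
    for name, value in params.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # A mapred-site.xml fragment that raises the sort buffer.
    print(build_site_xml({"io.sort.mb": "600"}))
```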
Setting params with configure-hadoop
 Specify individual Hadoop parameters to change
 Update to core-site.xml
 -c <key>=<value>
 Update to hdfs-site.xml
 -h <key>=<value>
 Update to mapred-site.xml
 -m <key>=<value>
 E.g. -m io.sort.mb=600
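When several parameters need changing, generating the flag list from dictionaries keeps it readable. This is a sketch with a made-up function name. It only produces the -c/-h/-m tokens from this slide. How those tokens are joined for the --args option is not covered here, so check the CLI documentation.

```python
from typing import Dict, List, Optional


def configure_hadoop_args(core: Optional[Dict[str, object]] = None,
                          hdfs: Optional[Dict[str, object]] = None,
                          mapred: Optional[Dict[str, object]] = None) -> List[str]:
    """Build '-c k=v', '-h k=v' and '-m k=v' tokens for configure-hadoop."""
    args: List[str] = []
    for flag, params in (("-c", core), ("-h", hdfs), ("-m", mapred)):
        for key, value in (params or {}).items():
            args += [flag, f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(configure_hadoop_args(mapred={"io.sort.mb": 600}))
```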
Spot Pricing
 You bid for servers
Specify your max rate per hour
Might not get servers if rate is too low
You pay the current spot rate, not your bid
Servers “go away” if spot rate > bid
 Typical spot price is 1/3 of on-demand price
 But prices can spike to > on-demand
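The bidding rules above are easy to simulate. This sketch walks an hourly price series, charges the spot rate (not the bid) for each hour, and stops when the spot rate goes above the bid. The prices are invented, the function name is hypothetical, and it simplifies by never re-acquiring servers once they are lost.

```python
from typing import List, Tuple


def simulate_spot(bid_cents: int, hourly_spot_cents: List[int]) -> Tuple[int, int]:
    """Return (hours_run, total_cost_cents) for one server."""
    hours_run = 0
    total_cost = 0
    for spot in hourly_spot_cents:
        if spot > bid_cents:
            break              # servers "go away" when spot rate > bid
        hours_run += 1
        total_cost += spot     # you pay the current spot rate, not your bid
    return hours_run, total_cost


if __name__ == "__main__":
    # Bid 30 cents/hour; the price spikes to 45 cents in the fourth hour.
    print(simulate_spot(30, [10, 12, 11, 45, 10]))
```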
When to Use Spot Pricing
 If you don’t care when the cluster dies
 Then use spot pricing for all slaves
 Best to use on-demand for master
 Save data processing checkpoints
 If you can’t have the cluster die
 Then use spot pricing for “task-only” slaves
 The “core” slaves run HDFS using on-demand
 More details on that in a bit
How to Use Spot Pricing
 Via AWS Console
 Via elastic-mapreduce
 --bid-price <hourly rate>
 Can bid separately on master, core, task groups
The Task Group
 Optional third group, beyond “master” and “core”
 Servers in cluster that only run TaskTracker
 Thus no HDFS data is stored
 Useful with spot pricing
 No data lost if they go away
 Some impact on efficiency of task-only slaves
 Also useful for dynamic cluster sizing
Specifying Task Groups
 Via the AWS Console
 Via elastic-mapreduce
 --instance-group task
Resizing Your Cluster
 Can’t be done via AWS Console
 You can add a task group
 --add-instance-group task <specify type, count, bid>
 You can change the # of servers
 --set-num-core-group-instances <new count>
 --set-num-task-group-instances <new count>
 But you can’t decrease the core group count
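A script that resizes clusters should refuse to shrink the core group before calling the CLI. The sketch below builds the command line using the two --set-num-… options from this slide. Selecting the job flow with -j is an assumption carried over from the other slides, and the function name is made up.

```python
from typing import List


def resize_cmd(job_flow_id: str, group: str,
               current_count: int, new_count: int) -> List[str]:
    """Build the resize command; core groups can grow but not shrink."""
    if group not in ("core", "task"):
        raise ValueError("group must be 'core' or 'task'")
    if group == "core" and new_count < current_count:
        raise ValueError("the core group count can't be decreased")
    return ["elastic-mapreduce", "-j", job_flow_id,
            f"--set-num-{group}-group-instances", str(new_count)]


if __name__ == "__main__":
    # Shrink the task group of a hypothetical job flow from 4 servers to 2.
    print(" ".join(resize_cmd("j-3MXSD6Q88CCDJ", "task", 4, 2)))
```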