What is Hadoop?

THE ELEPHANT IN THE ROOM
SAS & HADOOP
Copyright © 2015, SAS Institute Inc. All rights reserved.
SAS & HADOOP ABOUT THE PRESENTER
• Jim Watson
• SAS Education, Canberra
• Background in SAS Programming, SQL programming, Database Processing, Grid Processing, et al
• With SAS since 1999
SAS & HADOOP LIST OF TOPICS
• What is Hadoop?
• How SAS integrates with Hadoop
  • HDFS
  • LIBNAME Engine
  • Explicit Pass-through
  • High Performance Analytics
SAS & HADOOP WHAT IS HADOOP?
• Apache Hadoop is an Open Source Software Framework
• Written in Java
• For Distributed Storage and processing of very large datasets on computer clusters
• Built from Commodity Hardware
SAS & HADOOP ADVANTAGES OF HADOOP
• Some characteristics of Hadoop include:
  • Open-source
  • Simple to use distributed file system
  • Supports highly parallel processing
  • It’s scalable, so it’s suitable for massive amounts of data
  • It is designed to work on low-cost hardware
  • It’s fault tolerant (redundant) at the data level
    • automatic replication of data
    • automatic fail-over
SAS & HADOOP HADOOP FUNDAMENTALS
• HDFS – “Hadoop Distributed File System”
  • Files are distributed across the Hadoop cluster
• Hadoop YARN
  • A framework for job scheduling and cluster resource management
• MapReduce
  • Files are processed locally and in parallel
  • Based on YARN
These modules handle reading, writing and processing large files in a distributed environment. This allows the data to be exploited as if it were held on a single, massively powerful server.
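To make "files are distributed across the cluster" and "automatic replication" concrete, here is a toy Python model. It is only a sketch under stated assumptions: the node names, the 4-row block size, the replication factor of 2 and the round-robin placement are all invented for illustration. Real HDFS works on large byte blocks and uses its own placement policy.

```python
def split_into_blocks(rows, block_size):
    """Cut a file (modelled as a list of rows) into fixed-size blocks."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin (an assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def survives_failure(placement, failed_node):
    """True if every block still has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


NODES = ["datanode1", "datanode2", "datanode3"]  # hypothetical node names
ROWS = list(range(1, 11))                        # a 10-row stand-in for a file

BLOCKS = split_into_blocks(ROWS, block_size=4)   # 3 blocks: 4 + 4 + 2 rows
PLACEMENT = place_replicas(len(BLOCKS), NODES, replication=2)

if __name__ == "__main__":
    for block_id, replicas in PLACEMENT.items():
        print(block_id, BLOCKS[block_id], replicas)
    # Every block stays readable whichever single node is lost.
    print(all(survives_failure(PLACEMENT, n) for n in NODES))
```

Because each block lives on two different nodes in this model, losing any one node leaves at least one copy of every block, which is the sense in which the data level is redundant.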
SAS & HADOOP HADOOP DISTRIBUTED FILE SYSTEM
• HDFS is hierarchical, with Linux-style paths, file ownership and permissions.
• hadoop fs commands are similar to Linux commands.
• HDFS is not built into the operating system.
• Files are append-only after they are written.
$ hadoop fs -ls /user/student1
Found 4 items
drwxr-xr-x   - student1 sasapp   0 2014-05-30 20:00 /user/student1/.Trash
drwx------   - student1 sasapp   0 2014-05-30 10:05 /user/student1/.stage
drwxr-xr-x   - student1 sasapp   0 2014-05-28 15:25 /user/student1/data
drwxr-xr-x   - student1 sasapp   0 2014-05-28 13:59 /user/student1/users
$ hadoop fs -mkdir /user/student1/newdir
$
SAS & HADOOP MAPREDUCE
• MapReduce is a framework written in Java that is built into Hadoop. It automates the distributed processing of data files.
  map: processing of individual rows (filtering, row calculations)
  shuffle and sort: grouping rows for summarisation
  reduce: summary calculations within groups
The MapReduce framework coordinates multiple mapping, sorting, and reducing tasks that execute in parallel across the computer cluster.
SAS & HADOOP WHAT’S INSIDE
[Architecture diagram. Components shown: SAS Client; SAS metadata server; SAS workspace server; Hadoop NameNode with Hive; Hadoop DataNode 1, DataNode 2 and DataNode 3.]
SAS & HADOOP PARALLEL PROCESSING EXAMPLE
• A MapReduce Example: Summarise a detailed order table to derive total revenue by state. The table is already distributed in HDFS.

Input table:
id   st   rev
1    NC   10
2    GA   12
3    VA   8
4    NC   9
5    VA   22
6    NC   18
7    NC   2
8    GA   53
...  ...  ...

Result:
st   totrev
GA   65
NC   39
VA   30
...  ...
SAS & HADOOP PARALLEL PROCESSING EXAMPLE
[Diagram: file blocks → map → shuffle → reduce → output]

File blocks (input rows):
Block 1              Block 2
id   st    rev       id   st    rev
1    NSW   10        5    VIC   22
2    QLD   12        6    NSW   18
3    VIC   8         7    NSW   2
4    NSW   9         8    QLD   53
... Block n

map (one output row per input row, within each block):
Block 1              Block 2
st    rev   ct       st    rev   ct
NSW   10    1        VIC   22    1
QLD   12    1        NSW   18    1
VIC   8     1        NSW   2     1
NSW   9     1        QLD   53    1

shuffle (rows grouped by st):
st    rev   ct       st    rev   ct
NSW   10    1        VIC   8     1
NSW   9     1        VIC   22    1
NSW   18    1
NSW   2     1
...

reduce (one summary row per group, written to output):
st    totrev
NSW   39
VIC   30
...
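The map, shuffle, and reduce stages of the example above can be sketched in a few lines of plain Python. This is an illustration of the data flow only, not Hadoop's Java MapReduce API. The function names are hypothetical, the two blocks are processed one after another rather than in parallel, and the ct column is left out because only the revenue total is computed.

```python
from collections import defaultdict

# Two file blocks of (id, st, rev) rows, matching the example data.
BLOCKS = [
    [(1, "NSW", 10), (2, "QLD", 12), (3, "VIC", 8), (4, "NSW", 9)],
    [(5, "VIC", 22), (6, "NSW", 18), (7, "NSW", 2), (8, "QLD", 53)],
]


def map_block(block):
    """Map: emit one (st, rev) pair per row; each block is handled independently."""
    return [(st, rev) for _id, st, rev in block]


def shuffle(mapped_blocks):
    """Shuffle and sort: gather the values for each key across all map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for st, rev in pairs:
            groups[st].append(rev)
    return dict(sorted(groups.items()))


def reduce_group(st, revs):
    """Reduce: one summary row per group."""
    return st, sum(revs)


def total_revenue_by_state(blocks):
    mapped = [map_block(block) for block in blocks]   # parallel on a real cluster
    grouped = shuffle(mapped)
    return dict(reduce_group(st, revs) for st, revs in grouped.items())


if __name__ == "__main__":
    print(total_revenue_by_state(BLOCKS))
    # {'NSW': 39, 'QLD': 65, 'VIC': 30}
```

The map step needs only the rows of its own block, which is why it can run where the block is stored. Only the shuffle moves data between nodes.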
SAS & HADOOP “PIG” & “HIVE”
Pig
A platform for data analysis whose stepwise, procedural programs are converted to MapReduce.
Hive
A data warehousing framework to query and manage large data sets stored in Hadoop. It provides a mechanism to structure the data and to query it using an SQL-like language called HiveQL. Most HiveQL queries are compiled into MapReduce programs.
Pig and Hive provide simpler, higher-level programming methods for parallel processing of Hadoop data files.
SAS & HADOOP THE HADOOP ECOSYSTEM
The Apache Hadoop core technologies of HDFS, YARN, and MapReduce, along with additional projects including Pig, Hive, and others, are collectively called the Hadoop ecosystem.
SAS & HADOOP EXPLOITING THE HDFS
• The Hadoop FILENAME engine
  • Upload local data to Hadoop
  • Read data from Hadoop
  • Use normal SAS PROC & DATA Steps
• PROC HADOOP
  • Submit HDFS Commands
  • Submit MapReduce & PIG programs
SAS & HADOOP THE FILENAME STATEMENT & HDFS
filename hadconfg "/workshop/hadoop_config.xml";
filename mapres hadoop "/user/&std/data/mapoutput" concat
         cfg=hadconfg user="&std";
data work.commonwords;
   infile mapres dlm='09'x;
   input word $ count;
   …
run;
SAS & HADOOP PROC HADOOP
• PROC HADOOP submits
  • Hadoop file system (HDFS) commands
  • MapReduce programs
  • PIG language code
PROC HADOOP <Hadoop-server-option(s)>;
   HDFS <Hadoop-server-option(s)> <hdfs-command-option(s)>;
   MAPREDUCE <Hadoop-server-option(s)> <mapreduce-option(s)>;
   PIG <Hadoop-server-option(s)> <pig-code-option(s)>;
   PROPERTIES <configuration-properties>;
RUN;
SAS & HADOOP PROC HADOOP – HDFS STATEMENTS
HDFS COPYFROMLOCAL='local-file' OUT='output-location'
     <DELETESOURCE> <OVERWRITE>;
HDFS COPYTOLOCAL='HDFS-file' OUT='output-location'
     <DELETESOURCE> <OVERWRITE> <KEEPCRC>;
HDFS DELETE='HDFS-file' <NOWARN>;
HDFS MKDIR='HDFS-path';
HDFS RENAME='HDFS-file' OUT='new-name';
SAS & HADOOP ACCESS HIVE TABLES VIA SAS
Two main methods to exploit Hadoop Hive tables in SAS:
• The LIBNAME Engine (aka “Implicit Pass Through”)
  • Assign a LIBREF to Hive and run SAS code against the LIBREF
  • SAS code is automatically converted to HiveQL
• Explicit Pass Through
  • Hive code is embedded in SAS code and is submitted verbatim to Hadoop
SAS & HADOOP THE HADOOP LIBNAME ENGINE
libname hivedb hadoop server=namenode
        subprotocol=hive2
        port=10000 schema=diacchad
        user=studentX pw=StudentX;

LIBNAME libref engine-name <connection options> <LIBNAME-options>;

23   libname hivedb hadoop server=namenode
24           subprotocol=hive2
25           port=10000 schema=diacchad
26           user="&std" pw="&stdpw";
NOTE: Libref HIVEDB was successfully assigned as follows:
      Engine:        HADOOP
      Physical Name: jdbc:hive2://namenode:10000/diacchad
SAS & HADOOP LIBNAME ENGINE EXAMPLE
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
proc means data=hivedb.order_fact sum mean;
   var total_retail_price;
run;
proc freq data=hivedb.order_fact;
   tables order_type;
run;
options sastrace=off;
SAS & HADOOP LIBNAME ENGINE EXAMPLE
NOTE: SQL generation will be used to perform the initial summarization.
HADOOP_41: Executed: on connection 7
select T1.ZSQL1, T1.ZSQL2, T1.ZSQL3, T1.ZSQL4 from
   ( select COUNT(*) as ZSQL1, COUNT(*) as ZSQL2,
            COUNT(TXT_1.`total_retail_price`) as ZSQL3,
            SUM(TXT_1.`total_retail_price`) as ZSQL4
     from `ORDER_FACT` TXT_1 ) T1
where T1.ZSQL1 > 0
ACCESS ENGINE: SQL statement was passed to the DBMS for fetching data.
SAS & HADOOP EXPLICIT PASS THROUGH
proc sql;
   connect to hadoop
      (server=namenode subprotocol=hive2
       schema=diacchad user="&std");
   select *
      from connection to hadoop
         (select employee_name, salary
             from salesstaff
             where emp_hire_date between
                   '2011-01-01' and '2011-12-31'
         );
   disconnect from hadoop;
quit;
SAS & HADOOP HIGH PERFORMANCE ANALYTICS
Interface: High-Performance Analytics Procedures
Purpose: Perform complex analytical computations on Hadoop tables within the data nodes of the Hadoop distribution via SAS procedure language. HPDS2 allows for manipulation of data structure (column derivation).
Product: SAS High-Performance Analytics Solutions

Interface: SAS Visual Analytics
Purpose: A web interface to generate graphical visualizations of data distributions and relationships on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
Product: SAS Visual Analytics

Interface: PROC IMSTAT
Purpose: A programming interface to perform complex analytical calculations on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
Product: SAS In-Memory Statistics

Interface: DS2
Purpose: A SAS proprietary language for table manipulation that translates to database language and executes in parallel in the data nodes of a distributed database.
Product: SAS In-Database Code Accelerators; Data Loader for Hadoop
SAS & HADOOP HIGH PERFORMANCE ANALYTICS
SAS processes in each HDFS data node execute in parallel.
[Architecture diagram. Components shown: SAS Client; SAS metadata server; SAS workspace server; SAS High-Performance Analytics Root Node; Hadoop NameNode with Hive; three SAS High-Performance Analytics Worker Nodes; Hadoop DataNode 1, DataNode 2 and DataNode 3.]
SAS & HADOOP LEARNING MORE
• SAS Website
• SAS Education
  • Introduction to SAS & Hadoop
    • 2-day course, requires some SAS Programming & SQL knowledge
  • DS2 Programming: Essentials
    • 2-day course, requires intermediate SAS Programming knowledge
  • DS2 Programming Essentials with Hadoop
    • 1½-day course, requires intermediate SAS Programming knowledge