THE ELEPHANT IN THE ROOM
SAS & HADOOP

Copyright © 2015, SAS Institute Inc. All rights reserved.

ABOUT THE PRESENTER
• Jim Watson
• SAS Education, Canberra
• Background in SAS programming, SQL programming, database processing, grid processing, et al.
• With SAS since 1999

LIST OF TOPICS
• What is Hadoop?
• How SAS integrates with Hadoop
• HDFS
• LIBNAME engine
• Explicit pass-through
• High Performance Analytics

WHAT IS HADOOP?
• Apache Hadoop is an open-source software framework
• Written in Java
• For distributed storage and processing of very large datasets on computer clusters
• Built from commodity hardware

ADVANTAGES OF HADOOP
Some characteristics of Hadoop include:
• Open source
• Simple-to-use distributed file system
• Supports highly parallel processing
• Scalable, so it is suitable for massive amounts of data
• Designed to work on low-cost hardware
• Fault tolerant (redundant) at the data level
  • Automatic replication of data
  • Automatic fail-over

HADOOP FUNDAMENTALS
• HDFS ("Hadoop Distributed File System")
  • Files are distributed across the Hadoop cluster
• Hadoop YARN
  • A framework for job scheduling and cluster resource management
• MapReduce
  • Files are processed locally and in parallel
  • Based on YARN

These modules handle reading, writing and processing large files in a distributed environment. This allows the cluster to be exploited as if it were a single, massively powerful server.
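To make "distributed files with automatic replication" concrete, here is a toy sketch in Python. It is not the real HDFS placement policy: the block size, the replication factor, the node names and the round-robin placement are all assumptions chosen for illustration. It shows only the idea that a file is cut into blocks, each block is copied to several nodes, and the file therefore stays readable when one node fails.

```python
# Toy model of block distribution with replication.
# Assumption: this is NOT the real HDFS placement policy, only an illustration.
BLOCK_SIZE = 128      # MB; assumed block size
REPLICATION = 3       # assumed number of copies kept of each block
NODES = ["datanode1", "datanode2", "datanode3", "datanode4"]  # hypothetical names


def place_blocks(file_size_mb):
    """Split a file into blocks and assign each block to REPLICATION distinct nodes."""
    n_blocks = -(-file_size_mb // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(n_blocks):
        # Simple round-robin placement; consecutive nodes are always distinct
        # because REPLICATION <= len(NODES).
        placement[b] = [NODES[(b + r) % len(NODES)] for r in range(REPLICATION)]
    return placement


def readable_after_failure(placement, failed_node):
    """A block stays readable if at least one replica is on a surviving node."""
    return all(
        any(node != failed_node for node in nodes)
        for nodes in placement.values()
    )


placement = place_blocks(500)  # a 500 MB file needs 4 blocks of 128 MB
print(len(placement))
print(readable_after_failure(placement, "datanode2"))
```

With three copies of every block, losing any single node leaves at least two replicas of each block, which is the data-level redundancy the slides describe.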

HADOOP DISTRIBUTED FILE SYSTEM
• HDFS is hierarchical, with Linux-style paths, file ownership and permissions.
• hadoop fs commands are similar to Linux commands.
• HDFS is not built into the operating system.
• Files are append-only after they are written.

  $ hadoop fs -ls /user/student1
  Found 4 items
  drwxr-xr-x   - student1 sasapp   0 2014-05-30 20:00 /user/student1/.Trash
  drwx------   - student1 sasapp   0 2014-05-30 10:05 /user/student1/.stage
  drwxr-xr-x   - student1 sasapp   0 2014-05-28 15:25 /user/student1/data
  drwxr-xr-x   - student1 sasapp   0 2014-05-28 13:59 /user/student1/users
  $ hadoop fs -mkdir /user/student1/newdir
  $

MAPREDUCE
• MapReduce is a framework written in Java that is built into Hadoop. It automates the distributed processing of data files.

  map                processing of individual rows (filtering, row calculations)
  shuffle and sort   grouping rows for summarisation
  reduce             summary calculations within groups

The MapReduce framework coordinates multiple mapping, sorting and reducing tasks that execute in parallel across the computer cluster.

WHAT'S INSIDE
[Diagram components: SAS Client, SAS metadata server, SAS workspace server; Hadoop NameNode with Hive; Hadoop DataNode 1, Hadoop DataNode 2, Hadoop DataNode 3]

PARALLEL PROCESSING EXAMPLE
• A MapReduce example: summarise a detailed order table to derive total revenue by state. The table is already distributed in HDFS.

  Input (order detail)       Output (summary)
  id   st   rev              st   totrev
  1    NC   10               GA   65
  2    GA   12               NC   39
  3    VA   8                VA   30
  4    NC   9                ...  ...
  5    VA   22
  6    NC   18
  7    NC   2
  8    GA   53
  ...  ...
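The summarisation in the example above can be sketched outside Hadoop as three plain Python functions, one per phase. This is a single-process illustration only: the split of the rows into two blocks and all function names are hypothetical, and a real MapReduce job would run the map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict

# Order rows from the example table: (id, state, revenue).
orders = [
    (1, "NC", 10), (2, "GA", 12), (3, "VA", 8), (4, "NC", 9),
    (5, "VA", 22), (6, "NC", 18), (7, "NC", 2), (8, "GA", 53),
]

# Hypothetical split into two file blocks, standing in for HDFS blocks.
blocks = [orders[:4], orders[4:]]


def map_block(block):
    """Map: emit one (state, revenue) pair for each row of a block."""
    return [(st, rev) for _id, st, rev in block]


def shuffle(mapped_blocks):
    """Shuffle and sort: group the emitted revenues by state, sorted by state."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for st, rev in pairs:
            groups[st].append(rev)
    return dict(sorted(groups.items()))


def reduce_group(st, revs):
    """Reduce: summarise one group into (state, total revenue)."""
    return st, sum(revs)


mapped = [map_block(b) for b in blocks]   # one map call per block
grouped = shuffle(mapped)
totrev = dict(reduce_group(st, revs) for st, revs in grouped.items())
print(totrev)  # {'GA': 65, 'NC': 39, 'VA': 30}
```

The map step never looks beyond its own block, which is what allows each block to be processed where it is stored; only the shuffle moves data between tasks.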

PARALLEL PROCESSING EXAMPLE

  File blocks (input)
    Block 1              Block 2
    id   st    rev       id   st    rev
    1    NSW   10        5    VIC   22
    2    QLD   12        6    NSW   18
    3    VIC   8         7    NSW   2
    4    NSW   9         8    QLD   53
    ...
    Block n

  map (per block)
    Block 1              Block 2
    st    rev   ct       st    rev   ct
    NSW   10    1        VIC   22    1
    QLD   12    1        NSW   18    1
    VIC   8     1        NSW   2     1
    NSW   9     1        QLD   53    1

  shuffle (rows grouped by st)
    st    rev   ct       st    rev   ct
    NSW   10    1        VIC   8     1
    NSW   9     1        VIC   22    1
    NSW   18    1        ...
    NSW   2     1

  reduce (one summary row per group, written to output)
    st    totrev
    NSW   39
    VIC   30
    ...

"PIG" & "HIVE"
Pig
  A platform for data analysis that includes stepwise procedural programming that converts to MapReduce.
Hive
  A data warehousing framework to query and manage large data sets stored in Hadoop. It provides a mechanism to structure the data and to query it using an SQL-like language called HiveQL. Most HiveQL queries are compiled into MapReduce programs.

Pig and Hive provide simpler, higher-level programming methods for parallel processing of Hadoop data files.

THE HADOOP ECOSYSTEM
The Apache Hadoop core technologies of HDFS, YARN and MapReduce, along with additional projects including Pig, Hive and others, are collectively called the Hadoop ecosystem.

EXPLOITING THE HDFS
• The Hadoop FILENAME engine
  • Upload local data to Hadoop
  • Read data from Hadoop
  • Use normal SAS PROC & DATA steps
• PROC HADOOP
  • Submit HDFS commands
  • Submit MapReduce & Pig programs

THE FILENAME STATEMENT & HDFS

  filename hadconfg "/workshop/hadoop_config.xml";
  filename mapres hadoop "/user/&std/data/mapoutput"
           concat cfg=hadconfg user="&std";

  data work.commonwords;
     infile mapres dlm='09'x;
     input word $ count;
     …
  run;

PROC HADOOP
• PROC HADOOP submits
  • Hadoop file system (HDFS) commands
  • MapReduce programs
  • Pig language code

  PROC HADOOP <Hadoop-server-option(s)>;
     HDFS <Hadoop-server-option(s)> <hdfs-command-option(s)>;
     MAPREDUCE <Hadoop-server-option(s)> <mapreduce-option(s)>;
     PIG <Hadoop-server-option(s)> <pig-code-option(s)>;
     PROPERTIES <configuration-properties>;
  RUN;

PROC HADOOP: HDFS STATEMENTS

  HDFS COPYFROMLOCAL='local-file' OUT='output-location' <DELETESOURCE> <OVERWRITE>;
  HDFS COPYTOLOCAL='HDFS-file' OUT='output-location' <DELETESOURCE> <OVERWRITE> <KEEPCRC>;
  HDFS DELETE='HDFS-file' <NOWARN>;
  HDFS MKDIR='HDFS-path';
  HDFS RENAME='HDFS-file' OUT='new-name';

ACCESS HIVE TABLES VIA SAS
Two main methods to exploit Hadoop Hive tables in SAS:
• The LIBNAME engine (aka "implicit pass-through")
  • Assign a libref to Hive and use SAS code on the libref
  • SAS code is automatically converted to HiveQL
• Explicit pass-through
  • Hive code is embedded in SAS code and is submitted verbatim to Hadoop

THE HADOOP LIBNAME ENGINE

  LIBNAME libref engine-name <connection options> <LIBNAME-options>;

  libname hivedb hadoop server=namenode
          subprotocol=hive2
          port=10000 schema=diacchad
          user=studentX pw=StudentX;

SAS log:

  23   libname hivedb hadoop server=namenode
  24           subprotocol=hive2
  25           port=10000 schema=diacchad
  26           user="&std" pw="&stdpw";
  NOTE: Libref HIVEDB was successfully assigned as follows:
        Engine:        HADOOP
        Physical Name: jdbc:hive2://namenode:10000/diacchad

LIBNAME ENGINE EXAMPLE

  options sastrace=',,,d' sastraceloc=saslog nostsuffix;

  proc means data=hivedb.order_fact sum mean;
     var total_retail_price;
  run;

  proc freq data=hivedb.order_fact;
     tables order_type;
  run;

  options sastrace=off;

LIBNAME ENGINE EXAMPLE
SAS log:

  NOTE: SQL generation will be used to perform the initial summarization.
  HADOOP_41: Executed: on connection 7
  select T1.ZSQL1, T1.ZSQL2, T1.ZSQL3, T1.ZSQL4
  from ( select COUNT(*) as ZSQL1, COUNT(*) as ZSQL2,
                COUNT(TXT_1.`total_retail_price`) as ZSQL3,
                SUM(TXT_1.`total_retail_price`) as ZSQL4
         from `ORDER_FACT` TXT_1 ) T1
  where T1.ZSQL1 > 0

  ACCESS ENGINE: SQL statement was passed to the DBMS for fetching data.

EXPLICIT PASS THROUGH

  proc sql;
     connect to hadoop (server=namenode subprotocol=hive2
                        schema=diacchad user="&std");
     select * from connection to hadoop
        (select employee_name, salary
         from salesstaff
         where emp_hire_date between '2011-01-01' and '2011-12-31'
        );
     disconnect from hadoop;
  quit;

HIGH PERFORMANCE ANALYTICS

  Interface: High-Performance Analytics procedures
  Purpose:   Perform complex analytical computations on Hadoop tables within the data nodes of the Hadoop distribution via SAS procedure language. HPDS2 allows for manipulation of data structure (column derivation).
  Product:   SAS High-Performance Analytics solutions

  Interface: SAS Visual Analytics
  Purpose:   A web interface to generate graphical visualizations of data distributions and relationships on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
  Product:   SAS Visual Analytics

  Interface: PROC IMSTAT
  Purpose:   A programming interface to perform complex analytical calculations on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
  Product:   SAS In-Memory Statistics

  Interface: DS2
  Purpose:   A SAS proprietary language for table manipulation that translates to database language and executes in parallel in the data nodes of a distributed database.
  Product:   SAS In-Database Code Accelerators; Data Loader for Hadoop

HIGH PERFORMANCE ANALYTICS
[Diagram components: SAS Client, SAS metadata server, SAS workspace server; SAS High Performance Analytics Root Node on the Hadoop NameNode with Hive; a SAS High Performance Analytics Worker Node on each of Hadoop DataNode 1, Hadoop DataNode 2 and Hadoop DataNode 3]
SAS processes in each HDFS data node execute in parallel.

LEARNING MORE
• SAS Website
• SAS Education
  • Introduction to SAS & Hadoop
    • 2-day course requiring some SAS programming & SQL knowledge
  • DS2 Programming: Essentials
    • 2-day course, requires intermediate SAS programming knowledge
  • DS2 Programming Essentials with Hadoop
    • 1½-day course, requires intermediate SAS programming knowledge