ETL with Hadoop and MapReduce
Jan Pieter Posthuma – SQL Zaterdag, 17 November 2012

Agenda
  – Big Data
  – Why do we need it?
  – Hadoop
  – MapReduce
  – Pig and Hive
  – Demos

Expectations
  What to cover
    – Simple ideas of how to use MapReduce for ETL
    – Different ways to achieve this
  What not to cover
    – Best practices
    – Internals of Hadoop
    – Deep analysis of Big Data

Big Data
  Too much data to transform into insight in the traditional BI way

Why do we need it?
  Classical BI solution:
    Source (±10 GB) → Stage (±10 GB) → DWH (±10 GB) → Filter → Data mart (±100 MB) → Report (±10 KB)
    Σ ±30 GB
  Big Data is about reducing time to insight:
    – No ETL
    – No Cleansing
    – No Load
  ‘Analyze data when it arrives’

Hadoop
  – Replaces the need for additional Staging, DWH and ETL
    – Additional storage needed for highly unstructured data
  – Easy retrieval of (structured) data
    – Pig
    – Hive
    – Sqoop
    – ODBC for Hive
    – PolyBase (HDFS)

Big Data ecosystem
  [Diagram labels: BI tools (Reports, Excel, Dashboards); (Virtual) data marts; Hive & Pig; Sqoop; Map/Reduce; HDFS; Hadoop; Relational databases]

MapReduce
  Map function:
    var map = function (key, value, context) {}
  Reduce function:
    var reduce = function (key, values, context) {}

MapReduce
  – MapReduce.js is distributed to all nodes and scheduled multiple times
  – Each node runs var map = function (key, value, context) on its data segments
  – Map output passes through the context: context := (key, value)
  – var reduce = function (key, values, context) receives the values per key

Hive and Pig
  – Principle is the same: easy data retrieval
  – Both use MapReduce
  – Different founders: Facebook (Hive) and Yahoo (Pig)
  – Different languages: SQL-like (Hive) and more procedural (Pig)
  ‘Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs.
  The rest are HiveQL’

Demo
  – Query RDW data
  – RDW data with Hive table
  – Pig and MapReduce to get data from KNMI

KNMI.js 1/2
    // Map: skip comment lines (starting with '#') and rows with an empty
    // column 7; emit "col0-col1" as key and "col0,col1,col7" as value.
    var map = function (key, value, context) {
        if (value[0] != '#') {
            var allValues = value.split(',');
            if (allValues[7] != '') {
                context.write(allValues[0] + '-' + allValues[1],
                              allValues[0] + ',' + allValues[1] + ',' + allValues[7]);
            }
        }
    };

KNMI.js 2/2
    // Reduce: track the maximum and minimum of the third field per key.
    // Values are parsed as numbers, because raw strings compare
    // lexicographically (e.g. '85' > '102').
    var reduce = function (key, values, context) {
        var mMax = -9999;
        var mMin = 9999;
        var mKey = key.split('-');
        while (values.hasNext()) {
            var mValue = parseFloat(values.next().split(',')[2]);
            mMax = mValue > mMax ? mValue : mMax;
            mMin = mValue < mMin ? mValue : mMin;
        }
        context.write(key.trim(), mKey[0].toString() + '\t' + mKey[1].toString() + '\t' + mMax.toString() + '\t' + mMin.toString());
    };
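
Trying the KNMI functions outside the cluster
  The map and reduce functions above can be exercised without Hadoop. The sketch below is one possible way to do that in plain JavaScript (for example under Node.js); it is not part of the original demo. Everything in it is an assumption made for illustration: runLocally is a hypothetical helper, the context and iterator objects are minimal stand-ins for what the Hadoop JavaScript runtime passes in, and the sample rows are invented (comma-separated, measured value in column 7). The map and reduce shown are variants of KNMI.js that parse values as numbers.

```javascript
// Variant of the KNMI map function: skip '#' comment lines and rows
// whose column 7 is empty.
var map = function (key, value, context) {
  if (value[0] != '#') {
    var f = value.split(',');
    if (f[7] != '') {
      context.write(f[0] + '-' + f[1], f[0] + ',' + f[1] + ',' + f[7]);
    }
  }
};

// Variant of the KNMI reduce function: numeric max and min per key.
var reduce = function (key, values, context) {
  var max = -Infinity;
  var min = Infinity;
  var parts = key.split('-');
  while (values.hasNext()) {
    var v = parseFloat(values.next().split(',')[2]);
    if (v > max) { max = v; }
    if (v < min) { min = v; }
  }
  context.write(key.trim(), parts[0] + '\t' + parts[1] + '\t' + max + '\t' + min);
};

// Hypothetical local driver (not a Hadoop API): runs map over every line,
// groups the emitted values by key, then runs reduce once per key.
function runLocally(mapFn, reduceFn, lines) {
  var groups = {};
  var mapContext = {
    write: function (key, value) {
      if (!groups[key]) { groups[key] = []; }
      groups[key].push(value);
    }
  };
  lines.forEach(function (line, index) { mapFn(index, line, mapContext); });

  var output = [];
  var reduceContext = {
    write: function (key, value) { output.push(key + '\t' + value); }
  };
  Object.keys(groups).sort().forEach(function (key) {
    var vals = groups[key];
    var pos = 0;
    // Minimal imitation of the Java-style iterator that reduce expects.
    var iterator = {
      hasNext: function () { return pos < vals.length; },
      next: function () { return vals[pos++]; }
    };
    reduceFn(key, iterator, reduceContext);
  });
  return output;
}

// Invented sample rows; the last '260' row has an empty column 7.
var sampleLines = [
  '# col0,col1,col2,col3,col4,col5,col6,col7',
  '260,20121117,1,200,30,30,50,85',
  '260,20121117,2,200,30,30,50,102',
  '260,20121117,3,200,30,30,50,-5',
  '260,20121117,4,200,30,30,50,',
  '270,20121117,1,180,20,20,40,9'
];

var result = runLocally(map, reduce, sampleLines);
result.forEach(function (line) { console.log(line); });
```

  The sample deliberately mixes two- and three-digit values: compared as strings, '85' would rank above '102', which is why this variant parses each value before comparing.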