Learning R Series Session 6: Oracle R Connector for Hadoop 2.0
Mark Hornick, Senior Manager, Development
Oracle Advanced Analytics
©2013 Oracle – All Rights Reserved

Learning R Series 2012
Session 1  Introduction to Oracle's R Technologies and Oracle R Enterprise 1.3
Session 2  Oracle R Enterprise 1.3 Transparency Layer
Session 3  Oracle R Enterprise 1.3 Embedded R Execution
Session 4  Oracle R Enterprise 1.3 Predictive Analytics
Session 5  Oracle R Enterprise 1.3 Integrating R Results and Images with OBIEE Dashboards
Session 6  Oracle R Connector for Hadoop 2.0 – New Features and Use Cases

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remain at the sole discretion of Oracle.

Topics
• What is Hadoop?
• Oracle R Connector for Hadoop
• Predictive Analytics on Hadoop
• ORCHhive
• Comparison of RHIPE with ORCH
• Summary

What is Hadoop?
• Scalable, fault-tolerant distributed system for data storage and processing
• Enables analysis of Big Data
  – Can store huge volumes of unstructured data, e.g., weblogs, transaction data, social media data
  – Enables massive data aggregation
  – Highly scalable and robust
  – Problems move from processor bound (small data, complex computations) to data bound (huge data, often simple computations)
• Originally sponsored by Yahoo!; now an Apache project
  – Open source under the Apache license
  – Commercial distributions available, e.g., from Cloudera
• Based on Google's GFS and BigTable whitepapers (2006)

What is Hadoop?
• Consists of two key services
  – Hadoop Distributed File System (HDFS)
  – MapReduce
• Other projects based on core Hadoop
  – Hive, Pig, HBase, Flume, Oozie, Sqoop, and others

Classic Hadoop-type problems
• Modeling true risk
• Customer churn analysis
• Recommendation engine
• Ad targeting
• PoS transaction analysis
• Analyzing network data to predict failure
• Threat analysis
• Trade surveillance
• Search quality
• Data "sandbox"
http://www.slideshare.net/cloudera/20100806-cloudera-10-hadoopable-problems-webinar-4931616

Types of analysis using Hadoop
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment

Hadoop Publicized Examples
• Lineberger Comprehensive Cancer Center
  – Analyzes Next Generation Sequence data for the Cancer Genome Atlas
• A9.com (Amazon)
  – Product search indices
• Adobe
  – Social services & structured data store
• NAVTEQ Media Solutions
  – Optimizes ad selection based on user interactions
• eBay
  – Search optimization & research
• Facebook
  – User growth, page views, ad campaign analysis
• Twitter
  – Stores and processes Tweets
• Yahoo!
  – Research into ad systems & web search
• Journey Dynamics
  – Forecasts traffic speeds from GPS data

Hadoop for data-bound problems: examples
• Facebook – over 70 PB of data, 3,000+ nodes, unified storage, uses Hive extensively
• eBay – over 5 PB of data, 500+ nodes
  – Clickstream data, enterprise data warehouses, product descriptions, and images
  – Produces search indices, analytics reports, and mining models
  – Enhances search relevance for eBay's items; uses Hadoop to build ranking functions that take multiple factors into account, such as price, listing format, seller track record, and relevance
  – Hadoop enables adding new factors to see whether they improve overall search relevance
• Tracing fraud backward
  – Store more of the data to track fraud more easily
  – Generate terabytes per hour and keep it online for analysis
• Characteristics
  – Complex data from multiple data sources, with data being generated at terabytes per day
  – Batch processing in parallel, with computation taken to the data
Note: Moving 100 GB of data can take well over 20 minutes

Key features of Hadoop
• Support for partial failures
• Data recoverability
• Component recovery
• Consistency
• Scalability
• Applications written in a high-level language
• Shared-nothing architecture
• Computation occurs where the data reside, whenever possible

MapReduce
• Provides parallelization and distribution with fault tolerance
• MapReduce programs provide access to data on Hadoop
• "Map" phase
  – A map task typically operates on one HDFS block of data
  – Map tasks process smaller problems, store results in HDFS, and report success to the jobtracker
• "Reduce" phase
  – A reduce task receives sorted subsets of the map task results
  – One or more reducers compute answers to form the final answer
  – Final results are stored in HDFS
• Computational processing can occur on unstructured or structured data
• Abstracts all "housekeeping" away from the developer

Map Reduce Example – Graphically Speaking
[Diagram: map tasks read (key, value) input from HDFS DataNodes and emit intermediate (key, value) pairs; the shuffle-and-sort phase aggregates intermediate values by output key; one reduce task per key produces the final values for keys A, B, and C.]

Text analysis example
• Count the number of times each word occurs in a corpus of documents
• Documents are divided into blocks in HDFS; one mapper runs per block
• Each mapper outputs a (word, 1) pair every time a word is encountered, e.g.:
  Key      Value
  The      1
  Big      1
  Data     1
  word     1
  count    1
  example  1
  Big      1
  ...
• After the shuffle and sort, one or more reducers combine the results; one reducer receives only the key-value pairs for the word "Big", sums up the counts, and outputs the final key-value result:
  Key   Value
  Big   2040
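To make the map / shuffle-and-sort / reduce steps concrete for R users, here is a purely local base-R illustration of the same word count. This is not ORCH or Hadoop code; the two-element `docs` vector simply stands in for two HDFS blocks:

docs <- c("Big Data is Big", "Big Data example")   # two "blocks" of input text

# Map: for every block, emit a (word, 1) pair for each word
pairs <- lapply(docs, function(block) {
  words <- unlist(strsplit(block, " "))
  data.frame(key = words, val = 1, stringsAsFactors = FALSE)
})
mapped <- do.call(rbind, pairs)

# Shuffle and sort: group the intermediate values by key
grouped <- split(mapped$val, mapped$key)

# Reduce: sum the counts for each word
counts <- sapply(grouped, sum)
counts["Big"]    # 3 occurrences of "Big" across both blocks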
Map Reduce Example – Graphically Speaking, for "Word Count"
[Diagram: the same map / shuffle-and-sort / reduce flow applied to word count. There is no key, only a value, as input to each mapper; the mapper output is a set of key-value pairs where the key is the word and the value is the count (1). Each reducer receives, for each word key, a set of counts, and outputs the word as key and the sum as value.]

Oracle R Connector for Hadoop

Mapper and reducer code in ORCH for "Word Count"

corpus <- scan("corpus.dat", what=" ", quiet=TRUE, sep="\n")
corpus <- gsub("([/\\\":,#.@-])", " ", corpus)
input  <- hdfs.put(corpus)            # load the R object into HDFS

res <- hadoop.exec(dfs.id = input,    # specify and invoke the map-reduce job
  mapper = function(k, v) {
    # split words and output each word with a count of 1
    x <- strsplit(v[[1]], " ")[[1]]
    x <- x[x != '']
    out <- NULL
    for (i in 1:length(x)) out <- c(out, orch.keyval(x[i], 1))
    out
  },
  reducer = function(k, vv) {
    # sum the count of each word
    orch.keyval(k, sum(unlist(vv)))
  },
  config = new("mapred.config",
    job.name      = "wordcount",
    map.output    = data.frame(key='', val=0),
    reduce.output = data.frame(key='', val=0)
  )
)
res
hdfs.get(res)

Oracle R Connector for Hadoop and the Big Data Appliance
[Diagram: an R client (Oracle R Distribution plus CRAN packages) submits MapReduce jobs to the Hadoop cluster on the Big Data Appliance; mappers and reducers execute in R on the MapReduce nodes against HDFS-resident data, with Sqoop moving data to and from Oracle Database.]
• Provides transparent access to the Hadoop cluster: MapReduce and HDFS-resident data
• Access and manipulate data in HDFS, database, and file system – all from R
• Write MapReduce functions using R and execute them through a natural R interface
• Leverage CRAN R packages to work on HDFS-resident data
• Transition work from the lab to production deployment on a Hadoop cluster without requiring knowledge of Hadoop internals, the Hadoop CLI, or IT infrastructure

Oracle R Distribution
• Ability to dynamically load Intel Math Kernel Library (MKL), AMD Core Math Library (ACML), and the Sun Performance Library on Solaris
• Improves scalability at the client and in the database for embedded R execution
• Enhanced linear algebra performance using Intel's MKL, AMD's ACML, and the Sun Performance Library for Solaris
• Enterprise support for customers of the Oracle Advanced Analytics option, Big Data Appliance, and Oracle Enterprise Linux
• Free download
• Oracle contributes bug fixes and enhancements to open source R

Exploring Available Data
• HDFS, Oracle Database, file system

HDFS                             Database             File System
hdfs.pwd()                       ore.ls()             getwd()
hdfs.ls()                        names(ONTIME_S)      dir()   # or list.files()
hdfs.mkdir("xq")                 head(ONTIME_S,3)     dir.create("/home/oracle/orch")
hdfs.cd("xq")                                         setwd("/home/oracle/orch")
hdfs.ls()                                             dat <- read.csv("ontime_s.dat")
hdfs.size("ontime_s")                                 head(dat)
hdfs.parts("ontime_s")
hdfs.sample("ontime_s",lines=3)
Load data in HDFS

• Data from a file – use hdfs.upload(); the key is the first column (YEAR)
  hdfs.rm('ontime_File')
  ontime.dfs_File <- hdfs.upload('ontime_s2000.dat', dfs.name='ontime_File')
  hdfs.exists('ontime_File')

• Data from a database table – use hdfs.push(); key column: DEST
  hdfs.rm('ontime_DB')
  ontime.dfs_D <- hdfs.push(ontime_s2000, key='DEST', dfs.name='ontime_DB')
  hdfs.exists('ontime_DB')

• Data from an R data.frame – use hdfs.put(); key column: DEST
  hdfs.rm('ontime_R')
  ontime <- ore.pull(ontime_s2000)
  ontime.dfs_R <- hdfs.put(ontime, key='DEST', dfs.name='ontime_R')
  hdfs.exists('ontime_R')

hadoop.exec() concepts
1. Mapper
   1. Receives a set of rows from an HDFS file as (key, value) pairs
   2. The key has the same data type as that of the input
   3. The value can be of type list or data.frame
   4. The mapper outputs (key, value) pairs using orch.keyval()
   5. The value can be ANY R object 'packed' using orch.pack()
2. Reducer
   1. Receives the (packed) input of the type generated by a mapper
   2. The reducer outputs (key, value) pairs using orch.keyval()
   3. The value can be ANY R object 'packed' using orch.pack()
3. Variables from the R environment can be exported to the mappers and reducers in the Hadoop environment using orch.export() (optional)
4. Job configuration (optional)

ORCH Dry Run
• Enables R users to test R code locally on a laptop before submitting the job to the Hadoop cluster
  – Supports testing/debugging of scripts
  – orch.dryrun(TRUE)
• A Hadoop cluster is not required for a dry run
• Sequential execution of mapper and reducer code
• Creates row streams from the HDFS input into the mapper and reducer
  – Constrained by the memory available to R
  – Recommended to subset / sample the input data to fit in memory
• Upon job success, the resulting data is put in HDFS
• No change in the R code is required for a dry run

Example: Test script in "dry run" mode
Take the average arrival delay for all flights to SFO

orch.dryrun(T)
dfs <- hdfs.attach('ontime_R')
res <- NULL
res <- hadoop.run(
  dfs,
  mapper = function(key, ontime) {
    if (key == 'SFO') { keyval(key, ontime) }
  },
  reducer = function(key, vals) {
    sumAD <- 0
    count <- 0
    for (x in vals) {
      if (!is.na(x$ARRDELAY)) { sumAD <- sumAD + x$ARRDELAY; count <- count + 1 }
    }
    res <- sumAD / count
    keyval(key, res)
  }
)
res
hdfs.get(res)

Example: Test script on the Hadoop Cluster – one change
The same script runs on the cluster; the only change is the first line, orch.dryrun(F) instead of orch.dryrun(T).

Executing a Hadoop Job in Dry-Run Mode
[Diagram: the Linux client, with Oracle R Distribution and the ORCH client package, retrieves data from HDFS and executes the script locally in the laptop R engine; the BDA runs the Hadoop cluster software, Oracle R Distribution, and the ORCH driver package.]

Executing a Hadoop Job on the Hadoop Cluster
[Diagram: the Linux client submits the MapReduce job to the Hadoop cluster; mappers and reducers execute using R instances on the BDA task nodes.]
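Before the full worked example that follows, a minimal skeleton tying together the pieces from the hadoop.exec() concepts slide: orch.export() ships R session variables to the mappers and reducers, and orch.keyval()/orch.pack() emit the results. This is an illustrative sketch only, not from the original deck; `threshold` is an assumed variable, and orch.unpack() is assumed as the counterpart of orch.pack().

dfs <- hdfs.attach('ontime_R')        # the ONTIME data loaded into HDFS earlier
threshold <- 1000                     # R session variable shipped to the job via orch.export()

res <- hadoop.exec(
  dfs.id = dfs,
  mapper = function(k, v) {
    # keep only long flights, using the exported threshold; pack an arbitrary R object as the value
    keep <- v[!is.na(v$DISTANCE) & v$DISTANCE > threshold, ]
    orch.keyval(k, orch.pack(keep))
  },
  reducer = function(k, vv) {
    # unpack each mapper value (orch.unpack() assumed) and count the qualifying rows per key
    n <- sum(sapply(vv, function(x) nrow(orch.unpack(x))))
    orch.keyval(k, n)
  },
  export = orch.export(threshold),
  config = new("mapred.config", job.name = "long.flights",
               reduce.output = data.frame(key = '', val = 0))
)
hdfs.get(res)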
Example – script
Take the average distance for flights to SFO by airline

ontime <- ore.pull(ONTIME_S[ONTIME_S$YEAR==2000,])
ontime.dfs <- hdfs.put(ontime, key='UNIQUECARRIER')
res <- NULL
res <- hadoop.run(
  ontime.dfs,
  mapper = function(key, ontime) {
    if (ontime$DEST == 'SFO') { keyval(key, ontime) }
  },
  reducer = function(key, vals) {
    sumAD <- 0; count <- 0
    for (x in vals) {
      if (!is.na(x$DISTANCE)) { sumAD <- sumAD + x$DISTANCE; count <- count + 1 }
    }
    if (count > 0) { res <- sumAD / count } else { res <- 0 }
    keyval(key, res)
  }
)
hdfs.get(res)

• Output is one value pair per airline
• The map function returns key-value pairs where column UNIQUECARRIER is the key
• The reduce function produces the mean distance per airline

   key      val1
1   AA 1361.4643
2   AS  549.4286
3   CO 2507.2857
4   DL 1601.6154
5   HP  541.1538
6   NW 2009.7273
7   TW 1906.0000
8   UA 1134.0821
9   US 2387.5000
10  WN  515.8000

ORCH Details
• Explore files in HDFS
  – hdfs.cd(), hdfs.ls(), hdfs.pwd(), hdfs.mkdir()
  – hdfs.mv(), hdfs.cp(), hdfs.size(), hdfs.sample()
• Interact with HDFS content in the ORCH environment
  – Metadata discovery: hdfs.attach(), or hand-create the metadata
  – Working with in-memory R objects: hdfs.get(), hdfs.put()
    database objects: hdfs.push(), hdfs.pull()
    local files on the laptop: hdfs.upload(), hdfs.download()
• Obtain the ORCH metadata descriptor
  – hdfs.attach() discovers metadata from CSV files
  – Or, for large files, create the metadata file, named __ORCHMETA__, by hand and copy it to the directory containing the CSV file

Viewing the Metadata File from the command line with hadoop
[Screenshot: the __ORCHMETA__ metadata file structure viewed with the hadoop command-line tool.]

Viewing Metadata from R
[Screenshot: the metadata file contents viewed from R.]

ORCH-required HDFS Metadata Structure
• ORCH hdfs.* functions take HDFS directories (not files) when accessing HDFS data
• Expects a file called __ORCHMETA__
  – Contains metadata about the data in the part-* files
  – If __ORCHMETA__ doesn't exist, it is created during hdfs.attach() by sampling input files and parsing rows
  – Auto-generation may be time consuming if the record length is > 1K
    • Since HDFS tail returns only 1K of data, ORCH must copy the whole file locally to sample it
• The alternative is to create __ORCHMETA__ manually

__ORCHMETA__ structure
Field           Description or value
orch.kvs        TRUE (the data is key-value type)
orch.names      Column names, e.g., "speed","dist"
orch.class      "data.frame"
orch.types      Column types, e.g., "numeric","numeric"
orch.dim        Data dimensions (optional), e.g., 50,2
orch.keyi       Index of the column treated as the key; 0 – key is null ('\t' character at start of row); -1 – key not available (no tab at start of row)
orch.rownamei   Index of the column used for rownames; 0 means no rownames
• There are more fields, but these are sufficient for manually-created metadata
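For large files it can be faster to write __ORCHMETA__ by hand than to let hdfs.attach() sample the data. As a sketch, the field values below describe the two-column speed/dist data used in the field descriptions above; the exact on-disk syntax is not reproduced in this deck, so copy the layout from a small __ORCHMETA__ that hdfs.attach() has generated, then place the hand-written file in the same HDFS directory as the part-* files.

orch.kvs        TRUE                   # the data is in key-value form
orch.names      "speed","dist"         # column names
orch.class      "data.frame"
orch.types      "numeric","numeric"    # column types
orch.dim        50,2                   # optional dimensions
orch.keyi       0                      # key is null ('\t' at the start of each row)
orch.rownamei   0                      # no rownames column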
Working with Hadoop using the ORCH framework
• ORCH supports CSV files
• Simple interaction
  – Run R code in parallel on different chunks of an HDFS file… think ore.rowApply
    • hadoop.exec(file, mapper={…}, reducer={orch.keyvals(k,vv)})
  – Run R code in parallel on partitions (by key) of an HDFS file… think ore.groupApply
    • hadoop.exec(file, mapper={orch.keyval(k,v)}, reducer={…})
• Full MapReduce interaction
  – hadoop.exec(file, mapper={…}, reducer={…})
  – hadoop.exec(file, mapper={…}, reducer={…}, config={…})

ORCH Job Configuration

R> jobconfig = new("mapred.config")
R> class(jobconfig)
[1] "mapred.config"
attr(,"package")
[1] ".GlobalEnv"

Slots of the "mapred.config" object (shown by R> jobconfig) and their defaults:
• job.name – user-defined job name
• map.output – schema of the mapper output (data frame with 0 columns and 0 rows)
• reduce.output – schema of the reducer output (data frame with 0 columns and 0 rows)
• map.valkey – should the key be included in the mapper value? [FALSE]
• reduce.valkey – should the key be included in the reducer value? [FALSE]
• map.tasks – desired number of mappers [-1]
• reduce.tasks – desired number of reducers [-1]
• map.input – data type of the value that is input to the mapper: "vector", data.frame, or list ["vector"]
• min.split.size – desired minimum number of rows sent to a mapper [-1]
• map.split – maximum chunk size desired by the mapper [1]
• reduce.input – data type of the value that is input to the reducer ["list"]
• reduce.split – [0]
• verbose – should diagnostic information be generated? [FALSE]

Predictive Analytics on Hadoop

ORCH Analytic Functions
• orch.lm – Fits a linear model using tall-and-skinny QR (TSQR) factorization and parallel distribution. The function computes the same statistical parameters as the Oracle R Enterprise ore.lm function.
• orch.lmf – Fits a low-rank matrix factorization model using either the jellyfish algorithm or the Mahout alternating least squares with weighted regularization (ALS-WR) algorithm.
• orch.neural – Provides a neural network to model complex, nonlinear relationships between inputs and outputs, or to find patterns in the data.
• orch.nmf – Provides the main entry point for creating a nonnegative matrix factorization model using the jellyfish algorithm. This function can work on much larger data sets than the R NMF package, because the input does not need to fit into memory.

orch.lm Motivation
• LM implementation for Hadoop
• Scalable: add more machines for a linear decrease in run time
• Ability to process more than 1000 columns and an unrestricted number of rows
• Matches the R user experience
  – Functions print, summary
  – Object of class "lm", "orch.lm"

orch.lm Interface
• orch.lm( formula, dfs.data, nReducers = 1 )
  – formula – an object of class "formula", a symbolic description of the model to be fitted
  – dfs.data – HDFS dataset
  – nReducers – the number of reducers (a performance / MR-tree-related parameter)
  – Returns an object of class "orch.lm", which is a list containing the following components:
    • coefficients: a named vector of coefficients
    • rank: the numeric rank of the fitted linear model
    • call: the matched call
    • terms: the 'terms' object used
    • summary (r.squared, adj.r.squared, df, sigma, fstatistic, cov.unscaled)
• print.orch.lm( fit )
  – fit – an object of class "orch.lm" (returned by the orch.lm() function)
• print.summary.orch.lm( fit )
  – fit – object returned by the orch.lm() function

orch.lm Examples

formula <- 'Petal.Width ~ I(Sepal.Length^3) + (Sepal.Width + Petal.Length)^2'
dfs.dat <- hdfs.put(iris)
fit <- orch.lm(formula, dfs.dat)
print(fit)

R> print(fit)
Call:
orch.lm(formula = formula, dfs.dat = dfs.dat)
Coefficients:
 (Intercept)  I(Sepal.Length^3)  Sepal.Width  Petal.Length  Sepal.Width:Petal.Length
-0.558951258       -0.001808531  0.076544835   0.374865543               0.044639138
orch.lm Example

R> summary(fit)
Call:
orch.lm(formula = formula, dfs.dat = dfs.dat)
Residuals:
       Min        Max
-0.5787561  0.5982218
Coefficients:
                             Estimate    Std. Error    t value     Pr(>|t|)
(Intercept)              -0.558951258  0.3114271138 -1.7948060 7.476772e-02
I(Sepal.Length^3)        -0.001808531  0.0003719886 -4.8617906 2.990386e-06
Sepal.Width               0.076544835  0.0936509172  0.8173421 4.150739e-01
Petal.Length              0.374865543  0.0813489249  4.6081192 8.829319e-06
Sepal.Width:Petal.Length  0.044639138  0.0244578742  1.8251438 7.003728e-02
Multiple R-squared: 0.9408, Adjusted R-squared: 0.9392
F-statistic: 576.6 on 4 and 145 DF

orch.neural Characteristics
• General feed-forward neural network for regression
• Enables more than 1000 input columns
• State-of-the-art numerical optimization engine
  – robustness, accuracy, and a small number of data reads
• Scalable: more machines yield a proportional decrease in run time

orch.neural Interface
• orch.neural(dfs.data, targetSize, hiddenSize, hiddenActivation = 'bSigmoid', outputActivation = 'linear', maxit = 100, lowerBound = -1, upperBound = 1)
  – dfs.data – input HDFS comma-separated-value file
  – targetSize – number of output (target) neurons; must be a positive integer
  – hiddenSize – number of hidden neurons; must be a positive integer
  – maxit – maximum number of iterations (nonlinear optimization steps)
  – lowerBound – lower bound for weight initialization
  – upperBound – upper bound for weight initialization
  – hiddenActivation – hidden activation function; possible values: atan, bipolarSigmoid, cos, gaussian, gompertz, linear, sigmoid, reciprocal, sin, square, tanh
  – outputActivation – output activation function
  – Returns an object of class "orch.neural"

orch.neural – Example

## XOR is true if either A or B is true, but not both:
## XOR A B
##   1 1 0
##   1 0 1
##   0 1 1
##   0 0 0
xorData <- data.frame(
  XOR = c(1, 1, 0, 0),
  A   = c(1, 0, 1, 0),
  B   = c(0, 1, 1, 0))

dfsData <- hdfs.put(xorData)

## Predict XOR from the A and B inputs with two hidden neurons
fit <- orch.neural(dfs.data = dfsData, targetSize = 1, hiddenSize = 2,
                   hiddenActivation = 'bSigmoid', outputActivation = 'linear',
                   maxit = 30, lowerBound = -1, upperBound = 1)
pred <- predict(fit, newdata = dfsNewData)
pred

orch.neural – Results

R> print(fit)
Call:
orch.neural(dfs.data = dfsData, targetSize = 1, hiddenSize = 2,
    hiddenActivation = "bSigmoid", outputActivation = "linear",
    maxit = 30, lowerBound = -1, upperBound = 1)
Number of input nodes       2
Number of hidden nodes      2
Number of output nodes      1
Hidden activation function  bSigmoid
Output activation function  linear
Weights:
          V1
1  1.4664790
2 -1.1654687
3 -1.0649696
4  0.5173203
5  6.4870265
6  3.2183494
7 -1.6225545
8  1.7001784
9  2.3405659

R> pred <- predict(fit, newdata = dfsData)
R> hdfs.get(pred)
    pred_XOR XOR A B
1 0.96773529   1 1 0
2 0.94574037   1 0 1
3 0.09825227   0 1 1
4 0.03239104   0 0 0

orch.neural Performance
• Data: 155,671 observations, 46 columns, with missing values

# hidden neurons   Elapsed time (sec), nnet   Elapsed time (sec), orch.neural
10                  934.176                    44.181
20                 1861.812                    44.969
30                 2634.434                    35.196
40                 3674.379                    39.217
50                 4400.551                    49.527

Hardware spec: single BDA node; MemTotal: 49 GB; CPUs: 24 (3058 MHz each)
Observations
• nnet
  – Invoked with linear outputs to accommodate the unscaled target
  – Enables a fair comparison
• orch
  – For data sets smaller than 1.5 gigabytes (e.g., this dataset), orch.neural automatically invokes an in-memory multi-threaded algorithm
  – For big data, it uses MapReduce
orch.neural Performance (continued)
[Chart: elapsed time in seconds versus the number of hidden-layer nodes (10–50) for the benchmark above; nnet grows from roughly 900 to 4,400 seconds while orch.neural stays below 50 seconds throughout.]

Demo Programs

Hadoop Analytics – available demo programs using ORCH
• Bagged Clustering – "bagging" with two-level clustering
  – the mapper performs k-means clustering on its subset of the data
  – the reducer combines the centroids generated by the mappers into a hierarchical cluster (single linkage) and stores the hclust object in HDFS, which can then be used by the R user for voting
• Item-Item Similarity: distance.R
  – Calculates a Euclidean-distance similarity matrix among movies based on user ratings
• K-Means
  – Clustering using a MapReduce version of Lloyd's algorithm (requires ORE)
• PCA
  – Computes the essential statistics needed for the PCA calculation of a dataset by creating a tree of multiple MapReduce jobs, reducing the number of records to process at each stage
  – The final MapReduce job merges its inputs to generate the statistics for the whole data set
• Pearson Correlation
  – Calculates Pearson's correlation among movies based on user ratings
• Logistic Regression by Gradient Descent

Example: One-Dimension Logistic Regression by Gradient Descent

input <- hdfs.put(cars)

mapred.logreg <- function(input, iterations = 3, alpha = 0.1)
{
  plane <- 0                                # initialize the separating plane
  g <- function(z) 1/(1 + exp(-z))          # define the logistic function
  mapf <- data.frame(val1 = 1, val2 = 1)    # specify the form of the map output

  for (i in 1:iterations) {
    gradient <- hadoop.run(
      input,
      mapper = function(k, v) {
        # compute the gradient of the loss function
        orch.keyval(1, v$speed * v$dist * g(-v$speed * (plane * v$dist)))
      },
      reducer = function(k, vv) {
        # a single key, so only a single reducer
        vv <- sapply(vv, unlist)
        orch.keyval(k, sum(vv))
      },
      export = orch.export(plane, g),       # make plane and g available to the mapper and reducer
      config = new("mapred.config",
                   job.name   = "logistic.regression",
                   map.output = mapf)
    )
    gradient <- hdfs.get(gradient)
    plane <- plane + alpha * gradient$val2  # update the separating plane
    orch.dlogv2(gradient, plane)
  }
  (plane)                                   # return the plane when the iterations complete
}

# Invocation
plane <- mapred.logreg(input)
print(plane)
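The Bagged Clustering demo listed on the "Hadoop Analytics" slide above follows the same shape as this gradient-descent example: do local work in each mapper, then combine in a single reducer. A minimal sketch of that pattern follows; it is illustrative only, not the shipped demo code, and the input data set, choice of columns, k = 3, and the use of orch.unpack() are all assumptions.

dfs <- hdfs.attach('ontime_R')
res <- hadoop.run(
  dfs,
  mapper = function(k, v) {
    # k-means on this mapper's chunk of rows; emit the centroids under a single key
    dat <- na.omit(v[, c("ARRDELAY", "DISTANCE")])
    fit <- kmeans(dat, centers = 3)
    orch.keyval(1, orch.pack(fit$centers))
  },
  reducer = function(k, vv) {
    # one key, so one reducer: combine all mapper centroids with single-linkage clustering
    centers <- do.call(rbind, lapply(vv, orch.unpack))
    orch.keyval(k, orch.pack(hclust(dist(centers), method = "single")))
  }
)
# res identifies the HDFS result holding the packed hclust object, which the R user can retrieve for voting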
ORCHhive

What is Hive?
• A SQL-like abstraction on Hadoop
• Becoming the de facto standard for SQL-based applications on Hadoop
• Converts SQL queries to MapReduce jobs to be run on Hadoop
• Provides a simple query language (HQL) based on SQL
• Enables non-Java users to leverage Hadoop via a SQL-like interface

Motivation for ORCHhive
• "Big data" scalability and performance for R users on Hadoop
• Enables R users to clean, explore, and prepare HIVE data transparently
• Readies data for analytic techniques using the ORCH MapReduce framework
• ORE provides transparent access to database tables and views from R based on SQL mapping
• Since Hive is SQL-based, it is a natural extension to provide ORE-type transparency on top of Hive HQL, giving R users access to HDFS data

Supported R Functions
• Storage methods
  – ore.create, ore.drop, ore.push, ore.pull, ore.get
• Methods
  – is.ore.frame, is.ore.vector, is.ore.logical, is.ore.integer, is.ore.numeric, is.ore.character, is.ore, as.ore.frame, as.ore.vector, as.ore.logical, as.ore.integer, as.ore.numeric, as.ore.character, as.ore
• ore.frame methods
  – show, attach, [, $, $<-, [[, [[<-, head, tail, length, nrow, ncol, NROW, NCOL, dim, names, names<-, colnames, colnames<-, as.list, unlist, summary, rbind, cbind, data.frame, as.data.frame, as.env, eval, +, -, *, ^, %%, %/%, /, Compare, Logic, !, xor, is.na, is.finite, is.infinite, is.nan, abs, sign, sqrt, ceiling, floor, trunc, log, log10, log2, log1p, logb, acos, asin, atan, exp, expm1, cos, sin, tan, round, Summary, rowSums, colSums, rowMeans, colMeans, unique, by, merge
• ore.vector methods
  – show, length, c, is.vector, as.vector, as.character, as.numeric, as.integer, as.logical, "[", "[<-", I, Compare, ore.recode, is.na, "%in%", unique, sort, table, paste, tapply, by, head, tail
• ore.logical methods
  – <, >, ==, <=, >=, !, xor, ifelse, and, or
• ore.number methods
  – +, -, *, ^, %%, %/%, /, is.finite, is.infinite, is.nan, abs, sign, sqrt, ceiling, floor, trunc, log, log10, log2, log1p, logb, acos, asin, atan, exp, expm1, cos, sin, tan, zapsmall, round, Summary, summary, mean
• ore.character methods
  – nchar, tolower, toupper, casefold, gsub, substr, substring
• Aggregate functions
  – OREStats: fivenum, aggregate, quantile, sd, var (only for vectors), median, IQR

Example using OREhive

ore.connect(type="HIVE")
ore.attach()

# create a Hive table by pushing the numeric columns of the iris data set
IRIS_TABLE <- ore.push(iris[1:4])

# Create bins based on Petal Length
IRIS_TABLE$PetalBins <-
  ifelse(IRIS_TABLE$Petal.Length < 2.0, "SMALL PETALS",
  ifelse(IRIS_TABLE$Petal.Length < 4.0, "MEDIUM PETALS",
  ifelse(IRIS_TABLE$Petal.Length < 6.0, "MEDIUM LARGE PETALS",
                                        "LARGE PETALS")))

# PetalBins is now a derived column of the HIVE object
> names(IRIS_TABLE)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length"
[4] "Petal.Width"  "PetalBins"

# Based on the bins, generate summary statistics for each group
aggregate(IRIS_TABLE$Petal.Length,
          by = list(PetalBins = IRIS_TABLE$PetalBins),
          FUN = summary)
1         LARGE PETALS 6 6.025000 6.200000 6.354545 6.612500 6.9 0
2  MEDIUM LARGE PETALS 4 4.418750 4.820000 4.888462 5.275000 5.9 0
3        MEDIUM PETALS 3 3.262500 3.550000 3.581818 3.808333 3.9 0
4         SMALL PETALS 1 1.311538 1.407692 1.462000 1.507143 1.9 0
Warning message:
ORE object has no unique key - using random order

Comparison of RHIPE and ORCH
• Architecture – external dependencies
  – RHIPE: requires Google Protocol Buffers to be deployed on every Hadoop node
  – ORCH: no dependency
• Architecture – database support
  – RHIPE: no database support; can work only on HDFS-resident data
  – ORCH: in addition to HDFS, can source data from an Oracle database and place results back in an Oracle database; data written back to the database can be processed further using the Oracle R Enterprise framework
• Ability to test and debug the same MapReduce R code on a local system with small data before executing it on the Hadoop cluster against the full HDFS data
  – RHIPE: no support; requires a significant rewrite of the R code to make it "Hadoop" compatible
  – ORCH: no rewrite required; supports local execution of MapReduce R functions for debugging, with detailed feedback on the MapReduce execution; local execution occurs on a sample of the HDFS data; enables execution of MapReduce R code even when disconnected from the Hadoop cluster
• Convenience of mapper and reducer specification
  – RHIPE: mappers and reducers are specified as R expression objects instead of functions; requires hard-coded names, e.g., map.values and reduce.values, in the mapper and reducer
  – ORCH: mappers and reducers are specified as functions; allows user-specified parameter names
• Use cases supported
  – RHIPE: restricted to using Hadoop for use cases that are determined a priori and supported by IT through explicit job configuration on Hadoop
  – ORCH: allows R users to INTERACTIVELY use Hadoop for whatever problems they see fit, WITHOUT requiring any a priori job-management setup; expands the supported use cases to a superset of those possible with other solutions
• Support for HDFS data discovery
  – RHIPE: limited; rhdel, rkls, rhget, rhput, rhcp, rhwrite, rhread (the counterparts of hdfs.rm, hdfs.ls, hdfs.download, hdfs.upload, hdfs.cp, hdfs.put, hdfs.get)
  – ORCH: full HDFS support – hdfs.rm, hdfs.ls, hdfs.download, hdfs.upload, hdfs.cp, hdfs.put, hdfs.get, plus hdfs.attach (metadata discovery), hdfs.sample (data sampling to explore data too big for R memory), hdfs.cd (a Unix-like feel from within R), and hdfs.push/hdfs.pull (database connectivity)
• Encoding of metadata corresponding to data in an HDFS file
  – RHIPE: requires metadata to be encoded along with the data; existing HDFS files MUST first be augmented with row-wise metadata before they can be processed using RHIPE; encoding metadata with the data expands the size of HDFS files several fold and affects performance
  – ORCH: treats metadata and data separately; metadata is derived by sampling the HDFS file and is created and maintained separately from the data, which has a large positive impact on performance
• Ability to exploit the inherent structure of data resident in HDFS files for better performance
  – RHIPE: no support; does not allow the use of existing data in HDFS as direct input to your MapReduce job; passes the data as a list to the mapper and reducer, which must then be converted to a data.frame, resulting in performance and memory issues; requires data preprocessing in which R objects are created and the data is serialized as data.frames before being written to HDFS, otherwise the conversion must be done in the main MapReduce R scripts
  – ORCH: handles HDFS data.frames implicitly; supports data.frames natively as one possible input type to the mapper and reducer functions, resulting in improved performance; use hdfs.attach() for true metadata discovery
• API ease of use (see the code example below)
  – RHIPE: Hadoop internals get in the way of the program's intent
  – ORCH: focus remains on the problem being solved; the R code is streamlined and clear
• Support for Hive
  – RHIPE: no support
  – ORCH: works directly with HIVE tables; R users can clean, explore, and prepare data in HDFS using ORCH's HIVE transparency layer, making the data more conducive to applying analytical techniques with ORCH's MapReduce framework
• Proprietary Hadoop-based predictive techniques
  – RHIPE: no support
  – ORCH: matrix factorization for recommendations; linear regression and neural networks for predictions on high-velocity data
• Support for passing R objects from the R session to MapReduce jobs and accessing them in MapReduce functions
  – RHIPE: no support
  – ORCH: users can pass any R object from the R environment into the MapReduce job for use by the mapper and reducer functions; use orch.export()
• Limitations on <key, value> data sizes on read
  – RHIPE: imposes a limit of 256 MB on the <key, value> data read into a mapper or reducer
  – ORCH: no limit imposed; the data can be as large as R memory allows

Code Example – RHIPE and ORCH

RHIPE:
library(Rhipe)
map <- expression({
  words_vector <- unlist(strsplit(unlist(map.values), " "))
  lapply(words_vector, function(i) { rhcollect(i, 1) })
})
reduce <- expression(
  pre    = { total = 0 },
  reduce = { total <- sum(total, unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)
mapred <- list(rhipe_map_buff_size = 20, mapred.job.tracker = 'local')
job_object <- rhmr(map = map, reduce = reduce,
                   inout = c("text", "sequence"),
                   ifolder = "/sample_1", ofolder = "/output_02",
                   mapred = mapred, jobname = "word_count")
rhex(job_object)

ORCH:
library(ORCH)
x <- hdfs.put("/etc/passwd")
xn <- hadoop.run(x,
  mapper = function(key, val) {
    words.in.line <- length(strsplit(val, ' ')[[1]])
    orch.keyval(NULL, words.in.line)
  },
  reducer = function(key, vals) {
    cnt <- 0
    for (val in vals) { cnt <- cnt + val }
    orch.keyvals(key, cnt)
  }
)
hdfs.get(xn)

Using ORCH and ORE Together

ORCH and ORE Interaction
• If ORE is installed on the R client along with ORCH
  – Copy ore.frames (data tables) to HDFS
  – Perform ORE pre-processing on data fed to MapReduce jobs
  – Perform ORE post-processing on the results of MapReduce jobs once the data is moved from HDFS to Oracle Database
• If ORE is installed on the (BDA) task nodes
  – Include ORE calls in mapper and reducer functions (taking care not to overwhelm a single database server with too many mapper tasks)
• If ORCH is installed on the Oracle Database server
  – Use embedded R execution to invoke ORCH functionality
  – Schedule database jobs (DBMS_SCHEDULER) to automatically execute scripts containing ORCH function calls

Summary
• Oracle R Connector for Hadoop allows R users to leverage a Hadoop cluster, with HDFS and MapReduce, from R
• Mapper and reducer functions are written in R
• The ORCH HDFS interface works transparently with database data, file data, and R data.frames
• MapReduce jobs can be submitted for non-cluster (local) execution or for execution on the Hadoop cluster
• Advanced analytics algorithms are packaged with ORCH
• Manipulate HIVE data transparently from R
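As a closing illustration of the first interaction pattern above (ORE on the R client together with ORCH), the sketch below pre-processes data in the database with ORE, runs a MapReduce job with ORCH, and writes the result back for further ORE processing. It is a sketch only; the result table name is an assumption, while the ONTIME_S data and the functions used all appear earlier in this deck.

# 1. ORE pre-processing: filter in-database, then pull the needed columns to the client
ontime <- ore.pull(ONTIME_S[ONTIME_S$YEAR == 2000, c("DEST", "ARRDELAY")])

# 2. ORCH: stage the data in HDFS and compute the mean arrival delay per destination
dfs <- hdfs.put(ontime, key = "DEST")
res <- hadoop.run(dfs,
  mapper  = function(k, v) orch.keyval(k, v),
  reducer = function(k, vv) {
    delays <- unlist(lapply(vv, function(x) x$ARRDELAY))
    orch.keyval(k, mean(delays, na.rm = TRUE))
  })

# 3. ORE post-processing: bring the result back and store it in Oracle Database
out <- hdfs.get(res)
ore.create(out, table = "AVG_DELAY_BY_DEST")   # assumed result table name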