Learning R Series
Session 6: Oracle R Connector for Hadoop 2.0
Mark Hornick, Senior Manager, Development
Oracle Advanced Analytics
©2013 Oracle – All Rights Reserved
Learning R Series 2012
Session 1: Introduction to Oracle's R Technologies and Oracle R Enterprise 1.3
Session 2: Oracle R Enterprise 1.3 Transparency Layer
Session 3: Oracle R Enterprise 1.3 Embedded R Execution
Session 4: Oracle R Enterprise 1.3 Predictive Analytics
Session 5: Oracle R Enterprise 1.3 Integrating R Results and Images with OBIEE Dashboards
Session 6: Oracle R Connector for Hadoop 2.0 – New Features and Use Cases
©2013 Oracle – All Rights Reserved
2
The following is intended to outline our general product direction. It
is intended for information purposes only, and may not be
incorporated into any contract. It is not a commitment to deliver
any material, code, or functionality, and should not be relied upon
in making purchasing decisions.
The development, release, and timing of any features or
functionality described for Oracle’s products remain at the sole
discretion of Oracle.
3
Topics
• What is Hadoop?
• Oracle R Connector for Hadoop
• Predictive Analytics on Hadoop
• ORCHhive
• Comparison of RHIPE with ORCH
• Summary
©2013 Oracle – All Rights Reserved
4
©2013 Oracle – All Rights Reserved
5
What is Hadoop?
• Scalable fault-tolerant distributed system for data storage and processing
• Enables analysis of Big Data
– Can store huge volumes of unstructured data, e.g., weblogs, transaction data, social media data
– Enables massive data aggregation
– Highly scalable and robust
– Problems move from processor bound (small data, complex computations) to
data bound (huge data, often simple computations)
• Originally sponsored by Yahoo!, then an Apache project, with commercial distribution by Cloudera
– Open source under the Apache license
• Based on Google's GFS and BigTable whitepapers (2006)
©2013 Oracle – All Rights Reserved
6
What is Hadoop?
• Consists of two key services
– Hadoop Distributed File System (HDFS)
– MapReduce
• Other Projects based on core Hadoop
– Hive, Pig, HBase, Flume, Oozie, Sqoop, and others
©2013 Oracle – All Rights Reserved
7
Classic Hadoop-type problems
• Modeling true risk
• Customer churn analysis
• Recommendation engine
• Ad targeting
• PoS transaction analysis
• Analyzing network data to predict failure
• Thread analysis
• Trade surveillance
• Search quality
• Data "sandbox"
http://www.slideshare.net/cloudera/20100806-cloudera-10-hadoopable-problems-webinar-4931616
©2013 Oracle – All Rights Reserved
8
Type of analysis using Hadoop
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment
©2013 Oracle – All Rights Reserved
9
Hadoop Publicized Examples
• A9.com (Amazon)
– Product search indices
• Adobe
– Social services & structured data store
• eBay
– Search optimization & research
• Facebook
– User growth, page views, ad campaign analysis
• Journey Dynamics
– Forecast traffic speeds from GPS data
• Lineberger Comprehensive Cancer Center
– Analyzes Next Generation Sequence data for the Cancer Genome Atlas
• NAVTEQ Media Solutions
– Optimizes ad selection based on user interactions
• Twitter
– Stores and processes Tweets
• Yahoo!
– Research into ad systems & web search
©2013 Oracle – All Rights Reserved
10
Hadoop for data-bound problems, examples
• Facebook – over 70 PB of data, 3000+ nodes, unified storage, uses Hive extensively
• eBay – over 5 PB of data, 500+ nodes
– Clickstreams data, enterprise data warehouses, product descriptions, and images
– produce search indices, analytics reports, mining models
– Enhance the search relevance for eBay’s items, use Hadoop to build ranking functions that
take multiple factors into account, like price, listing format, seller track record, and relevance
– Hadoop enables adding new factors, to see if they improve overall search relevance
• Tracing fraud backward
– Store more of the data to track fraud more easily
– Generate terabytes per hour, keep it online for analysis
• Characteristics
– Complex data from multiple data sources, with data being generated at terabytes per day
– Batch processing in parallel, with computation taken to the data
Note: Moving 100 GB of data can take well over 20 minutes
©2013 Oracle – All Rights Reserved
11
Key features of Hadoop
• Support for partial failures
• Data recoverability
• Component recovery
• Consistency
• Scalability
• Applications written in high-level language
• Shared nothing architecture
• Computation occurs where the data reside, whenever possible
©2013 Oracle – All Rights Reserved
12
MapReduce
• Provides parallelization and distribution with fault tolerance
• MapReduce programs provide access to data on Hadoop
• "Map" phase
– Map task typically operates on one HDFS block of data
– Map tasks process smaller problems, store results in HDFS, and report success to the JobTracker
• "Reduce" phase
– Reduce task receives sorted subsets of Map task results
– One or more reducers compute answers to form final answer
– Final results stored in HDFS
• Computational processing can occur on unstructured or structured data
• Abstracts all “housekeeping” away from the developer
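As a concrete illustration of the two phases, the following is a minimal sketch in plain R (no Hadoop involved) that mimics the map, shuffle-and-sort, and reduce steps for a word count; the ORCH version of this job appears later in this session.

  # Two "blocks" of input text, standing in for HDFS blocks
  blocks <- list("big data big", "word count example big")

  # Map phase: emit (word, 1) for every word in each block
  mapped <- unlist(lapply(blocks, function(block) {
    words <- strsplit(block, " ")[[1]]
    setNames(rep(1, length(words)), words)
  }))

  # Shuffle and sort: group intermediate values by key (the word)
  grouped <- split(unname(mapped), names(mapped))

  # Reduce phase: sum the counts for each key
  reduced <- sapply(grouped, sum)
  reduced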
©2013 Oracle – All Rights Reserved
13
Map Reduce Example – Graphically Speaking
[Diagram: HDFS DataNodes feed (key, values…) records to map tasks; each mapper emits (key A, values…), (key B, values…), and (key C, values…) pairs; shuffle and sort aggregates intermediate values by output key; one reduce task per key produces the final key A, key B, and key C values.]
©2013 Oracle – All Rights Reserved
14
Text analysis example
• Count the number of times each word occurs in a corpus of documents
• Documents are divided into blocks in HDFS; one mapper runs per "block" of data
• Each mapper outputs the word and a count of 1 each time a word is encountered, e.g.:

  Key      Value
  The      1
  Big      1
  Data     1
  word     1
  count    1
  example  1
  Big      1
  ...

• Shuffle and sort aggregates the key-value pairs by key
• One or more reducers combine the results: the reducer that receives only the key-value pairs for the word "Big" sums up the counts, then outputs the final key-value result:

  Key      Value
  Big      2040

©2013 Oracle – All Rights Reserved
15
Map Reduce Example – Graphically Speaking
For “Word Count”
[Diagram: the same map/shuffle/reduce flow as the previous slide, annotated for word count. There is no key, only a value, as input to the mapper; the mapper output is a set of key-value pairs where the key is the word and the value is the count = 1. Shuffle and sort aggregates intermediate values by output key. Each reducer receives the values for a word (key is the word, value is a set of counts) and outputs the word as key and the sum as value.]
©2013 Oracle – All Rights Reserved
16
Oracle R Connector for Hadoop
©2013 Oracle – All Rights Reserved
17
Mapper and reducer code in ORCH for “Word Count”
# Load and clean the corpus, then load it into HDFS
corpus <- scan("corpus.dat", what=" ", quiet=TRUE, sep="\n")
corpus <- gsub("([/\\\":,#.@-])", " ", corpus)
input  <- hdfs.put(corpus)

# Specify and invoke the map-reduce job
res <- hadoop.exec(dfs.id = input,
  mapper = function(k,v) {
    # Split words and output each word with a count of 1
    x <- strsplit(v[[1]], " ")[[1]]
    x <- x[x != '']
    out <- NULL
    for(i in 1:length(x))
      out <- c(out, orch.keyval(x[i], 1))
    out
  },
  reducer = function(k,vv) {
    # Sum the count for each word
    orch.keyval(k, sum(unlist(vv)))
  },
  config = new("mapred.config",
    job.name      = "wordcount",
    map.output    = data.frame(key='', val=0),
    reduce.output = data.frame(key='', val=0))
)

res
hdfs.get(res)
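Because reduce.output declares key and val columns, hdfs.get(res) returns an R data.frame with those columns; a hypothetical inspection of the word counts (column names per the schema above) might be:

  counts <- hdfs.get(res)               # data.frame with columns key (word) and val (count)
  head(counts[order(-counts$val), ])    # most frequent words first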
©2013 Oracle – All Rights Reserved
18
Oracle R Connector for Hadoop
[Architecture diagram: an R client running ORD (Oracle R Distribution) and CRAN packages submits R scripts to the Hadoop Cluster (Big Data Appliance) through three interfaces – R <-> HDFS, R <-> MapReduce, and R <-> sqoop to Oracle Database. Hadoop jobs run the mapper and reducer as R (ORD) code, with CRAN packages available on the MapReduce and HDFS nodes.]
• Provides transparent access to the Hadoop Cluster: MapReduce and HDFS-resident data
• Access and manipulate data in HDFS, the database, and the file system – all from R
• Write MapReduce functions using R and execute them through a natural R interface
• Leverage CRAN R packages to work on HDFS-resident data
• Transition work from lab to production deployment on a Hadoop cluster without requiring knowledge of Hadoop internals, the Hadoop CLI, or IT infrastructure
©2013 Oracle – All Rights Reserved
19
Oracle R Distribution
• Ability to dynamically load:
– Intel Math Kernel Library (MKL)
– AMD Core Math Library (ACML)
– Solaris Sun Performance Library
• Oracle Support
• Improve scalability at client and database for embedded R execution
• Enhanced linear algebra performance using Intel’s MKL, AMD’s ACML,
and Sun Performance Library for Solaris
• Enterprise support for customers of Oracle Advanced Analytics option,
Big Data Appliance, and Oracle Enterprise Linux
• Free download
• Oracle to contribute bug fixes and enhancements to open source R
©2013 Oracle – All Rights Reserved
20
Exploring Available Data
• HDFS, Oracle Database, file system
HDFS
  hdfs.pwd()
  hdfs.ls()
  hdfs.mkdir("xq")
  hdfs.cd("xq")
  hdfs.ls()
  hdfs.size("ontime_s")
  hdfs.parts("ontime_s")
  hdfs.sample("ontime_s", lines=3)

Database
  ore.ls()
  names(ONTIME_S)
  head(ONTIME_S, 3)

File System
  getwd()
  dir()            # or list.files()
  dir.create("/home/oracle/orch")
  setwd("/home/oracle/orch")
  dat <- read.csv("ontime_s.dat")
  head(dat)
©2013 Oracle – All Rights Reserved
21
Load data in HDFS
Data from a file – use hdfs.upload (key is the first column: YEAR)
  hdfs.rm('ontime_File')
  ontime.dfs_File <- hdfs.upload('ontime_s2000.dat', dfs.name='ontime_File')
  hdfs.exists('ontime_File')

Data from a database table – use hdfs.push (key column: DEST)
  hdfs.rm('ontime_DB')
  ontime.dfs_D <- hdfs.push(ontime_s2000, key='DEST', dfs.name='ontime_DB')
  hdfs.exists('ontime_DB')

Data from an R data.frame – use hdfs.put (key column: DEST)
  hdfs.rm('ontime_R')
  ontime <- ore.pull(ontime_s2000)
  ontime.dfs_R <- hdfs.put(ontime, key='DEST', dfs.name='ontime_R')
  hdfs.exists('ontime_R')
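The reverse paths exist as well: hdfs.download() writes HDFS data back to a local file and hdfs.pull() moves it into an Oracle Database table (both covered under ORCH Details later). A hedged sketch, with argument names and the file/table names purely illustrative:

  hdfs.download('ontime_R', 'ontime_local.dat')   # HDFS -> local file (arguments illustrative)
  ontime.db <- hdfs.pull('ontime_R')              # HDFS -> Oracle Database; assumed to return an ore.frame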
©2013 Oracle – All Rights Reserved
22
hadoop.exec() concepts
1. Mapper
   1. Receives a set of rows from the HDFS file as (key, value) pairs
   2. Key has the same data type as that of the input
   3. Value can be of type list or data.frame
   4. Mapper outputs (key, value) pairs using orch.keyval()
   5. Value can be ANY R object 'packed' using orch.pack()
2. Reducer
   1. Receives (packed) input of the type generated by a mapper
   2. Reducer outputs (key, value) pairs using orch.keyval()
   3. Value can be ANY R object 'packed' using orch.pack()
3. Variables from the R environment can be exported to the mappers and reducers in the Hadoop environment using orch.export() (optional)
4. Job configuration (optional)
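Putting these concepts together, here is a hedged skeleton of a hadoop.exec() call (not a complete job): input, the DELAY column, and the job name are illustrative placeholders.

  threshold <- 30                                  # ordinary R variable, exported to the job below

  res <- hadoop.exec(dfs.id = input,
    mapper = function(k, v) {
      # v arrives as a data.frame (or list) of rows; k has the input key's type
      late <- v$DELAY[!is.na(v$DELAY) & v$DELAY > threshold]
      if (length(late) > 0)
        orch.keyval(k, sum(late))                  # emit (key, value) pairs with orch.keyval()
    },
    reducer = function(k, vv) {
      orch.keyval(k, sum(unlist(vv)))              # combine mapper outputs for key k
    },
    export = orch.export(threshold),               # make threshold visible to mapper and reducer
    config = new("mapred.config", job.name = "late.delay.sum",
                 map.output    = data.frame(key = '', val = 0),
                 reduce.output = data.frame(key = '', val = 0))
  )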
©2013 Oracle – All Rights Reserved
23
ORCH Dry Run
• Enables R users to test R code locally on laptop
before submitting job to Hadoop Cluster
– Supports testing/debugging of scripts
– orch.dryrun(TRUE)
• Hadoop Cluster is not required for a dry run
• Sequential execution of mapper and reducer code
• Creates row streams from HDFS input into mapper and reducer
– Constrained by the memory available to R
– Recommended to subset or sample the input data so it fits in memory
• Upon job success, resulting data put in HDFS
• No change in the R code is required for dry run
©2013 Oracle – All Rights Reserved
24
Example: Test script in “dry run” mode
Take the average arrival delay for all flights to SFO
orch.dryrun(T)
dfs <- hdfs.attach('ontime_R')
res <- NULL
res <- hadoop.run(
dfs,
mapper = function(key, ontime) {
if (key == 'SFO') {
keyval(key, ontime)
}
},
reducer = function(key, vals) {
sumAD <- 0
count <- 0
for (x in vals) {
if (!is.na(x$ARRDELAY)) {sumAD <- sumAD + x$ARRDELAY; count <- count + 1}
}
res <- sumAD / count
keyval(key, res)
}
)
res
hdfs.get(res)
©2013 Oracle – All Rights Reserved
25
Example: Test script on Hadoop Cluster – one change
Take the average arrival delay for all flights to SFO
orch.dryrun(F)
dfs <- hdfs.attach('ontime_R')
res <- NULL
res <- hadoop.run(
dfs,
mapper = function(key, ontime) {
if (key == 'SFO') {
keyval(key, ontime)
}
},
reducer = function(key, vals) {
sumAD <- 0
count <- 0
for (x in vals) {
if (!is.na(x$ARRDELAY)) {sumAD <- sumAD + x$ARRDELAY; count <- count + 1}
}
res <- sumAD / count
keyval(key, res)
}
)
res
hdfs.get(res)
©2013 Oracle – All Rights Reserved
26
Executing Hadoop Job in Dry-Run Mode
[Diagram: the Linux client – running Oracle R Distribution and the Oracle R Connector for Hadoop client – retrieves data from HDFS and executes the script locally in the laptop R engine. The BDA hosts the Hadoop cluster software, Oracle R Distribution, and the Oracle R Connector for Hadoop driver package.]
©2013 Oracle – All Rights Reserved
27
Executing Hadoop Job on Hadoop Cluster
[Diagram: the Linux client submits the MapReduce job to the Hadoop cluster on the BDA; mappers and reducers execute in R instances on the BDA task nodes, which run Oracle R Distribution and the Oracle R Connector for Hadoop driver package.]
©2013 Oracle – All Rights Reserved
28
Example – script
Take the average distance for flights to SFO by airline
ontime <- ore.pull(ONTIME_S[ONTIME_S$YEAR==2000,])
ontime.dfs <- hdfs.put(ontime, key='UNIQUECARRIER' )
res <- NULL
res <- hadoop.run(
ontime.dfs,
mapper = function(key, ontime) {
if (ontime$DEST == 'SFO') {
keyval(key, ontime)
}
},
reducer = function(key, vals) {
sumAD <- 0; count <- 0
for (x in vals) {
if (!is.na(x$DISTANCE)) {
sumAD <- sumAD + x$DISTANCE; count <- count + 1
}
}
if (count > 0) { res <- sumAD / count } else { res <- 0 }
keyval(key, res)
}
)
hdfs.get(res)
• Output is one value pair per airline
• The map function returns key-value pairs where column UNIQUECARRIER is the key
• The reduce function produces the mean distance per airline

   key      val1
1   AA 1361.4643
2   AS  515.8000
3   CO 2507.2857
4   DL 1601.6154
5   HP  549.4286
6   NW 2009.7273
7   TW 1906.0000
8   UA 1134.0821
9   US 2387.5000
10  WN  541.1538

©2013 Oracle – All Rights Reserved
29
ORCH Details
• Explore files in HDFS
– hdfs.cd(), hdfs.ls(), hdfs.pwd(), hdfs.mkdir()
– hdfs.mv(), hdfs.cp(), hdfs.size(), hdfs.sample()
• Interact with HDFS content in ORCH environment
– Metadata discovery : hdfs.attach() or hand-create metadata
– Working with
  • in-memory R objects: hdfs.get(), hdfs.put()
  • database objects: hdfs.push(), hdfs.pull()
  • local files on laptop: hdfs.upload(), hdfs.download()
• Obtain ORCH metadata descriptor
– hdfs.attach() discovers metadata from CSV files
– Or, create metadata, named __ORCHMETA__, for large files by hand and copy it to
directory containing CSV file
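For example, a typical exploration session built only from the functions above might look like this (the HDFS path and file name are illustrative):

  hdfs.cd("/user/oracle")              # illustrative HDFS directory
  hdfs.ls()
  dfs <- hdfs.attach("ontime_s")       # discovers (or creates) __ORCHMETA__ for the CSV data
  hdfs.size("ontime_s")
  hdfs.sample("ontime_s", lines = 3)   # peek at a few rows without pulling the whole file
  dat <- hdfs.get(dfs)                 # bring (small) data into an in-memory R data.frame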
©2013 Oracle – All Rights Reserved
30
Viewing Metadata File from command line with hadoop
Metadata file __ORCHMETA__ structure
©2013 Oracle – All Rights Reserved
31
Viewing Metadata from R
Metadata file contents
©2013 Oracle – All Rights Reserved
32
ORCH-required HDFS Metadata Structure
• ORCH hdfs.* functions take HDFS directories (not files) when accessing HDFS data
• Expects a file called __ORCHMETA__
– Contains metadata about the data in the part-* files
– If __ORCHMETA__ doesn’t exist, it is created during hdfs.attach() by
sampling input files and parsing rows
– Auto-generation may be time consuming if record length > 1K
• Since HDFS tail returns only 1K data, ORCH must copy the whole
file locally to sample
• Alternative: manually create __ORCHMETA__
©2013 Oracle – All Rights Reserved
33
__ORCHMETA__ structure
__ORCHMETA__ Field   Description or value
orch.kvs             TRUE (the data is key-value type)
orch.names           Column names, e.g., "speed", "dist"
orch.class           "data.frame"
orch.types           Column types, e.g., "numeric", "numeric"
orch.dim             Data dimensions (optional), e.g., 50, 2
orch.keyi            Index of column treated as key; 0 – key is null ('\t' character at start of row); -1 – key not available (no tab at start of row)
orch.rownamei        Index of column used for rownames; 0 means no rownames
• There are more fields, but these are sufficient for manually-created metadata
©2013 Oracle – All Rights Reserved
34
Working with Hadoop using ORCH framework
• ORCH supports CSV files
• Simple interaction
– Run R code in parallel on different chunks of HDFS file… think ore.rowApply
• hadoop.exec(file, mapper={…}, reducer={orch.keyvals(k,vv)})
– Run R code in parallel on partitions (by key) of HDFS file… think ore.groupApply
• hadoop.exec(file, mapper={orch.keyval(k,v)}, reducer={…})
• Full MapReduce interaction
– hadoop.exec(file, mapper={…}, reducer={…})
– hadoop.exec(file, mapper={…}, reducer={…}, config={…})
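As an illustration of the "partitions by key" pattern (think ore.groupApply), the hedged sketch below simply forwards each chunk under its key and lets the reducer do the per-group work; dfs is a placeholder HDFS identifier and any required config is omitted for brevity.

  res <- hadoop.exec(dfs,
    mapper  = function(key, val) orch.keyval(key, val),                       # pass rows through under their key
    reducer = function(key, vals) orch.keyval(key, sum(sapply(vals, NROW)))   # e.g., rows seen per key
  )
  hdfs.get(res)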
©2013 Oracle – All Rights Reserved
35
ORCH Job Configuration
R> jobconfig = new("mapred.config")
R> class(jobconfig)
[1] "mapred.config"
attr(,"package")
[1] ".GlobalEnv"

R> jobconfig
Object of class "mapred.config"
Slot "job.name":              # user-defined job name
[1] ""
Slot "map.output":            # schema of mapper output
data frame with 0 columns and 0 rows
Slot "reduce.output":         # schema of reducer output
data frame with 0 columns and 0 rows
Slot "map.valkey":            # should key be included in mapper value?
[1] FALSE
Slot "reduce.valkey":         # should key be included in reducer value?
[1] FALSE
Slot "map.tasks":             # desired number of mappers
[1] -1
Slot "reduce.tasks":          # desired number of reducers
[1] -1
Slot "map.input":             # data type of value input to the mapper: data.frame or list
[1] "vector"
Slot "min.split.size":        # desired minimum number of rows sent to a mapper
[1] -1
Slot "map.split":             # maximum chunk size desired by a mapper
[1] 1
Slot "reduce.input":          # data type of value input to the reducer
[1] "list"
Slot "reduce.split":
[1] 0
Slot "verbose":               # should diagnostic info be generated?
[1] FALSE
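A hedged sketch of tuning a job through mapred.config and passing it to hadoop.exec(); dfs, my.mapper, and my.reducer are placeholders, and setting slots with the standard S4 @ operator is an assumption.

  cfg <- new("mapred.config",
             job.name      = "delay.summary",
             map.output    = data.frame(key = '', val = 0),
             reduce.output = data.frame(key = '', val = 0))
  cfg@reduce.tasks <- 4        # assumption: slots set via standard S4 @ access
  res <- hadoop.exec(dfs, mapper = my.mapper, reducer = my.reducer, config = cfg)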
©2013 Oracle – All Rights Reserved
36
Predictive Analytics on Hadoop
©2013 Oracle – All Rights Reserved
37
ORCH Analytic Functions
orch.lm – Fits a linear model using tall-and-skinny QR (TSQR) factorization and parallel distribution. Computes the same statistical parameters as the Oracle R Enterprise ore.lm function.

orch.lmf – Fits a low-rank matrix factorization model using either the jellyfish algorithm or the Mahout alternating least squares with weighted regularization (ALS-WR) algorithm.

orch.neural – Provides a neural network to model complex, nonlinear relationships between inputs and outputs, or to find patterns in the data.

orch.nmf – Provides the main entry point to create a nonnegative matrix factorization model using the jellyfish algorithm. Can work on much larger data sets than the R NMF package, because the input does not need to fit into memory.
38
orch.lm
Motivation
• LM implementation for Hadoop
• Scalable: add more machines, linear decrease in run times
• Ability to process >1000 columns and an unrestricted number of rows
• Match the R user experience
– Functions print, summary
– Object of class "lm", "orch.lm"
©2013 Oracle – All Rights Reserved
39
orch.lm
Interface
• orch.lm( formula, dfs.data, nReducers = 1 )
– formula - an object of class "formula", a symbolic description of the model to be fitted
– dfs.data - HDFS dataset
– nReducers - the number of reducers (performance / MR tree related parameter)
– Returns an object of class "orch.lm" which is a list containing the following components:
• coefficients: a named vector of coefficients
• rank: the numeric rank of the fitted linear model
• call: the matched call
• terms: the 'terms' object used
• summary (r.squared, adj.r.squared, df, sigma, fstatistic, cov.unscaled)
• print.orch.lm( fit )
– fit an object of class "orch.lm" (returned by orch.lm() function)
• print.summary.orch.lm( fit )
– fit object returned by orch.lm() function
©2013 Oracle – All Rights Reserved
40
orch.lm
Examples
formula <- 'Petal.Width ~ I(Sepal.Length^3) + (Sepal.Width + Petal.Length)^2'
dfs.dat <- hdfs.put(iris)
fit = orch.lm(formula, dfs.dat)
print(fit)

R> print(fit)
Call:
orch.lm(formula = formula, dfs.dat = dfs.dat)

Coefficients:
             (Intercept)         I(Sepal.Length^3)               Sepal.Width
            -0.558951258              -0.001808531               0.076544835
            Petal.Length  Sepal.Width:Petal.Length
             0.374865543               0.044639138
R>
©2013 Oracle – All Rights Reserved
41
orch.lm
Example
R> summary(fit)
Call:
orch.lm(formula = formula, dfs.dat = dfs.dat)

Residuals:
       Min        Max
-0.5787561  0.5982218

Coefficients:
                             Estimate    Std. Error    t value     Pr(>|t|)
(Intercept)              -0.558951258  0.3114271138 -1.7948060 7.476772e-02
I(Sepal.Length^3)        -0.001808531  0.0003719886 -4.8617906 2.990386e-06
Sepal.Width               0.076544835  0.0936509172  0.8173421 4.150739e-01
Petal.Length              0.374865543  0.0813489249  4.6081192 8.829319e-06
Sepal.Width:Petal.Length  0.044639138  0.0244578742  1.8251438 7.003728e-02

Multiple R-squared: 0.9408,    Adjusted R-squared: 0.9392
F-statistic: 576.6 on 4 and 145 DF
R>
©2013 Oracle – All Rights Reserved
42
orch.neural
Characteristics
• General feed-forward neural network for regression
• Enables > 1000 input columns
• State-of-the-art numerical optimization engine
– robustness, accuracy, and small number of data reads
• Scalable: more machines yield a proportional decrease in run time
©2013 Oracle – All Rights Reserved
43
orch.neural
Interface
• orch.neural(dfs.data, targetSize, hiddenSize, hiddenActivation = 'bSigmoid',
  outputActivation = 'linear', maxit = 100, lowerBound = -1, upperBound = 1)
– dfs.data: input HDFS comma-separated value file
– targetSize: number of output (target) neurons; must be a positive integer
– hiddenSize: number of hidden neurons; must be a positive integer
– maxit: maximum number of iterations (nonlinear optimization steps)
– lowerBound: lower bound in weight initialization
– upperBound: upper bound in weight initialization
– hiddenActivation: hidden activation function. Possible values: atan, bipolarSigmoid, cos, gaussian, gompertz, linear, sigmoid, reciprocal, sin, square, tanh
– outputActivation: output activation function
– Returns an object of class "orch.neural"
©2013 Oracle – All Rights Reserved
44
orch.neural - Example
## XOR is true if either A or B is true, but not both:
## XOR A B
##   1 1 0
##   1 0 1
##   0 1 1
##   0 0 0
xorData <- data.frame(
  XOR = c(1, 1, 0, 0),
  A   = c(1, 0, 1, 0),
  B   = c(0, 1, 1, 0))

dfsData <- hdfs.put(xorData)

## Predict XOR on A and B inputs with two hidden neurons
fit <- orch.neural(dfs.data = dfsData, targetSize = 1, hiddenSize = 2,
                   hiddenActivation = 'bSigmoid', outputActivation = 'linear',
                   maxit = 30, lowerBound = -1, upperBound = 1)

pred <- predict(fit, newdata = dfsData)
pred
©2013 Oracle – All Rights Reserved
45
orch.neural - Results
R> print(fit)
Call:
orch.neural(dfs.data = dfsData, targetSize = 1, hiddenSize = 2,
    hiddenActivation = "bSigmoid", outputActivation = "linear",
    maxit = 30, lowerBound = -1, upperBound = 1)

Number of input nodes          2
Number of hidden nodes         2
Number of output nodes         1
Hidden activation function     bSigmoid
Output activation function     linear

Weights:
          V1
1  1.4664790
2 -1.1654687
3 -1.0649696
4  0.5173203
5  6.4870265
6  3.2183494
7 -1.6225545
8  1.7001784
9  2.3405659

R> pred <- predict(fit, newdata = dfsData)
R> hdfs.get(pred)
    pred_XOR XOR A B
1 0.96773529   1 1 0
2 0.94574037   1 0 1
3 0.09825227   0 1 1
4 0.03239104   0 0 0
©2013 Oracle – All Rights Reserved
46
orch.neural Performance
• Data set: 155,671 observations, 46 columns, missing values: yes

  # hidden neurons   Elapsed time (sec) nnet   Elapsed time (sec) orch.neural
  10                   934.176                  44.181
  20                  1861.812                  44.969
  30                  2634.434                  35.196
  40                  3674.379                  39.217
  50                  4400.551                  49.527

• Hardware spec: single BDA node; MemTotal: 49GB; CPUs: 24 (3058 MHz each)
• Observations
– nnet: invoked with linear outputs to accommodate the unscaled target, enabling a fair comparison
– orch: for data sets < 1.5 gigabytes (e.g., this data set) orch.neural automatically invokes an in-memory multi-threaded algorithm; for big data, it uses MapReduce
47
orch.neural Performance
[Chart: elapsed time (seconds) vs. number of hidden layer nodes (10–50) for nnet and orch.neural on the same data set (155,671 observations, 46 columns, missing values). nnet climbs from roughly 930 to 4,400 seconds while orch.neural stays between roughly 35 and 50 seconds. Hardware: single BDA node, 49GB memory, 24 CPUs (3058 MHz each). nnet was invoked with linear outputs to accommodate the unscaled target for a fair comparison; for data sets under 1.5 gigabytes orch.neural automatically invokes an in-memory multi-threaded algorithm, and uses MapReduce for big data.]
48
Demo Programs
©2013 Oracle – All Rights Reserved
49
Hadoop Analytics
Available demo programs using ORCH
• Bagged Clustering
– "bagging" with two-level clustering
– the mapper performs k-means clustering on a subset of the data
– the reducer combines the centroids generated by the mappers into a hierarchical cluster (single linkage) and stores the hclust object in HDFS, which the R user can then use for voting
• Item-Item Similarity: distance.R
– calculates a Euclidean-distance similarity matrix among movies based on user ratings
• K-Means
– clustering using a MapReduce version of Lloyd's algorithm (requires ORE)
• PCA
– computes the essential statistics needed for the PCA calculation of a data set by creating a tree of multiple MapReduce jobs, reducing the number of records to process at each stage
– the final MapReduce job merges its inputs to generate the statistics for the whole data set
• Pearson Correlation
– calculates Pearson's correlation among movies based on user ratings
• Logistic Regression by Gradient Descent
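To make the bagged-clustering recipe above concrete, here is a hedged sketch of how the mapper and reducer could be written with ORCH; dfs.data, the number of centers, and orch.unpack() (assumed counterpart of orch.pack()) are illustrative, and this is not the packaged demo code.

  res <- hadoop.exec(dfs.data,
    mapper = function(k, v) {
      km <- kmeans(as.data.frame(v), centers = 3)        # k-means on this (numeric) subset of data
      orch.keyval(1, orch.pack(km$centers))              # single key -> all centroids reach one reducer
    },
    reducer = function(k, vv) {
      centers <- do.call(rbind, lapply(vv, orch.unpack)) # assumption: orch.unpack restores packed objects
      hc <- hclust(dist(centers), method = "single")     # single-linkage hierarchical clustering
      orch.keyval(k, orch.pack(hc))                      # hclust object stored in HDFS for later voting
    }
  )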
©2013 Oracle – All Rights Reserved
50
Example : One Dimension Logistic Regression by Gradient Descent
input <- hdfs.put(cars)

mapred.logreg <- function(input, iterations = 3, alpha = 0.1)
{
  plane <- 0                                # initialize separating plane
  g <- function(z) 1/(1 + exp(-z))          # define logistic function
  mapf <- data.frame(val1 = 1, val2 = 1)    # specify form of map output

  for (i in 1:iterations) {
    gradient <- hadoop.run(
      input,
      mapper = function(k, v) {
        # compute the gradient of the loss function
        orch.keyval(1, v$speed * v$dist * g(-v$speed * (plane * v$dist)))
      },
      reducer = function(k, vv) {
        # since there is only one key, a single reducer sums the contributions
        vv <- sapply(vv, unlist)
        orch.keyval(k, sum(vv))
      },
      export = orch.export(plane, g),       # plane and g available to mapper and reducer functions
      config = new("mapred.config",
                   job.name   = "logistic.regression",
                   map.output = mapf))
    gradient <- hdfs.get(gradient)
    plane    <- plane + alpha * gradient$val2   # update the separating plane
    orch.dlogv2(gradient, plane)
  }
  plane                                     # return plane when iterations complete
}

# Invocation
plane <- mapred.logreg(input)
print(plane)
©2013 Oracle – All Rights Reserved
51
ORCHhive
©2013 Oracle – All Rights Reserved
52
What is Hive?
• SQL-like abstraction on Hadoop
• Becoming the de facto standard for SQL-based apps on Hadoop
• Converts SQL queries to MapReduce jobs to be run on Hadoop
• Provides a simple query language (HQL) based on SQL
• Enables non-Java users to leverage Hadoop via SQL-like interfaces
©2013 Oracle – All Rights Reserved
53
Motivation for ORCHhive
• “Big data” scalability and performance for R users on Hadoop
• Enable R users to clean, explore, and prepare HIVE data transparently
• Ready data for analytic techniques using ORCH MapReduce
framework
• ORE provides transparent access to database tables and views from R
based on SQL mapping
• Since Hive is SQL-based, it is a natural extension to provide ORE-type transparency on top of Hive HQL, giving R users access to HDFS data
©2013 Oracle – All Rights Reserved
54
Supported R Functions
• ore.frame methods
– show, attach, [, $, $<-, [[, [[<-, head, tail, length, nrow, ncol, NROW, NCOL, dim, names, names<-, colnames, colnames<-, as.list, unlist, summary, rbind, cbind, data.frame, as.data.frame, as.env, eval, +, -, *, ^, %%, %/%, /, Compare, Logic, !, xor, is.na, is.finite, is.infinite, is.nan, abs, sign, sqrt, ceiling, floor, trunc, log, log10, log2, log1p, logb, acos, asin, atan, exp, expm1, cos, sin, tan, round, Summary, rowSums, colSums, rowMeans, colMeans, unique, by, merge
• ore.vector methods
– show, length, c, is.vector, as.vector, as.character, as.numeric, as.integer, as.logical, "[", "[<-", I, Compare, ore.recode, is.na, "%in%", unique, sort, table, paste, tapply, by, head, tail
• ore.logical methods
– <, >, ==, <=, >=, !, xor, ifelse, and, or
• ore.number methods
– +, -, *, ^, %%, %/%, /, is.finite, is.infinite, is.nan, abs, sign, sqrt, ceiling, floor, trunc, log, log10, log2, log1p, logb, acos, asin, atan, exp, expm1, cos, sin, tan, zapsmall, round, Summary, summary, mean
• ore.character methods
– nchar, tolower, toupper, casefold, gsub, substr, substring
• Aggregate functions
– OREStats: fivenum, aggregate, quantile, sd, var (only for vectors), median, IQR
• Storage methods
– ore.create, ore.drop, ore.push, ore.pull, ore.get
• Methods
– is.ore.frame, is.ore.vector, is.ore.logical, is.ore.integer, is.ore.numeric, is.ore.character, is.ore, as.ore.frame, as.ore.vector, as.ore.logical, as.ore.integer, as.ore.numeric, as.ore.character, as.ore
©2013 Oracle – All Rights Reserved
55
Example using ORCHhive
ore.connect(type="HIVE")
ore.attach()

# create a Hive table by pushing the numeric
# columns of the iris data set
IRIS_TABLE <- ore.push(iris[1:4])

# Create bins based on Petal Length
IRIS_TABLE$PetalBins =
  ifelse(IRIS_TABLE$Petal.Length < 2.0, "SMALL PETALS",
+ ifelse(IRIS_TABLE$Petal.Length < 4.0, "MEDIUM PETALS",
+ ifelse(IRIS_TABLE$Petal.Length < 6.0,
+        "MEDIUM LARGE PETALS", "LARGE PETALS")))

# PetalBins is now a derived column of the HIVE object
> names(IRIS_TABLE)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length"
[4] "Petal.Width"  "PetalBins"

# Based on the bins, generate summary statistics for each group
aggregate(IRIS_TABLE$Petal.Length,
          by = list(PetalBins = IRIS_TABLE$PetalBins),
          FUN = summary)
1         LARGE PETALS 6 6.025000 6.200000 6.354545 6.612500 6.9 0
2  MEDIUM LARGE PETALS 4 4.418750 4.820000 4.888462 5.275000 5.9 0
3        MEDIUM PETALS 3 3.262500 3.550000 3.581818 3.808333 3.9 0
4         SMALL PETALS 1 1.311538 1.407692 1.462000 1.507143 1.9 0
Warning message:
ORE object has no unique key - using random order
©2013 Oracle – All Rights Reserved
56
Comparison of RHIPE and ORCH
©2013 Oracle – All Rights Reserved
57
Comparison of RHIPE and ORCH

Architecture – external dependencies
  RHIPE: Requires Google Protocol Buffers to be deployed on every Hadoop node.
  ORCH: No dependency.

Architecture – database support
  RHIPE: No database support; can work only on HDFS-resident data.
  ORCH: 1. In addition to HDFS, can source data from an Oracle Database and place results back in an Oracle Database. 2. Data written back to Oracle Database can be processed further using the Oracle R Enterprise framework.

Ability to test and debug the same MapReduce R code on a local system with small data – before execution on the Hadoop cluster using full HDFS data
  RHIPE: No support. Requires significant rewrite of R code to make it "Hadoop" compatible.
  ORCH: 1. No rewrite required. 2. Supports local execution of MapReduce R functions for debugging, providing detailed feedback on the MapReduce execution. 3. Local execution occurs on a sample of the HDFS data. 4. Enables execution of MapReduce R code when disconnected from the Hadoop cluster.

Convenience of mapper and reducer specification
  RHIPE: Mappers and reducers are specified as R expression objects instead of functions, and require hardcoded names, e.g., map.values and reduce.values, in the mapper and reducer.
  ORCH: 1. Mappers and reducers are specified as functions. 2. Allows user-specified parameter names.

Use cases supported
  RHIPE: Restricted to using Hadoop for use cases that are determined a priori and supported by IT through explicit job configuration on Hadoop.
  ORCH: Allows R users to INTERACTIVELY use Hadoop for problems they see fit WITHOUT requiring any a priori job management setup. Expands the use cases supported as a superset of those possible with other solutions.

©2013 Oracle – All Rights Reserved
58
Comparison of RHIPE and ORCH (continued)

Support for HDFS data discovery
  RHIPE: Limited. rhdel <-> hdfs.rm, rhls <-> hdfs.ls, rhget <-> hdfs.download, rhput <-> hdfs.upload, rhcp <-> hdfs.cp, rhwrite <-> hdfs.put, rhread <-> hdfs.get.
  ORCH: 1. Full HDFS support: hdfs.rm, hdfs.ls, hdfs.download, hdfs.upload, hdfs.cp, hdfs.put, hdfs.get, plus hdfs.attach (metadata discovery), hdfs.sample (data sample), hdfs.cd (unix-like feel from within R), and hdfs.push/pull (DB connectivity). 2. Supports data sampling to explore data too big for R memory.

Encoding of metadata corresponding to data in an HDFS file
  RHIPE: Requires metadata encoding along with the data. Existing HDFS files MUST first be augmented with row-wise metadata before they can be processed using RHIPE. Encoding metadata along with the data expands the size of HDFS files several fold and affects performance.
  ORCH: Treats metadata and data separately. Metadata is derived by sampling the HDFS file and is created and maintained separately from the data. This has a huge positive impact on performance.

Ability to exploit inherent structure in data resident in HDFS files for better performance
  RHIPE: No support. Does not allow use of existing data in HDFS as direct input to your MapReduce job. Passes data as a list to the mapper and reducer, which must then be converted to a data.frame, resulting in performance and memory issues. Requires data preprocessing in which R objects are created and data is serialized as data.frames before being written to HDFS; otherwise, such conversion must be done in the main MapReduce R scripts.
  ORCH: Handles HDFS data.frames implicitly from HDFS. Supports data.frames natively as one possible input type to the mapper and reducer functions, resulting in improved performance. Use hdfs.attach() for true metadata discovery.
©2013 Oracle – All Rights Reserved
59
Comparison of RHIPE and ORCH (continued)

API ease of use (see next slide for examples)
  RHIPE: Hadoop internals get in the way of program intent.
  ORCH: Focus remains on the problem being solved; R code is streamlined and clear.

Support for Hive
  RHIPE: No support.
  ORCH: Works directly with HIVE tables. R users can clean, explore, and prepare data in HDFS using ORCH's HIVE transparency layer, making the data more conducive to applying analytical techniques using ORCH's MapReduce framework.

Proprietary Hadoop-based predictive techniques
  RHIPE: No support.
  ORCH: Matrix factorization used for recommendations; linear regression and neural networks used for predictions on high-velocity data.

Support for passing R objects from the R session to MapReduce jobs and accessing them in MR functions
  RHIPE: No support.
  ORCH: Users can pass any R object from the R environment into the MapReduce job for use by the mapper and reducer functions; use orch.export().

Limitations on <key, value> data sizes on read
  RHIPE: Imposes a limit of 256MB on the <key, value> data read into a mapper or reducer.
  ORCH: No limit imposed; can be as large as allowed by R memory.
©2013 Oracle – All Rights Reserved
60
Code Example - RHIPE and ORCH
RHIPE:
  library(Rhipe)

  map <- expression({
    words_vector <- unlist(strsplit(unlist(map.values), " "))
    lapply(words_vector, function(i) {rhcollect(i, 1)})
  })

  reduce <- expression(
    pre    = {total = 0},
    reduce = {total <- sum(total, unlist(reduce.values))},
    post   = {rhcollect(reduce.key, total)}
  )

  mapred <- list(rhipe_map_buff_size = 20,
                 mapred.job.tracker = 'local')

  job_object <- rhmr(map     = map,
                     reduce  = reduce,
                     inout   = c("text", "sequence"),
                     ifolder = "/sample_1",
                     ofolder = "/output_02",
                     mapred  = mapred,
                     jobname = "word_count")
  rhex(job_object)

ORCH:
  library(ORCH)

  x <- hdfs.put("/etc/passwd")

  xn <- hadoop.run(x,
    mapper = function(key, val) {
      words.in.line <- length(strsplit(val, ' ')[[1]])
      orch.keyval(NULL, words.in.line)
    },
    reducer = function(key, vals) {
      cnt <- 0
      for (val in vals) {
        cnt <- cnt + val
      }
      orch.keyvals(key, cnt)
    }
  )
  hdfs.get(xn)
©2013 Oracle – All Rights Reserved
61
Using ORCH and ORE Together
©2013 Oracle – All Rights Reserved
62
ORCH and ORE Interaction
• If ORE is installed on the R client with ORCH
– Copy ore.frames (data tables) to HDFS
– Perform ORE pre-processing on data fed to MR jobs
– Perform ORE post-processing on results of MR jobs once data is moved from HDFS to Oracle Database
• If ORE is installed on (BDA) task nodes
– Include ORE calls in mapper and reducer functions (care should be taken not
to overwhelm a single database server with too many mapper tasks)
• If ORCH is installed on Oracle Database server
– Use embedded R execution to invoke ORCH functionality
– Schedule database jobs (DBMS_SCHEDULER) to automatically execute
scripts containing ORCH function calls
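A hedged end-to-end sketch of the first pattern (ORE on the client plus ORCH), reusing the ONTIME_S table from earlier slides; my.mapper, my.reducer, and the assumption that hdfs.pull() returns an ore.frame are illustrative.

  # ORE pre-processing: filter a database table, then stage it in HDFS
  ontime <- ONTIME_S[ONTIME_S$YEAR == 2000, ]            # ore.frame, still in the database
  dfs    <- hdfs.push(ontime, key = 'DEST', dfs.name = 'ontime_2000')

  # ORCH MapReduce job on the HDFS data (mapper/reducer as in earlier examples)
  res    <- hadoop.run(dfs, mapper = my.mapper, reducer = my.reducer)

  # ORE post-processing: land the results back in Oracle Database and keep analyzing
  res.db <- hdfs.pull(res)                               # assumption: returns an ore.frame for the new table
  summary(res.db)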
©2013 Oracle – All Rights Reserved
63
Summary
• Oracle R Connector for Hadoop allows R users to leverage a
Hadoop Cluster with HDFS and MapReduce from R
• Mapper and reducer functions written in R
• ORCH HDFS interface works transparently with database data, file
data, and R data.frames
• MapReduce jobs can be submitted for non-cluster (local) execution,
or execution in the Hadoop Cluster
• Advanced analytics algorithms packaged with ORCH
• Manipulate HIVE data transparently from R
©2013 Oracle – All Rights Reserved
64