Ricardo: Integrating R and Hadoop

IBM Almaden Research Center

Sudipto Das¹ {sudipto@cs.ucsb.edu}, Yannis Sismanis², Kevin S. Beyer²,
Rainer Gemulla², Peter J. Haas², John McPherson²
¹ UC Santa Barbara
² IBM Almaden Research Center

Presented by: Luyuang Zhang, Yuguan Li

© 2010 IBM Corporation
Outline
• Motivation & Background
• Architecture & Components
• Trading with Ricardo
  – Simple Trading
  – Complex Trading
• Evaluation
• Conclusion
Deep Analytics on Big Data
• Enterprises collect huge amounts of data
  – Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, …
  – User interaction data and history
  – Click and transaction logs
• Deep analysis is critical for a competitive edge
  – Understanding/modeling data
  – Recommendations to users
  – Ad placement
• Challenge: enable deep analysis and understanding over massive data volumes
  – Exploiting data to its full potential
Motivating Examples
• Data exploration / model evaluation / outlier detection
• Personalized recommendations
  – For each individual customer/product
  – Many applications: Netflix, Amazon, eBay, iTunes, …
• Difficulty: discerning particular customer preferences
  – Sampling loses the competitive advantage
• Application scenario: movie recommendations
  – Millions of customers
  – Hundreds of thousands of movies
  – Billions of movie ratings
Analyst’s Workflow
• Data Exploration
  – Deals with the raw data
• Data Modeling
  – Deals with the processed data
  – Uses a chosen method to build a model that fits the data
• Model Evaluation
  – Deals with the built model
  – Uses data to test the accuracy of the model
Big Data and Deep Analytics – The Gap
• R, SPSS, SAS – a statistician’s toolbox
  – Rich statistical, modeling, and visualization functionality
  – Thousands of sophisticated add-on packages developed by hundreds of statistical experts and available through CRAN
  – Operate on small amounts of data, entirely in memory, on a single server
  – Extensions for data handling are cumbersome
• Hadoop – scalable data management systems
  – Scalable, fault-tolerant, elastic, …
  – “Magnetic”: easy to store data
  – Limited deep analytics: mostly descriptive analytics
Filling the Gap: Existing Approaches
• Reducing data size by sampling
  – Approximations might lose the competitive advantage
  – Loses important features in the long tail of data distributions [Cohen et al., VLDB 2009]
• Scaling out R
  – Efforts from the statistics community toward parallel and distributed variants [SNOW, Rmpi]
  – Main-memory based in most cases
  – Re-implements DBMS and distributed-processing functionality
• Deep analysis within a DBMS
  – Ports statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
  – Not sustainable – misses out on R’s community development and rich libraries
Ricardo: Bridging the Gap
• Named after David Ricardo, a famous 19th-century economist
  – “Comparative Advantage”
• Deep analytics decomposable into a “large part” and a “small part” [Chu et al., NIPS ’06]
  – Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
  – Recommender systems / latent factorization [our paper]
  – A key requirement for Ricardo: the amount of data communicated between the two systems must be sufficiently small
• Large part: joins, group-bys, distributive aggregations
  – Hadoop + Jaql: excellent scalability for large-scale data management
• Small part: matrix/vector operations
  – R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.
• Ricardo establishes a “trade” between R and Hadoop/Jaql
Ricardo: Bridging the Gap – The Trade
• R sends aggregation-processing queries (written in Jaql) to Hadoop
• Hadoop sends aggregated data to R for advanced statistical processing
R in a Nutshell
[Figure: scatter plot of mean Rating (3.5–3.9) vs. Year of Release (1950–2010)]
R in a Nutshell
[Figure: the same mean Rating (3.5–3.9) vs. Year of Release (1950–2010) plot]
• R supports rich statistical functionality
Jaql in a Nutshell
• JSON view of the data [figure]
• Jaql example [figure]
• Scalable descriptive analysis using Hadoop
• Jaql: a representative declarative interface
Ricardo: The Trading Architecture
Complexity of trade between R and Hadoop:
– Simple Trading: Data Exploration
– Complex Trading: Data Modeling
Simple Trading: Exploratory Analytics
• Gain insights about the data
• Example: top-k outliers for a model
  – Identify the data items on which the model performed most poorly
• Helpful for improving the accuracy of the model
• The trade:
  – Build complex statistical models using rich R functionality
  – Parallelize processing over the entire data using Hadoop/Jaql
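As a concrete illustration of this trade, the sketch below scores every record by its squared residual under a model and keeps the k worst. This is a minimal plain-Python sketch, not Ricardo code; the function and data names are hypothetical, and in Ricardo the scoring scan would run as a parallel Jaql query while only the k outliers travel back to R.

```python
import heapq

def top_k_outliers(records, predict, k):
    """Score each record by its squared residual under the model
    and return the k worst as (error, record), largest error first."""
    scored = [((r["rating"] - predict(r)) ** 2, idx)
              for idx, r in enumerate(records)]
    return [(err, records[idx]) for err, idx in heapq.nlargest(k, scored)]

# Toy data and a trivial model that always predicts a rating of 3.5
ratings = [
    {"customer": 1, "movie": "A", "rating": 5.0},
    {"customer": 2, "movie": "B", "rating": 3.4},
    {"customer": 3, "movie": "C", "rating": 1.0},
]
outliers = top_k_outliers(ratings, lambda r: 3.5, k=2)
```

Here `predict` stands in for an arbitrary R model; the parallelizable part is the scan, and the small part (the k outliers) is all that crosses the system boundary.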
Complex Trading: Latent Factors
• SVD-like matrix factorization: a factor pi per customer i, a factor qj per movie j
• Minimize the squared error: e = Σi,j (pi qj − rij)²
• The trade:
  – Use complex statistical models in R
  – Parallelize aggregate computations using Hadoop/Jaql
Complex Trading: Latent Factors
• However, in the real world: a vector of factors for each customer and item!
Latent Factor Models with Ricardo
• Goal
  – Minimize the squared error: e = Σi,j (pi qj − rij)²
  – Numerical methods needed (large, sparse matrix)
• Pseudocode
  1. Start with an initial guess of the parameters pi and qj.
  2. Compute the error and gradient (data intensive, but parallelizable!)
     – E.g., de/dpi = Σj 2 qj (pi qj − rij)
  3. Update the parameters.
     – R implements many different optimization algorithms
  4. Repeat steps 2 and 3 until convergence.
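Step 2 is the part Ricardo ships to Hadoop/Jaql. As a minimal illustration of what that step computes (plain Python standing in for Jaql, with the ratings as a sparse dict of observed entries only):

```python
def error_and_gradient(ratings, p, q):
    """Compute e = sum over observed (i,j) of (p_i*q_j - r_ij)^2
    and its gradients, touching only the observed (sparse) ratings."""
    e = 0.0
    dp = {i: 0.0 for i in p}   # de/dp_i per customer
    dq = {j: 0.0 for j in q}   # de/dq_j per movie
    for (i, j), r in ratings.items():
        diff = p[i] * q[j] - r
        e += diff ** 2
        dp[i] += 2.0 * q[j] * diff   # de/dp_i = sum_j 2 q_j (p_i q_j - r_ij)
        dq[j] += 2.0 * p[i] * diff   # de/dq_j = sum_i 2 p_i (p_i q_j - r_ij)
    return e, dp, dq

# Toy instance: 2 customers, 2 movies, 3 observed ratings
ratings = {(0, 0): 4.0, (0, 1): 2.0, (1, 0): 5.0}
p = {0: 1.0, 1: 1.0}   # customer factors
q = {0: 1.0, 1: 1.0}   # movie factors
e, dp, dq = error_and_gradient(ratings, p, q)
```

The loop body is a sum over independent terms, which is exactly why it parallelizes as a distributive aggregation over the cluster.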
The R Component
• Parameters
  – e: squared error
  – de: gradients
  – pq: concatenation of the latent factors for users and items
• R code
  optim( c(p,q), fe, fde, method="L-BFGS-B" )
  (fe and fde are the functions that obtain e and de)
• Goal
  – Keep updating pq until convergence
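optim is the driver: it repeatedly calls fe and fde and updates the parameters. The self-contained Python sketch below shows that driver role on a toy factorization, using plain gradient descent in place of L-BFGS-B (all names here are illustrative, not Ricardo code; in Ricardo each call to fe/fde would fire a Jaql query against the cluster):

```python
def fe(ratings, p, q):
    # squared error over the observed ratings (Ricardo: a Jaql aggregation)
    return sum((p[i] * q[j] - r) ** 2 for (i, j), r in ratings.items())

def fde(ratings, p, q):
    # gradients de/dp_i and de/dq_j (Ricardo: the same aggregation query)
    dp, dq = [0.0] * len(p), [0.0] * len(q)
    for (i, j), r in ratings.items():
        diff = p[i] * q[j] - r
        dp[i] += 2.0 * q[j] * diff
        dq[j] += 2.0 * p[i] * diff
    return dp, dq

def fit(ratings, p, q, step=0.01, iters=200):
    # the optim role: keep updating the parameters until (near) convergence
    for _ in range(iters):
        dp, dq = fde(ratings, p, q)
        p = [pi - step * g for pi, g in zip(p, dp)]
        q = [qj - step * g for qj, g in zip(q, dq)]
    return p, q

ratings = {(0, 0): 4.0, (0, 1): 2.0, (1, 0): 5.0, (1, 1): 2.5}
p0, q0 = [1.0, 1.0], [1.0, 1.0]
p, q = fit(ratings, p0, q0)
```

Only the small aggregates (e and de) flow between the optimizer and the data-intensive functions, which is what makes the trade cheap.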
The Hadoop and Jaql Component
• Dataset: the movie ratings (i, j, rij), plus the customer and movie parameters
• Goal: compute the squared error and gradients over the entire dataset
The Hadoop and Jaql Component
• Calculate the squared errors
• Calculate the gradients
Computing the Model
e = Σi,j (pi qj − rij)²
• A 3-way join matches rij, pi, and qj, then aggregates
  – Movie Ratings: (i, j, rij)
  – Customer Parameters: (i, pi)
  – Movie Parameters: (j, qj)
• The gradients are computed similarly
Aggregation In Jaql/Hadoop
res = jaqlTable(channel, "
ratings
 hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q
})
 hashJoin( fn(r) r.i, custPars,
fn(c) c.i,
fn(r, c) { r.*, c.p
})
 transform { $.*, diff: $.rating - $.p*$.q }
 expand [ { value: pow($.diff, 2.0) },
{ $.i, value: -2.0 * $.diff * $.p },
{ $.j, value: -2.0 * $.diff * $.q } ]
Result in R
 group by g={ $.i, $.j }
i
j
gradient
into { g.*, gradient: sum($[*].value) }
---- ---- -------")
null null 325235
1
2
…
null
null
…
22
Ricardo: Integrating R and Hadoop
Sudipto Das {sudipto@cs.ucsb.edu}
null 21
null 357
1
2
9
64
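The query's join-then-group-by structure can be mirrored in a few lines of plain Python (a sketch for intuition, not Ricardo code): each rating emits one squared-error term keyed (None, None), one p-gradient term keyed (i, None), and one q-gradient term keyed (None, j), and the group-by sums per key.

```python
from collections import defaultdict

def aggregate(ratings, cust_pars, movie_pars):
    """Mimic the Jaql query: accumulate the total squared error under
    (None, None), de/dp_i under (i, None), and de/dq_j under (None, j)."""
    acc = defaultdict(float)
    for (i, j), r in ratings.items():
        diff = r - cust_pars[i] * movie_pars[j]      # diff = rating - p*q
        acc[(None, None)] += diff ** 2               # squared-error term
        acc[(i, None)] += -2.0 * diff * movie_pars[j]  # de/dp_i term
        acc[(None, j)] += -2.0 * diff * cust_pars[i]   # de/dq_j term
    return dict(acc)

ratings = {(1, 1): 4.0, (1, 2): 2.0, (2, 1): 5.0}
p = {1: 1.0, 2: 1.0}   # customer factors
q = {1: 1.0, 2: 1.0}   # movie factors
res = aggregate(ratings, p, q)
```

Because every key's value is a plain sum, the whole computation is a distributive aggregation and maps directly onto MapReduce.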
Integrating the Components
Remember: we would be running optim( c(p,q), fe, fde, method="L-BFGS-B" ) in the R process.
Experimental Evaluation

  Number of Rating Tuples   Data Size in GB
  500 Million                104.33
  1 Billion                  208.68
  3 Billion                  625.99
  5 Billion                 1043.23

• 50 nodes on EC2
• Each node: 8 cores, 7 GB memory, 320 GB disk
• Total: 400 cores, 320 GB memory, 70 TB disk space
Leveraging Hadoop’s Scalability
[Figure: running time in seconds (0–3500) vs. number of ratings in billions (0–5), comparing hand-tuned Hadoop against Jaql]
Leveraging R’s Rich Functionality
[Figure: root mean square error (0.9–1.2) vs. number of iterations (0–25) for Conjugate Gradient and L-BFGS]
– optim( c(p,q), fe, fde, method="CG" )
– optim( c(p,q), fe, fde, method="L-BFGS-B" )
Conclusion
• Scaled latent factor models to terabytes of data
• Provided a bridge: other algorithms with summation form can be mapped and scaled
  – Many algorithms have summation form
  – Decompose into a “large part” and a “small part”
  – [Chu et al., NIPS ’06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural networks, PCA, ICA, EM, SVM
• Future & Current Work
  – Tighter language integration
  – More algorithms
  – Performance tuning
Questions?
Comments?