IBM Almaden Research Center
Ricardo: Integrating R and Hadoop
Sudipto Das (1), Yannis Sismanis (2), Kevin S. Beyer (2), Rainer Gemulla (2), Peter J. Haas (2), John McPherson (2)
(1) UC Santa Barbara, (2) IBM Almaden Research Center
Presented by: Luyuang Zhang, Yuguan Li
© 2010 IBM Corporation

Outline
• Motivation & Background
• Architecture & Components
• Trading with Ricardo
  – Simple Trading
  – Complex Trading
• Evaluation
• Conclusion

Ricardo: Integrating R and Hadoop Sudipto Das {sudipto@cs.ucsb.edu} © 2010 IBM Corporation

Deep Analytics on Big Data
• Enterprises collect huge amounts of data
  – Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, …
  – User interaction data and history
  – Click and transaction logs
• Deep analysis is critical for a competitive edge
  – Understanding/modeling data
  – Recommendations to users
  – Ad placement
• Challenge: enable deep analysis and understanding over massive data volumes
  – Exploiting data to its full potential

Motivating Examples
• Data exploration / model evaluation / outlier detection
• Personalized recommendations
  – For each individual customer/product
  – Many applications to Netflix, Amazon, eBay, iTunes, …
• Difficulty: discern particular customer preferences
  – Sampling loses competitive advantage
• Application scenario: movie recommendations
  – Millions of customers
  – Hundreds of thousands of movies
  – Billions of movie ratings

Analyst’s Workflow
• Data Exploration
  – Deal with raw data
• Data Modeling
  – Deal with processed data
  – Use the assigned method to build a model that fits the data
• Model Evaluation
  – Deal with the built model
  – Use data to test the accuracy of the model

Big Data and Deep Analytics – The Gap
• R, SPSS,
SAS – a statistician’s toolbox
  – Rich statistical, modeling, and visualization functionality
  – Thousands of sophisticated add-on packages developed by hundreds of statistical experts and available through CRAN
  – Operate on small amounts of data, entirely in memory, on a single server
  – Extensions for data handling are cumbersome
• Hadoop – scalable data management systems
  – Scalable, fault-tolerant, elastic, …
  – “Magnetic”: easy to store data
  – Limited deep analytics: mostly descriptive analytics

Filling the Gap: Existing Approaches
• Reducing data size by sampling
  – Approximations might result in losing competitive advantage
  – Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009]
• Scaling out R
  – Efforts from the statistics community toward parallel and distributed variants [SNOW, Rmpi]
  – Main-memory based in most cases
  – Re-implementing DBMS and distributed processing functionality
• Deep analysis within a DBMS
  – Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
  – Not sustainable: misses out on R’s community development and rich libraries

Ricardo: Bridging the Gap
• David Ricardo, famous 19th-century economist
  – “Comparative Advantage”
• Deep analytics is decomposable into a “large part” and a “small part” [Chu et al., NIPS ‘06]
  – Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
  – Recommender systems / latent factorization [our paper]
  – A key requirement for Ricardo is that the amount of data that must be communicated between both systems be sufficiently small
• Large part includes joins, group bys, distributive aggregations
  – Hadoop + Jaql: excellent scalability to large-scale data management
• Small part includes matrix/vector operations
  – R: excellent support for numerically stable
matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.
• Ricardo: establishes “trade” between R and Hadoop/Jaql

Ricardo: Bridging the Gap
• Trade
  – R sends aggregation-processing queries (written in Jaql) to Hadoop
  – Hadoop sends aggregated data to R for advanced statistical processing

R in a Nutshell
[Figure: Rating vs. Year of Release, 1950 to 2010]
• R supports rich statistical functionality

Jaql in a Nutshell
• JSON view of the data: [figure]
• Jaql example: [figure]
• Scalable descriptive analysis using Hadoop
• Jaql: a representative declarative interface

Ricardo: The Trading Architecture
• Complexity of trade between R and Hadoop
  – Simple Trading: data exploration
  – Complex Trading: data modeling

Simple Trading: Exploratory Analytics
• Gain insights about data
• Example: top-k outliers for a model
  – Identify data items on which the model performed most poorly
  – Helpful for improving the accuracy of the model
• The trade:
  – Use complex statistical models using rich R functionality
  – Parallelize processing over the entire data using Hadoop/Jaql
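
The top-k outlier example from the Simple Trading slide can be sketched on a single machine. This is an illustrative Python sketch, not Ricardo's R/Jaql code: the toy `ratings` data, the factor tables `p` and `q`, and the helpers `predict` and `top_k_outliers` are all hypothetical names introduced here. The per-record scoring stands in for the work that Hadoop/Jaql would parallelize, and the fixed factors stand in for a model fitted in R.

```python
import heapq

# Hypothetical toy data: (customer i, movie j, observed rating r_ij).
ratings = [
    (1, 1, 4.0),
    (1, 2, 1.0),
    (2, 1, 5.0),
    (2, 2, 3.0),
    (3, 1, 2.5),
]

# Hypothetical fitted model: one latent factor per customer and per movie.
p = {1: 1.0, 2: 2.0, 3: 1.5}
q = {1: 2.0, 2: 1.5}


def predict(i, j):
    """Predicted rating p_i * q_j for customer i and movie j."""
    return p[i] * q[j]


def top_k_outliers(records, k):
    """Return the k records with the largest squared prediction error.

    Each record is scored independently of the others, which is the part
    a system such as Hadoop/Jaql could run in parallel over the full data.
    """
    scored = [((predict(i, j) - r) ** 2, i, j, r) for i, j, r in records]
    return heapq.nlargest(k, scored)


outliers = top_k_outliers(ratings, 2)
for err, i, j, r in outliers:
    print(f"customer={i} movie={j} rating={r} squared_error={err:.2f}")
```

Only the k worst-fitting records need to travel back to the analyst, which keeps the data exchanged between the two systems small.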

Complex Trading: Latent Factors
• SVD-like matrix factorization
• Minimize squared error: $\sum_{i,j} (p_i q_j - r_{ij})^2$
[Figure: matrix factorization with factors p and q]
• The trade:
  – Use complex statistical models in R
  – Parallelize aggregate computations using Hadoop/Jaql

Complex Trading: Latent Factors
• However, in the real world: a vector of factors for each customer and item!

Latent Factor Models with Ricardo
• Goal
  – Minimize squared error: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
  – Numerical methods needed (large, sparse matrix)
• Pseudocode
  1. Start with an initial guess of parameters $p_i$ and $q_j$.
  2. Compute error and gradient (data intensive, but parallelizable!)
     – E.g., $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$
  3. Update parameters.
     – R implements many different optimization algorithms
  4. Repeat steps 2 and 3 until convergence.

The R Component
• Parameters
  – e: squared error
  – de: gradients
  – pq: concatenation of the latent factors for users and items
• R code
    optim( c(p,q), fe, fde, method="L-BFGS-B" )
• Goal
  – Keeps updating pq until it reaches convergence

The Hadoop and Jaql Component
• Dataset
• Goal
  – Calculate the squared errors
  – Calculate the gradients

Computing the Model
• $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
• 3-way join to match $r_{ij}$, $p_i$, and $q_j$, then aggregate
  – Customer Parameters: (i, $p_i$)
  – Movie Parameters: (j, $q_j$)
  – Movie Ratings: (i, j, $r_{ij}$)
• Similarly compute the gradients
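
The three-way join and aggregation from the Computing the Model slide can be sketched in plain Python. This is an illustrative single-machine sketch, not the Jaql query that Ricardo runs: the function `error_and_gradients` and the toy tables are hypothetical, and dictionary lookups keyed by i and j stand in for the hash joins.

```python
from collections import defaultdict

# Hypothetical toy tables mirroring the three inputs of the join.
cust_pars = {1: 1.0, 2: 2.0}                       # customer i -> p_i
movie_pars = {1: 2.0, 2: 1.0}                      # movie j -> q_j
ratings = [(1, 1, 3.0), (1, 2, 1.0), (2, 1, 5.0)]  # (i, j, r_ij)


def error_and_gradients(ratings, cust_pars, movie_pars):
    """Join each rating with p_i and q_j, then aggregate.

    Returns the squared error e = sum_ij (p_i*q_j - r_ij)^2 together with
    de/dp_i (grouped by customer) and de/dq_j (grouped by movie).
    """
    e = 0.0
    de_dp = defaultdict(float)
    de_dq = defaultdict(float)
    for i, j, r in ratings:
        p_i = cust_pars[i]      # join on customer key i
        q_j = movie_pars[j]     # join on movie key j
        resid = p_i * q_j - r   # p_i*q_j - r_ij
        e += resid ** 2
        de_dp[i] += 2.0 * q_j * resid  # term of de/dp_i
        de_dq[j] += 2.0 * p_i * resid  # term of de/dq_j
    return e, dict(de_dp), dict(de_dq)


e, de_dp, de_dq = error_and_gradients(ratings, cust_pars, movie_pars)
print("e =", e)
print("de/dp =", de_dp)
print("de/dq =", de_dq)
```

A function of this shape would play a role analogous to `fe` and `fde` in the `optim` call: the ratings stay on the data side, and only the error and the gradient vectors are handed to the optimizer.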

Aggregation in Jaql/Hadoop

    # Gradient terms: de/dp_i sums -2 * diff * q over j; de/dq_j sums -2 * diff * p over i.
    res = jaqlTable(channel, "
      ratings
      -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
      -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
      -> transform { $.*, diff: $.rating - $.p*$.q }
      -> expand [ { value: pow($.diff, 2.0) },
                  { $.i, value: -2.0 * $.diff * $.q },
                  { $.j, value: -2.0 * $.diff * $.p } ]
      -> group by g={ $.i, $.j }
         into { g.*, gradient: sum($[*].value) }
    ")

Result in R:

    i     j     gradient
    ----  ----  --------
    null  null  325235
    1     null  21
    2     null  357
    …
    null  1     9
    null  2     64
    …

Integrating the Components
• Remember: we would be running
    optim( c(p,q), fe, fde, method="L-BFGS-B" )
  in the R process.

Experimental Evaluation

    Number of Rating Tuples   Data Size in GB
    500 Million               104.33
    1 Billion                 208.68
    3 Billion                 625.99
    5 Billion                 1043.23

• 50 nodes at EC2
• Each node: 8 cores, 7GB Memory, 320GB Disk
• Total: 400 cores, 320GB Memory, 70TB Disk Space

Leveraging Hadoop’s Scalability
[Figure: Time (in seconds) vs. Number of Ratings (in billions); series: Hadoop (handtuned), Jaql]

Leveraging R’s Rich Functionality
[Figure: Root Mean Square Error vs. Number of Iterations; series: Conjugate Gradient, L-BFGS]
  – optim( c(p,q), fe, fde, method="CG" )
  – optim( c(p,q), fe, fde, method="L-BFGS-B" )

Conclusion
• Scaled Latent Factor Models to
Terabytes of data
• Provided a bridge through which other algorithms with Summation Form can be mapped and scaled
  – Many algorithms have Summation Form
  – Decompose into “large part” and “small part”
  – [Chu et al., NIPS ‘06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM
• Future & Current Work
  – Tighter language integration
  – More algorithms
  – Performance tuning

Questions? Comments?