Large-scale Recommender Systems on Just a PC
LSRS 2013 keynote (RecSys '13, Hong Kong)
Aapo Kyrölä
Ph.D. candidate @ CMU
http://www.cs.cmu.edu/~akyrola   Twitter: @kyrpov
Big Data – small machine

My Background
• Academic: 5th-year Ph.D. student @ Carnegie Mellon.
  – Advisors: Guy Blelloch, Carlos Guestrin (UW)
  – Shotgun: parallel L1-regularized regression solver (ICML 2011)
  – Internships at MSR Asia (2011) and Twitter (2012)
• Startup entrepreneur: founder of Habbo (2000)

Outline of this talk
1. Why single-computer computing?
2. Introduction to graph computation and GraphChi
3. Recommender systems with GraphChi
4. Future directions & conclusion

Why on a single machine? Can't we just use the Cloud?

Why use a cluster? Two reasons:
1. One computer cannot handle my problem in a reasonable time.
2. I need to solve the problem very fast.

Why use a cluster? The two reasons, revisited:
1. One computer cannot handle my problem in a reasonable time.
   – Our work expands the space of feasible (graph) problems on one machine:
     our experiments use the same graphs as previous papers on distributed
     graph computation, or bigger (+ we can do the Twitter graph on a laptop).
   – Most data is not that "big" anyway.
2. I need to solve the problem very fast.
   – Our work raises the bar on the performance required to justify a
     "complicated" distributed system.

Benefits of single-machine systems
Assuming it can handle your big problems…
1. Programmer productivity
   – Global state
   – Can use "real data" for development
2. Inexpensive to install and administer; less power.
3. Scalability (next slide).

Efficient Scaling
[Figure: going from 6 to 12 machines, a distributed graph system gets
(significantly) less than 2x throughput with 2x machines; replicated
single-computer systems, each capable of big tasks, get exactly 2x
throughput with 2x machines in the same time T.]

GRAPH COMPUTATION AND GRAPHCHI

Why graphs for recommender systems?
• Graph = matrix: edge(u,v) = M[u,v]
  – Note: always sparse graphs
• Intuitive, human-understandable representation
  – Easy to visualize and explain.
• Unifies collaborative filtering (typically matrix-based) with
  recommendation in social networks.
  – Random-walk algorithms.
• Local view → vertex-centric computation

Vertex-Centric Computational Model
• Graph G = (V, E)
  – Directed edges: e = (source, destination)
  – Each edge and vertex is associated with a value (user-defined type)
  – Vertex and edge values can be modified
    • (structure modification also supported)
[Figure: two vertices A and B, with data values attached to the vertices
and to their edges.]

Vertex-centric Programming
• "Think like a vertex"
• Popularized by the Pregel and GraphLab projects
[Figure: MyFunc(vertex) { // modify neighborhood } reads and writes the
data of a vertex and its edges.]

What is GraphChi
• Disk-based, vertex-centric graph computation on a single machine.
[Figure captioned "Both in OSDI'12!"]

The Main Challenge of Disk-based Graph Computation: Random Access
• Need 5–10 M random edge accesses / sec to achieve "reasonable
  performance" – far more (<<) than storage delivers:
  – ~100s of reads/writes per sec (hard disk)
  – ~100K reads / sec (commodity SSD)
  – ~1M reads / sec (high-end SSD arrays)
• Details: Kyrola, Blelloch, Guestrin: "Large-scale graph computation on
  just a PC" (OSDI 2012)

Parallel Sliding Windows
• Only P large reads for each interval (sub-graph); P² reads on one
  full pass.

GraphChi Program Execution
• Logically:
    For T iterations:
      For v = 1 to V:
        updateFunction(v)
• With Parallel Sliding Windows:
    For T iterations:
      For p = 1 to P:
        For v in interval(p):
          updateFunction(v)
• "Asynchronous": updates are immediately visible to subsequent updates
  (vs. bulk-synchronous execution).
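To make the execution model concrete, here is a minimal, self-contained C++
sketch: a toy in-memory graph, a PageRank-style update function standing in
for MyFunc(vertex), and the T-iterations / P-intervals driver loop. The
structs and names are illustrative stand-ins, not GraphChi's actual API; in
GraphChi, each interval's edges would be streamed from disk with parallel
sliding windows rather than held in RAM.

    #include <cstdio>
    #include <vector>

    // Illustrative data model: one modifiable value per vertex and per edge,
    // as in the vertex-centric model above. Not GraphChi's real types.
    struct EdgeData { int src, dst; float value; };
    struct Vertex   { float value = 1.0f; std::vector<int> in_edges, out_edges; };

    // "Think like a vertex": read in-edge values, write the vertex value,
    // broadcast to out-edges. A PageRank-style rule serves as a placeholder.
    void updateFunction(int vid, std::vector<Vertex>& V, std::vector<EdgeData>& E) {
        Vertex& v = V[vid];
        float sum = 0.0f;
        for (int ei : v.in_edges) sum += E[ei].value;
        v.value = 0.15f + 0.85f * sum;
        float share = v.out_edges.empty() ? 0.0f : v.value / v.out_edges.size();
        for (int ei : v.out_edges) E[ei].value = share;  // "asynchronous": visible at once
    }

    int main() {
        // Toy 3-vertex cycle 0 -> 1 -> 2 -> 0.
        std::vector<EdgeData> E = {{0, 1, 0.f}, {1, 2, 0.f}, {2, 0, 0.f}};
        std::vector<Vertex> V(3);
        for (int i = 0; i < (int)E.size(); ++i) {
            V[E[i].src].out_edges.push_back(i);
            V[E[i].dst].in_edges.push_back(i);
        }
        const int T = 4;                       // iterations
        const int P = 3;                       // intervals; one vertex each here
        for (int t = 0; t < T; ++t)
            for (int p = 0; p < P; ++p)        // interval(p) = { p } in this toy
                updateFunction(p, V, E);
        for (int i = 0; i < 3; ++i) std::printf("vertex %d: %.3f\n", i, V[i].value);
        return 0;
    }

Because each update reads edge values its neighbors just wrote, the sketch
also shows the "asynchronous" semantics: within one pass, later vertices
already see earlier vertices' updates.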
Performance
• GraphChi can compute on the full Twitter follow-graph with just a
  standard laptop – approximately as fast as a very large Hadoop cluster!
  (Size of the graph in Fall 2013: > 20B edges [Gupta et al. 2013].)

GraphChi is Open Source
• C++ and Java versions on GitHub: http://github.com/graphchi
  – The Java version has a Hadoop/Pig wrapper.
    • If you really, really want to use Hadoop.

RECSYS MODEL TRAINING WITH GRAPHCHI

Overview of Recommender Systems for GraphChi
• Collaborative filtering toolkit (next slide)
• Link prediction in large networks
  – Random-walk based approaches (Twitter)
  – Talk on Wednesday.

GraphChi's Collaborative Filtering Toolkit
• Developed by Danny Bickson (CMU / GraphLab Inc)
• Includes:
  – Alternating Least Squares (ALS)
  – Sparse-ALS
  – SVD++
  – LibFM (factorization machines)
  – GenSGD
  – Item-similarity based methods
  – PMF
  – CliMF (contributed by Mark Levy)
  – …
• See Danny's blog for more information:
  http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html
• Note: in the C++ version; a Java version is in development by a CMU team.

TWO EXAMPLES: ALS AND ITEM-BASED CF

Example: Alternating Least Squares Matrix Factorization (ALS)
• Task: predict ratings for items (movies) by users.
• Model: latent factor model (see next slide).
• Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: "Large-Scale
  Parallel Collaborative Filtering for the Netflix Prize" (2008)

ALS: User – Item bipartite graph
[Figure: users and movies (Women on the Verge of a Nervous Breakdown,
The Celebration, City of God, Wild Strawberries, La Dolce Vita) form a
bipartite graph; each edge carries a rating (4, 3, 2, 5) and each vertex a
latent factor vector.]
• A user's rating of a movie is modeled as a dot-product:
  <factor(user), factor(movie)>

ALS: GraphChi implementation
• The update function handles one vertex at a time (user or movie).
• For each user:
  – Estimate latent(user): minimize the least-squares error of the
    dot-product predicted ratings. (A minimal sketch of this step appears
    below, after the item-based CF example.)
• GraphChi executes the update function for each vertex (in parallel) and
  loads the edges (ratings) from disk.
  – Latent factors in memory: need O(V) memory.
  – If the factors don't fit in memory, they can be replicated to the edges
    and thus stored on disk. → Scales to very large problems!

ALS: Performance
[Bar chart: matrix factorization (Alternating Least Squares) on Netflix
(99M edges), D = 20; runtime in minutes (x-axis 0–12) for GraphChi
(Mac Mini) vs. GraphLab v1 (8 cores).]
• Remark: Netflix is not a big problem, but GraphChi will scale at most
  linearly with input size (ALS is CPU-bound, so runtime should be
  sub-linear in the number of ratings).

Example: Item-Based CF
• Task: compute a similarity score [e.g. Jaccard] for each movie pair that
  has at least one viewer in common.
  – Similarity(X, Y) ~ # common viewers
  – Output the top-K similar items for each item to a file…
  – … or: create an edge between X and Y containing the similarity.
• Problem: enumerating all pairs takes too much time.
• Solution: enumerate all triangles of the graph.
• New problem: how to enumerate the triangles if the graph does not fit
  in RAM?

Enumerating Triangles (Item-CF)
• Triangles with edge (u, v) = intersection(neighbors(u), neighbors(v))
• Iterative, memory-efficient solution (see the sketch below):
  1. Let the pivots be a subset of the vertices.
  2. Load all neighbor-lists (adjacency lists) of the pivots into RAM.
  3. Use GraphChi to load all vertices from disk, one by one, and compare
     their adjacency lists to the pivots' adjacency lists (similar to a
     merge).
  4. Repeat with a new subset of pivots.
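A minimal, self-contained C++ sketch of the pivot scheme, with a toy
in-memory graph standing in for the disk-resident one; the point of the
real scheme is that only the pivots' adjacency lists must fit in RAM while
the other vertices stream from disk. The sorted-list intersection is the
"similar to a merge" comparison from the slide; the names and the batch
size are illustrative.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // |intersection(N(u), N(v))| over two sorted adjacency lists = number of
    // triangles containing edge (u, v). This is the merge-style comparison.
    long common_neighbors(const std::vector<int>& a, const std::vector<int>& b) {
        long count = 0;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] == b[j])     { ++count; ++i; ++j; }
            else if (a[i] < b[j]) ++i;
            else                  ++j;
        }
        return count;
    }

    int main() {
        // Toy undirected graph (sorted adjacency lists). Two triangles:
        // {0,1,2} and {0,2,3}.
        std::vector<std::vector<int>> adj = {{1, 2, 3}, {0, 2}, {0, 1, 3}, {0, 2}};
        const size_t BATCH = 2;   // how many pivots' lists fit in RAM at once
        long edge_triangles = 0;
        for (size_t start = 0; start < adj.size(); start += BATCH) {
            size_t end = std::min(adj.size(), start + BATCH);  // pivots [start, end)
            // "Stream" every vertex and compare it against the in-memory pivots;
            // v > p ensures each edge (p, v) is processed exactly once overall.
            for (size_t v = 0; v < adj.size(); ++v)
                for (size_t p = start; p < end; ++p)
                    if (v > p && std::binary_search(adj[v].begin(), adj[v].end(), (int)p))
                        edge_triangles += common_neighbors(adj[p], adj[v]);
        }
        // Every triangle has 3 edges, so each was counted 3 times.
        std::printf("triangles: %ld\n", edge_triangles / 3);
        return 0;
    }

For item-based CF, one would accumulate the intersection size (or a Jaccard
score) per item pair instead of summing into a single global count.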
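And here is the per-user ALS step referenced above, again as a
self-contained C++ sketch rather than the toolkit's actual code. Following
Zhou et al., a user's new factor vector solves the regularized normal
equations built from the factors of the movies the user rated (in GraphChi
these are read off the vertex's edges); D and the regularization constant
are illustrative choices.

    #include <cmath>
    #include <cstdio>
    #include <utility>
    #include <vector>

    const int    D      = 5;     // latent dimension (the slides use D = 20)
    const double LAMBDA = 0.05;  // illustrative regularization strength

    // Solve the small D x D system A x = b by Gaussian elimination with
    // partial pivoting; plenty for the tiny systems ALS produces.
    std::vector<double> solve(std::vector<std::vector<double>> A, std::vector<double> b) {
        for (int i = 0; i < D; ++i) {
            int piv = i;
            for (int r = i + 1; r < D; ++r)
                if (std::fabs(A[r][i]) > std::fabs(A[piv][i])) piv = r;
            std::swap(A[i], A[piv]);
            std::swap(b[i], b[piv]);
            for (int r = i + 1; r < D; ++r) {
                double f = A[r][i] / A[i][i];
                for (int c = i; c < D; ++c) A[r][c] -= f * A[i][c];
                b[r] -= f * b[i];
            }
        }
        std::vector<double> x(D);
        for (int i = D - 1; i >= 0; --i) {
            double s = b[i];
            for (int c = i + 1; c < D; ++c) s -= A[i][c] * x[c];
            x[i] = s / A[i][i];
        }
        return x;
    }

    // One ALS update for a single user vertex: gather the factors of the
    // rated movies, form (M^T M + lambda*n*I) x = M^T r, solve for the user.
    std::vector<double> update_user(const std::vector<std::vector<double>>& movie_factors,
                                    const std::vector<double>& ratings) {
        std::vector<std::vector<double>> A(D, std::vector<double>(D, 0.0));
        std::vector<double> b(D, 0.0);
        for (size_t e = 0; e < ratings.size(); ++e) {
            for (int i = 0; i < D; ++i) {
                b[i] += movie_factors[e][i] * ratings[e];
                for (int j = 0; j < D; ++j)
                    A[i][j] += movie_factors[e][i] * movie_factors[e][j];
            }
        }
        for (int i = 0; i < D; ++i) A[i][i] += LAMBDA * ratings.size();
        return solve(A, b);
    }

    int main() {
        // Toy: one user rated two movies (4 and 2) with fixed latent factors.
        std::vector<std::vector<double>> mf = {
            {0.4, 2.3, -1.8, 2.9, 1.2},
            {2.3, 2.5, 3.9, 0.02, 0.04}
        };
        std::vector<double> f = update_user(mf, {4.0, 2.0});
        for (double v : f) std::printf("%.3f ", v);
        std::printf("\n");
        return 0;
    }

Movie vertices are updated symmetrically from their raters' user factors,
and the two phases alternate; that alternation is what reduces each step
to a plain least-squares solve.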
Triangle Counting Performance
[Bar chart: triangle counting on twitter-2010 (1.5B edges); runtime in
minutes (x-axis 0–450) for GraphChi (Mac Mini) vs. Hadoop (1636 machines).]

FUTURE DIRECTIONS & FINAL REMARKS

Single-Machine Computing in Production?
• GraphChi supports incremental computation with dynamic graphs:
  – It can keep running indefinitely, adding new edges to the graph
    → a constantly fresh model.
  – However, this requires engineering – it is not included in the toolkit.
• Compare to a cluster-based system (such as Hadoop) that needs to compute
  from scratch each time.

Unified Recsys Platform for GraphChi?
• Working with master's students at CMU.
• Goal: the ability to easily compare different algorithms and parameters.
  – Unified input and output.
  – General programmable API (not just file-based).
  – Evaluation process: several evaluation metrics; cross-validation,
    held-out data…
  – Run many algorithm instances in parallel, on the same graph.
  – Java.
• Scalable from the get-go.
[Architecture diagram: input data plus a DataDescriptor data definition
(column1: categorical, column2: real, column3: key, column4: categorical)
is mapped by an algorithm-specific input descriptor, map(input:
DataDescriptor), through the GraphChi preprocessor into GraphChi input and
aux data on disk; training programs for algorithms X, Y, and Z consume it,
producing training metrics and, together with held-out (test) data, an
Algorithm X predictor with test quality metrics.]

Recent developments: Disk-based Graph Computation
• Two disk-based graph computation systems were published recently:
  – TurboGraph (KDD'13)
  – X-Stream (SOSP'13 in October)
• Significantly better performance than GraphChi on many problems.
  – They avoid preprocessing ("sharding").
  – But GraphChi can do some computations that X-Stream cannot (triangle
    counting and related); TurboGraph requires an SSD.
  – Hot research area!

Do you need GraphChi – or any system?
• Heck, for many algorithms you can just mmap() over your (binary)
  adjacency list / sparse matrix and write a for-loop (a sketch is in the
  appendix at the end).
  – See Lin, Chau, Kang: "Leveraging Memory Mapping for Fast and Scalable
    Graph Computation on a PC" (Big Data '13)
• Obviously it is good to have a common API.
  – And some algorithms need more advanced solutions (like GraphChi,
    X-Stream, TurboGraph).
• Beware of the hype!

Conclusion
• Very large recommender algorithms can now be run on just your PC or
  laptop.
  – Additional performance from multi-core parallelism.
  – Great for productivity – scale by replicating.
• In general, good single-machine scalability requires care with data
  structures and memory management: natural with C/C++; with Java (etc.)
  you need low-level byte massaging.
  – Frameworks like GraphChi hide the low-level details.
• More work is needed to "productize" the current work.

Thank you!
Aapo Kyrölä
Ph.D. candidate @ CMU – soon to graduate! (Currently visiting UW)
http://www.cs.cmu.edu/~akyrola   Twitter: @kyrpov
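Appendix: a minimal sketch of the mmap()-and-for-loop approach from the
"Do you need GraphChi" slide. The on-disk format here (consecutive int32
(src, dst) pairs in a file) is an assumption for illustration; the point
is that a memory-mapped binary edge list plus a plain loop already gives
sequential, OS-paged access to data bigger than RAM, with no framework
at all.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Compute out-degrees of a graph stored as raw (int32 src, int32 dst)
    // pairs, via mmap: the OS pages the file in and out as the loop scans it.
    int main(int argc, char** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s edges.bin\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
        void* map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }
        const int32_t* edges = static_cast<const int32_t*>(map);
        size_t num_edges = (size_t)st.st_size / (2 * sizeof(int32_t));

        int32_t max_vertex = 0;                      // pass 1: find |V|
        for (size_t e = 0; e < num_edges; ++e)
            max_vertex = std::max({max_vertex, edges[2 * e], edges[2 * e + 1]});
        std::vector<int64_t> outdeg((size_t)max_vertex + 1, 0);
        for (size_t e = 0; e < num_edges; ++e)       // pass 2: the "for-loop"
            ++outdeg[edges[2 * e]];

        int64_t max_degree = 0;
        for (int64_t d : outdeg) max_degree = std::max(max_degree, d);
        std::printf("%zu edges, %d vertices, max out-degree %lld\n",
                    num_edges, max_vertex + 1, (long long)max_degree);
        munmap(map, st.st_size);
        close(fd);
        return 0;
    }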