Large-scale Recommender Systems on Just a PC
LSRS 2013 keynote (RecSys ’13 Hong Kong)

Aapo Kyrölä
Ph.D. candidate @ CMU
http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov

Big Data – small machine
My Background
• Academic: 5th-year Ph.D. @ Carnegie Mellon.
  Advisors: Guy Blelloch, Carlos Guestrin (UW)
  – Shotgun: parallel L1-regularized regression solver (ICML 2011)
  – Internships at MSR Asia (2011) and Twitter (2012)
• Startup entrepreneur
  – Habbo: founded 2000
Outline of this talk
1. Why single-computer computing?
2. Introduction to graph computation and GraphChi
3. Recommender systems with GraphChi
4. Future directions & Conclusion
Large-Scale Recommender Systems on Just a PC
Why on a single machine?
Can’t we just use the Cloud?
Why use a cluster?
Two reasons:
1. One computer cannot handle my problem in a reasonable time.
2. I need to solve the problem very fast.
Why use a cluster?
Two reasons:
1. One computer cannot handle my problem in a reasonable time.
   → Our work expands the space of feasible (graph) problems on one machine:
   – Our experiments use the same graphs as, or bigger than, previous papers on distributed graph computation (and we can process the Twitter graph on a laptop).
   – Most data is not that “big”.
2. I need to solve the problem very fast.
   → Our work raises the bar on the performance required to justify a “complicated” (distributed) system.
Benefits of single-machine systems
Assuming it can handle your big problems…
1. Programmer productivity
   – Global state
   – Can use “real data” for development
2. Inexpensive to install and administer; less power.
3. Scalability (see next slide).
Efficient Scaling
[Figure: task throughput over time, 6 machines vs. 12 machines.
– Distributed graph system: (significantly) less than 2x throughput with 2x machines.
– Single-computer system (capable of big tasks), replicated across machines: exactly 2x throughput with 2x machines.]
GRAPH COMPUTATION AND
GRAPHCHI
Why graphs for recommender systems?
• Graph = matrix: edge(u,v) = M[u,v]
– Note: always sparse graphs
• Intuitive, human-understandable representation
– Easy to visualize and explain.
• Unifies collaborative filtering (typically matrix
based) with recommendation in social networks.
– Random walk algorithms.
• Local view → vertex-centric computation
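
As a tiny illustration of the graph = matrix view, a minimal C++ sketch (the types are illustrative, not GraphChi’s) that stores only the observed entries of a sparse rating matrix as bipartite edges:

    // Sparse matrix as an edge list: edge(u, v) carries M[u, v].
    #include <cstdio>
    #include <vector>

    struct Edge { int user; int item; float rating; };

    int main() {
        // Only observed ratings are stored; an absent pair is simply no edge.
        std::vector<Edge> edges = { {0, 0, 4.0f}, {0, 2, 3.0f}, {1, 1, 5.0f} };
        for (const Edge& e : edges)
            std::printf("M[%d,%d] = %.1f\n", e.user, e.item, e.rating);
    }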
Vertex-Centric Computational Model
• Graph G = (V, E)
  – directed edges: e = (source, destination)
  – each edge and vertex associated with a value (user-defined type)
  – vertex and edge values can be modified
    • (structure modification also supported)
[Figure: two vertices A and B; every vertex and edge carries a Data value.]
Vertex-centric Programming
• “Think like a vertex”
• Popularized by the Pregel and GraphLab projects
[Figure: a vertex and its neighborhood; each vertex and edge carries a Data value.]

MyFunc(vertex) { // modify neighborhood }
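
A minimal C++ sketch of the same idea (hypothetical GraphChi-like types, not the toolkit’s actual API): the update function sees one vertex plus the values on its incident edges, and may modify them:

    #include <vector>

    struct EdgeData { float value; };

    struct Vertex {
        float data;                                  // user-defined vertex value
        std::vector<EdgeData*> in_edges, out_edges;  // the neighborhood
    };

    void MyFunc(Vertex& v) {
        float sum = 0.0f;
        for (EdgeData* e : v.in_edges) sum += e->value;     // read neighbors
        v.data = sum;                                       // modify the vertex...
        for (EdgeData* e : v.out_edges) e->value = v.data;  // ...and its edges
    }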
What is GraphChi
[Figure: paper screenshots – both in OSDI ’12!]
The Main Challenge of Disk-based Graph Computation: Random Access
• Need ~5–10 M random edge accesses/sec to achieve “reasonable performance”, but storage delivers far less:
  – Hard disk: ~100s of random reads/writes per sec
  – SSD: ~100 K reads/sec (commodity); ~1 M reads/sec (high-end arrays)
Details: Kyrola, Blelloch, Guestrin: “Large-scale graph computation on just a PC” (OSDI 2012)
Parallel Sliding Windows
Only P large sequential reads for each interval (sub-graph), so P² reads for one full pass over the graph (e.g., with P = 16 shards, at most 256 large reads per pass).
GraphChi Program Execution

    For T iterations:
      For p = 1 to P:
        For v in interval(p):
          updateFunction(v)

which is logically equivalent to:

    For T iterations:
      For v = 1 to V:
        updateFunction(v)

“Asynchronous”: updates are immediately visible (vs. bulk-synchronous).
Performance
GraphChi can compute on the full Twitter follow-graph with just a standard laptop – approximately as fast as a very large Hadoop cluster! (Size of the graph in Fall 2013: > 20 B edges [Gupta et al. 2013].)
GraphChi is Open Source
• C++ and Java versions on GitHub: http://github.com/graphchi
  – The Java version has a Hadoop/Pig wrapper.
    • If you really, really want to use Hadoop.
RECSYS MODEL TRAINING
WITH GRAPHCHI
Overview of Recommender Systems for GraphChi
• Collaborative Filtering toolkit (next slide)
• Link prediction in large networks
  – Random-walk based approaches (Twitter)
  – Talk on Wednesday.
GraphChi’s Collaborative Filtering Toolkit
• Developed by Danny Bickson (CMU / GraphLab Inc)
• Includes:
  – Alternating Least Squares (ALS)
  – Sparse-ALS
  – SVD++
  – LibFM (factorization machines)
  – GenSGD
  – Item-similarity based methods
  – PMF
  – CliMF (contributed by Mark Levy)
  – …
See Danny’s blog for more information:
http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html
Note: in the C++ version; a Java version is in development by a CMU team.
TWO EXAMPLES: ALS AND
ITEM-BASED CF
Example: Alternating Least Squares Matrix Factorization (ALS)
• Task: predict ratings for items (movies) by users.
• Model: latent factor model (see next slide).
Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: “Large-Scale Parallel Collaborative Filtering for the Netflix Prize” (2008)
ALS: User – Item bipartite graph
[Figure: users connected by rating edges (4, 3, 2, 5, …) to movies such as “Women on the Verge of a Nervous Breakdown”, “The Celebration”, “City of God”, “Wild Strawberries”, and “La Dolce Vita”; each vertex carries a latent factor vector.]
A user’s rating of a movie is modeled as a dot-product: <factor(user), factor(movie)>.
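
A minimal sketch of the prediction in C++, with factor dimensionality D = 20 as in the Netflix experiment later in this talk:

    #include <array>

    constexpr int D = 20;                 // latent factor dimensionality
    using Factor = std::array<float, D>;

    // rating(user, movie) is modeled as <factor(user), factor(movie)>
    float predict_rating(const Factor& user, const Factor& movie) {
        float dot = 0.0f;
        for (int k = 0; k < D; ++k) dot += user[k] * movie[k];
        return dot;
    }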
ALS: GraphChi implementation
• Update function handles one vertex at a time (user or movie)
• For each user:
  – Estimate latent(user): minimize the least-squares error of the dot-product predicted ratings (sketched below)
• GraphChi executes the update function for each vertex (in parallel), and loads edges (ratings) from disk
  – Latent factors in memory: needs O(V) memory.
  – If the factors don’t fit in memory, they can be replicated to the edges, and thus stored on disk.
Scales to very large problems!
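
A sketch of the per-user solve (this assumes the Eigen linear-algebra library; the function and names are illustrative, not the toolkit’s actual API). ALS fixes the movie factors and solves the regularized normal equations for the user’s factor:

    #include <Eigen/Dense>
    #include <vector>

    // Solve x_u = (Y^T Y + lambda*I)^{-1} Y^T r, where the rows of Y are the
    // factors of the movies this user rated, and r holds the ratings.
    Eigen::VectorXd als_user_update(const std::vector<Eigen::VectorXd>& movie_factors,
                                    const std::vector<double>& ratings,
                                    double lambda, int D) {
        Eigen::MatrixXd A = lambda * Eigen::MatrixXd::Identity(D, D);
        Eigen::VectorXd b = Eigen::VectorXd::Zero(D);
        for (size_t i = 0; i < ratings.size(); ++i) {
            A += movie_factors[i] * movie_factors[i].transpose();  // Y^T Y
            b += ratings[i] * movie_factors[i];                    // Y^T r
        }
        return A.ldlt().solve(b);  // robust solve of the D x D system
    }

The movie-side update is symmetric: fix the user factors and solve the same small system per movie.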
ALS: Performance
[Chart: Matrix Factorization (Alternating Least Squares) on Netflix (99 M edges), D = 20; runtime in minutes (scale 0–12): GraphChi (Mac Mini) vs. GraphLab v1 (8 cores).]
Remark: Netflix is not a big problem, but GraphChi will scale at most linearly with input size (ALS is CPU-bound, so it should be sub-linear in #ratings).
Example: Item-Based CF
• Task: compute a similarity score [e.g. Jaccard] for each movie pair that has at least one viewer in common.
  – Similarity(X, Y) ~ # of common viewers
  – Output the top-K similar items for each item to a file…
  – …or: create an edge between X and Y containing the similarity.
• Problem: enumerating all pairs takes too much time.
Solution: enumerate all triangles of the graph.
New problem: how to enumerate triangles if the graph does not fit in RAM?
[Figure: user–movie graph with “Women on the Verge of a Nervous Breakdown”, “The Celebration”, “City of God”, “Wild Strawberries”, and “La Dolce Vita”.]
Enumerating Triangles (Item-CF)
• Triangles with edge (u, v) = intersection(neighbors(u), neighbors(v))
• Iterative, memory-efficient solution (next slide)
Algorithm:
• Let the pivots be a subset of the vertices.
• Load all the neighbor lists (adjacency lists) of the pivots into RAM.
• Use GraphChi to load all vertices from disk, one by one, and compare their adjacency lists to the pivots’ adjacency lists (similar to a merge; sketched below).
• Repeat with a new subset of pivots.
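
A minimal sketch of the merge-style intersection (illustrative, not the toolkit code); it assumes adjacency lists are kept sorted:

    #include <cstddef>
    #include <vector>

    // Count the common neighbors of a pivot and a streamed vertex; every
    // match closes one triangle with the edge (pivot, vertex).
    std::size_t common_neighbors(const std::vector<int>& pivot_adj,
                                 const std::vector<int>& vertex_adj) {
        std::size_t count = 0;
        auto a = pivot_adj.begin();
        auto b = vertex_adj.begin();
        while (a != pivot_adj.end() && b != vertex_adj.end()) {
            if (*a < *b)      ++a;
            else if (*b < *a) ++b;
            else { ++count; ++a; ++b; }   // common neighbor found
        }
        return count;
    }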
Triangle Counting Performance
[Chart: Triangle counting on twitter-2010 (1.5 B edges); runtime in minutes (scale 0–450): GraphChi (Mac Mini) vs. Hadoop (1636 machines).]
FUTURE DIRECTIONS &
FINAL REMARKS
Single-Machine Computing in Production?
• GraphChi supports incremental computation with dynamic graphs:
  – Can keep on running indefinitely, adding new edges to the graph → a constantly fresh model.
  – However, this requires engineering – it is not included in the toolkit.
• Compare to a cluster-based system (such as Hadoop) that needs to recompute from scratch.
Unified Recsys Platform for GraphChi?
• Working with masters students at CMU.
• Goal: the ability to easily compare different algorithms and parameters
  – Unified input and output.
  – General programmable API (not just file-based).
  – Evaluation process: several evaluation metrics; cross-validation, held-out data…
  – Run many algorithm instances in parallel, on the same graph.
  – Java.
• Scalable from the get-go.
[Architecture sketch: a DataDescriptor defines the input data (column1: categorical, column2: real, column3: key, column4: categorical). Each algorithm declares an input descriptor that maps the input data; a GraphChi preprocessor turns it into GraphChi input plus auxiliary data on disk. Training programs for algorithms X, Y, Z then run on the GraphChi input, emitting training metrics, and each algorithm’s predictor is evaluated against held-out (test) data to produce test-quality metrics.]
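
A hypothetical sketch of the DataDescriptor idea in C++ (the platform itself is planned in Java; all names here are illustrative only):

    #include <string>
    #include <utility>
    #include <vector>

    enum class ColumnType { Categorical, Real, Key };

    // Declares the schema of the input data so any algorithm can map it.
    struct DataDescriptor {
        std::vector<std::pair<std::string, ColumnType>> columns;
    };

    int main() {
        DataDescriptor dd;
        dd.columns = { {"column1", ColumnType::Categorical},
                       {"column2", ColumnType::Real},
                       {"column3", ColumnType::Key},
                       {"column4", ColumnType::Categorical} };
    }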
Recent developments: Disk-based Graph Computation
• Two more disk-based graph computation systems were recently published:
  – TurboGraph (KDD ’13)
  – X-Stream (SOSP ’13 in October)
• Significantly better performance than GraphChi on many problems
  – They avoid preprocessing (“sharding”)
  – But GraphChi can do some computations that X-Stream cannot (triangle counting and related); TurboGraph requires an SSD
  – Hot research area!
Do you need GraphChi – or any system?
• Heck, for many algorithms you can just mmap() over your (binary) adjacency list / sparse matrix and write a for-loop (see the sketch below).
  – See Lin, Chau, Kang: “Leveraging Memory Mapping for Fast and Scalable Graph Computation on a PC” (Big Data ’13)
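
A minimal POSIX sketch of that approach (it assumes the edge list was written to disk as a packed array of the struct below; the file name is hypothetical):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    struct Edge { int src; int dst; float value; };

    int main() {
        int fd = open("edges.bin", O_RDONLY);   // hypothetical input file
        if (fd < 0) return 1;
        struct stat st;
        fstat(fd, &st);
        // Map the whole file; the OS pages edges in and out as needed.
        const Edge* edges = static_cast<const Edge*>(
            mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
        size_t n = st.st_size / sizeof(Edge);
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i)          // "just write a for-loop"
            sum += edges[i].value;
        std::printf("%zu edges, value sum = %f\n", n, sum);
        munmap(const_cast<Edge*>(edges), st.st_size);
        close(fd);
    }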
• Obviously good to have a common API
– And some algos need more advanced solutions
(like GraphChi, X-Stream, TurboGraph)
Beware of the hype!
Conclusion
• Very large recommender algorithms can now be run on just your PC or laptop.
  – Additional performance from multi-core parallelism.
  – Great for productivity – scale by replicating.
• In general, good single-machine scalability requires care with data structures and memory management → natural with C/C++; with Java (etc.) you need low-level byte massaging.
  – Frameworks like GraphChi hide the low level.
• More work is needed to “productize” the current work.
Thank you!
Aapo Kyrölä
Ph.D. candidate @ CMU – soon to graduate! (Currently visiting UW.)
http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov