USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION
Tools to build your big data application
Ameya Kanitkar
Ameya Kanitkar – That’s me!
• Big Data Infrastructure Engineer @ Groupon, Palo Alto, USA (working on Deal Relevance & Personalization Systems)
ameya.kanitkar@gmail.com
http://www.linkedin.com/in/ameyakanitkar
@aktwits
Agenda
 Basics of Hadoop & HBase
 How you can use Hadoop & HBase for big data applications
 Case Study: Deal Relevance and Personalization Systems at Groupon with Hadoop & HBase
Big Data Application Examples
 Recommendation Systems
 Ad targeting
 Personalization Systems
 BI / Data Warehousing
 Log Analysis
 Natural Language Processing
So what is Hadoop?
 General-purpose framework for processing huge amounts of data
 Open Source
 Batch / Offline Oriented
Hadoop - HDFS
 Open-source distributed file system
 Stores large files, which can easily be accessed by applications built on top of HDFS
 Data is distributed and replicated over multiple machines
 Linux-style commands, e.g. ls, cp, mv, touchz, etc.
Hadoop – HDFS
 Example:
hadoop fs -dus /data/
185453399927478 bytes =~ 168 TB
(one of the folders from one of our Hadoop clusters)
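
HDFS is also accessible programmatically. Below is a minimal sketch using the Hadoop FileSystem Java API; the paths are hypothetical, and configuration is assumed to come from the cluster's standard config files.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // The same number "hadoop fs -dus" reports: total bytes under a directory
    long bytes = fs.getContentSummary(new Path("/data")).getLength();
    System.out.println(bytes + " bytes");

    // Open a (hypothetical) file and print its first bytes
    try (FSDataInputStream in = fs.open(new Path("/data/logs/part-00000"))) {
      byte[] buf = new byte[256];
      int n = in.read(buf);
      if (n > 0) {
        System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
      }
    }
  }
}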
Hadoop – Map Reduce
 Application framework built on top of HDFS to process your big data
 Operates on key-value pairs
 Mappers filter and transform input data
 Reducers aggregate mapper output
Example
• Given web logs, calculate the landing page conversion rate for each product
• So basically we need to see how many impressions each product received, and then calculate the conversion rate for each product
Map Reduce Example
Map Phase:
Map 1: Process a log file. Output: key (product ID), value (impression count)
Map 2: Process a log file. Output: key (product ID), value (impression count)
…
Map N: Process a log file. Output: key (product ID), value (impression count)
Reduce Phase:
Reducer: Here we receive all data for a given product. Just run a simple for loop to calculate the conversion rate. Output: (product ID, conversion rate)
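
Below is a minimal sketch of this job in Hadoop's Java MapReduce API. The log format (tab-separated event type and product ID) and the event names are assumptions made for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConversionRate {

  // Map phase: each mapper parses log lines and emits (product ID, event type).
  public static class LogMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Assumed format: eventType \t productId \t ...
      String[] fields = line.toString().split("\t");
      ctx.write(new Text(fields[1]), new Text(fields[0]));
    }
  }

  // Reduce phase: all events for one product arrive here; a simple loop
  // counts impressions and purchases, then emits the conversion rate.
  public static class RateReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text productId, Iterable<Text> events, Context ctx)
        throws IOException, InterruptedException {
      long impressions = 0, purchases = 0;
      for (Text event : events) {
        if ("impression".equals(event.toString())) impressions++;
        else if ("purchase".equals(event.toString())) purchases++;
      }
      if (impressions > 0) {
        ctx.write(productId, new DoubleWritable((double) purchases / impressions));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "conversion-rate");
    job.setJarByClass(ConversionRate.class);
    job.setMapperClass(LogMapper.class);
    job.setReducerClass(RateReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}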
Recap
 We just processed terabytes of data and calculated conversion rates across millions of products.
 Note: this is a batch process only. It takes time; you cannot start this process after someone visits your website.
How about we generate recommendations in a batch process and serve them in real time?
HBase
 Provides real-time random read/write access over HDFS
 Based on Google's 'Bigtable' design
 Open source
This is not an RDBMS, so there are no joins. Access patterns are generally simple, like get(key), put(key, value), etc.
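
A minimal sketch of these access patterns with the HBase Java client; the table, column family, and qualifier names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // put(key, value): write one cell into column family cf1
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("qual1"),
          Bytes.toBytes("some value"));
      table.put(put);

      // get(key): read the row back
      Result result = table.get(new Get(Bytes.toBytes("user1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("qual1"))));
    }
  }
}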
Row    | Cf:<qual>  | Cf:<qual>  | Cf:<qual> | …
Row 1  | Cf1:qual1  | Cf1:qual2  | Cf1:qual3 |
Row 11 | Cf1:qual2  | Cf1:qual22 |           |
Row 2  | Cf2:qual1  | …          |           |
Row N  |            |            |           |

 Dynamic column names: no need to define columns upfront
 Both rows and columns are (lexicographically) sorted
Row    | Cf:<qual>                                          | Cf:<qual>
user1  | Cf1:click_history:{actual_clicks_data}             | Cf1:purchases:{actual_purchases}
user11 | Cf1:purchases:{actual_purchases}                   |
user20 | Cf1:mobile_impressions:{actual_mobile_impressions} | Cf1:purchases:{actual_purchases}
Note: Each row has different columns, so think about this as a hash map rather than a table with rows and columns.
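
A short sketch of that hash-map view: reading a row back through the HBase client yields a sorted map of column qualifiers to values. The table handle and names here are hypothetical.

import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowAsMap {
  // Prints every qualifier = value pair stored in family cf1 for one user row.
  static void dumpRow(Table table, String userKey) throws Exception {
    Result row = table.get(new Get(Bytes.toBytes(userKey)));
    NavigableMap<byte[], byte[]> cells = row.getFamilyMap(Bytes.toBytes("cf1"));
    if (cells == null) return; // this row has no cells in the family
    for (Map.Entry<byte[], byte[]> cell : cells.entrySet()) {
      System.out.println(Bytes.toString(cell.getKey())
          + " = " + Bytes.toString(cell.getValue()));
    }
  }
}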
Putting it all together
 Store data in HDFS
 Generate recommendations (Map Reduce)
 Analyze data (Map Reduce)
 Serve real-time requests from Web and Mobile (HBase)
Do offline analysis in Hadoop, and serve real-time requests with HBase.
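
As a sketch of how the two halves meet, the job below generates recommendations in batch and writes them straight into an HBase table that the web and mobile tiers can read in real time. The table name, column family, and input format are all assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RecommendationLoader {
  // Turns each "userId \t recommendationsJson" line into an HBase Put.
  public static class RecMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      Put put = new Put(Bytes.toBytes(fields[0]));
      put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("recs"),
          Bytes.toBytes(fields[1]));
      ctx.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "load-recommendations");
    job.setJarByClass(RecommendationLoader.class);
    job.setMapperClass(RecMapper.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Map-only job: Puts go directly to the "recommendations" table.
    TableMapReduceUtil.initTableReducerJob("recommendations", null, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}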
Use Case: Deal Relevance & Personalization @ Groupon
What are Groupon Deals?
Our Relevance Scenario
How do we surface relevant deals to users?
 Deals are perishable (deals expire or are sold out)
 No direct user intent (as there is in traditional search advertising)
 Relatively limited user information
 Deals are highly local
Two Sides to the Relevance Problem
 Algorithmic issues: how to find relevant deals for individual users given a set of optimization criteria
 Scaling issues: how to handle relevance for all users across multiple delivery platforms
Developing Deal Ranking Algorithms
• Exploring data: understanding signals, finding patterns
• Building models/heuristics: employ both classical machine learning techniques and heuristic adjustments to estimate user purchasing behavior
• Conducting experiments: try out ideas on real users and evaluate their effect
Data Infrastructure
Growing deals: 20+ (2011), 400+ (2012), 2000+ (2013)
Growing users: 100 million+ subscribers
 We need to store data like user click history, email records, service logs, etc. This adds up to billions of data points and terabytes of data.
Deal Personalization Infrastructure Use Cases
• Offline system: deliver personalized emails. Personalize billions of emails for hundreds of millions of users.
• Online system: deliver a personalized website & mobile experience. Personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users & page views.
Architecture
 Relevance Map/Reduce jobs and a data pipeline feed an offline HBase cluster, which backs email personalization (the offline system)
 HBase replication keeps a separate online cluster in sync to serve real-time relevance
• We can now maintain different SLAs on the online and offline systems
• We can tune each HBase cluster differently for the online and offline systems
HBase Schema Design

User ID (row key)           | Column Family 1                          | Column Family 2
Unique identifier for users | User history and profile information     | Email history for users
                            | Overwrite user history and profile info  | Append each day's email history as a separate column (on average each row has over 200 columns)

• Most of our data access patterns are via the "user key"
• This makes it easy to design the HBase schema
• The actual data is kept in JSON
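
A minimal sketch of this schema in the HBase Java client; the family and qualifier names are assumptions, and the values are JSON strings as described above.

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UserSchema {
  // Column family 1: overwrite the profile cell (same qualifier every time).
  static void writeProfile(Table users, String userId, String profileJson)
      throws Exception {
    Put put = new Put(Bytes.toBytes(userId));
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("profile"),
        Bytes.toBytes(profileJson));
    users.put(put);
  }

  // Column family 2: append each day's email history as its own column,
  // using the date as the qualifier, e.g. cf2:2013-06-25 -> {...}.
  static void appendEmailHistory(Table users, String userId, String date,
      String emailJson) throws Exception {
    Put put = new Put(Bytes.toBytes(userId));
    put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes(date),
        Bytes.toBytes(emailJson));
    users.put(put);
  }
}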
Cluster Sizing
 Hadoop + HBase cluster: a 100+ machine Hadoop cluster that runs heavy map reduce jobs; the same cluster also hosts a 15-node HBase cluster
 HBase replication feeds the online cluster
 Online HBase cluster: a 10-machine dedicated HBase cluster to serve the real-time SLA
• Machine profile: 96 GB RAM (25 GB for HBase), 24 virtual CPU cores, 8 x 2 TB disks
• Data profile: 100 million+ records, 2 TB+ of data, over 4.2 billion data points
Questions?
Thank You!
(We are hiring!)
www.groupon.com/techjobs