Real-time learning

advertisement
Real-time and Long-time with
Storm and Hadoop
©MapR Technologies - Confidential
1
Real-time and Long-time with
Storm and Hadoop MapR
©MapR Technologies - Confidential
2

Contact:
–
–

Slides and such:
–

tdunning@maprtech.com
@ted_dunning
http://info.mapr.com/ted-uk-05-2012
Hash tag: #mapr_uk
Collective notes: http://bit.ly/JDCRhc
©MapR Technologies - Confidential
3
Company Background

MapR provides the industry’s best Hadoop Distribution
–

Combines the best of the Hadoop community
contributions with significant internally
financed infrastructure development
Background of Team
Deep management bench with extensive analytic,
storage, virtualization, and open source experience
– Google, EMC, Cisco, VMWare, Network Appliance, IBM,
Microsoft, Apache Foundation, Aster Data, Brio, ParAccel
–

Proven
–
–
–
MapR used across industries (Financial Services, Media,
Telcom, Health Care, Internet Services, Government)
Strategic OEM relationship with EMC and Cisco
Over 1,000 installs
©MapR Technologies - Confidential
4
Expanding Hadoop Use Cases
NFS for filebased
applications
Hadoop APIs
for Hadoop
Applications
Mission
Critical and SLA
dependent
Applications
Real-time
Applications
Blue = MapR Innovations
©MapR Technologies - Confidential
ODBC (JDBC)
for SQL-based
applications
5
MapR’s Complete Distribution for Apache Hadoop



Integrated, tested, hardened and
Supported
100% Hadoop, HBase,
HDFS API compatible
Easy portability/
migration between
distributions

Unique advanced
features

No changes required
to Hadoop applications

Runs on commodity
hardware
MapR Control System
MapR
Heatmap™
Hive
LDAP, NIS
Integration
Pig
Oozle
Mahout Cascading
Direct
Access
NFS
Sqoop
HBase
Naglos
Ganglia
Integration
Integration
Real-Time
Streaming Volumes Mirrors
No NameNode
Architecture
CLI,
REST APT
Quotas,
Alerts, Alarms
Snapshots
High Performance
Direct Shuffle
6
Flume
Zookeeper
Data
Placement
Stateful Failover
and Self Healing
MapR’s Storage
2.7 Services™
©MapR Technologies - Confidential
Whirr
So what about that real-time stuff?
©MapR Technologies - Confidential
7
The Challenge

Hadoop is great of processing vats of data
–

Storm is great for real-time processing
–

But sucks for real-time (by design!)
But lacks any way to deal with batch processing
It sounds like there isn’t a solution
–
Neither fashionable solution handles everything
©MapR Technologies - Confidential
8
This is not a problem.
It’s an opportunity!
©MapR Technologies - Confidential
9
Hadoop is Not Very Real-time
Unprocessed
Data
now
t
Fully Latest full
processed period
©MapR Technologies - Confidential
10
Hadoop job
takes this
long for this
data
Need to Plug the Hole in Hadoop

We have real-time data with limited state
–
–

We also have long-term analytics with lots of state
–
–

Exactly what Storm does
And what Hadoop does not
Exactly what Hadoop does
And what Storm does not
Can Storm and Hadoop be combined?
©MapR Technologies - Confidential
11
Real-time and Long-time together
Blended
View
view
now
t
Hadoop works
great back here
©MapR Technologies - Confidential
12
Storm
works
here
An Example

I want to know how many queries I get
–

Per second, minute, day, week
Results should be available
–
–
within <2 seconds 99.9+% of the time
within 30 seconds almost always

History should last >3 years

Should work for 0.001 q/s up to 100,000 q/s

Failure tolerant, yadda, yadda
©MapR Technologies - Confidential
13
Rough Design – Data Flow
Search
Engine
Query
QueryEvent
Event
Spout
Spout
Counter
Counter
Bolt
Bolt
Logger
Logger
Bolt
Bolt
Logger
Bolt
Semi
Agg
Snap
Hadoop
Aggregator
Raw
Logs
Long
agg
©MapR Technologies - Confidential
14
Counter Bolt Detail

Input: Labels to count

Output: Short-term semi-aggregated counts
–
(time-window, label, count)

Input is logged until next flush

Non-zero counts emitted on flush if
–
–


event count reaches threshold (typical 100K)
time since last count reaches threshold (typical 1-10s)
Tuples acked when counts emitted
Double count probability is > 0 but very small
©MapR Technologies - Confidential
15
Counter Bolt Counterintuitivity

Counts are emitted for same label, same time window many times
–
–
–
–

these are semi-aggregated
this is a feature
tuples can be acked within 1s
time windows can be much longer than 1s
No need to send same label to same bolt
–
speeds failure recovery
©MapR Technologies - Confidential
16
Design Flexibility

Tuples can be ack’ed as soon as they hit the log
–
–

Count flush interval can be extended without extending tuple
timeout
–

counter can recover state on failure
log is burn after write
Decreases currency of counts in semi-aggregates
Total bandwidth for log is typically not huge
–
All of twitter @10,000 messages per second = 10K x 2KB = 20MB/s
©MapR Technologies - Confidential
17
Counter Bolt No-nos

Cannot accumulate entire period in-memory
–
–
–

Tuples must be ack’ed much sooner
State must be persisted before ack’ing
State can easily grow too large to handle without disk access
Cannot persist entire count table at once
–
Incremental persistence required
©MapR Technologies - Confidential
18
Guarantees

Counter output volume is small-ish
–
–

the greater of k tuples per 100K inputs or k tuple/s
1 tuple/s/label/bolt for this exercise
Persistence layer must provide guarantees
–
–
distributed against node failure
must have either readable flush or closed-append

HDFS is distributed, but provides no guarantees and strange
semantics

MapRfs is distributed, provides all necessary guarantees
©MapR Technologies - Confidential
19
Failure Modes

Bolt failure
–
–
–
–

buffered tuples will go un’acked
after timeout, tuples will be resent
timeout ≈ 10s
if failure occurs after persistence, before acking, then double-counting is
possible
Storage (with MapR)
–
–
–
–
most failures invisible
a few continue within 0-2s, some take 10s
catastrophic cluster restart can take 2-3 min
logger can buffer this much easily
©MapR Technologies - Confidential
20
Presentation Layer

Presentation must
–
–
–

read recent output of Logger bolt
read relevant output of Hadoop jobs
combine semi-aggregated records
User will see
–
–
counts that increment within 0-2 s of events
seamless meld of short and long-term data
©MapR Technologies - Confidential
21
Example 2 – Real-time learning

My system has to
–
learn a response model
and
–
–

select training data
in real-time
Data rate up to 100K queries per second
©MapR Technologies - Confidential
22
Door Number 3 – AB testing in real-time

I have 15 versions of my landing page

Each visitor is assigned to a version
–

A conversion or sale or whatever can happen
–

Which version?
How long to wait?
Some versions of the landing page are horrible
–
Don’t want to give them traffic
©MapR Technologies - Confidential
23
Real-time Constraints

Selection must happen in <20 ms almost all the time

Training events must be handled in <20 ms

Failover must happen within 5 seconds

Client should timeout and back-off
–

no need for an answer after 500ms
State persistence required
©MapR Technologies - Confidential
24
Rough Design
Selector
Layer
DRPC Spout
Conversion
Detector
Query Event
Timed
SpoutJoin
Counter
Model
Bolt
Logger
Logger
Bolt
Bolt
Model
State
Raw
Logs
©MapR Technologies - Confidential
25
A Quick Diversion

You see a coin
–
–
What is the probability of heads?
Could it be larger or smaller than that?

I flip the coin and while it is in the air ask again
I catch the coin and ask again
I look at the coin (and you don’t) and ask again

Why does the answer change?


–
And did it ever have a single value?
©MapR Technologies - Confidential
26
A First Conclusion

Probability as expressed by humans is subjective and depends on
information and experience
©MapR Technologies - Confidential
27
A Second Diversion

What is the mass of the moon?
–
–
–

Is that the exact number?
–

1/2 degree @ 385 Mm = ~ 3.8 Mm diameter (really about 3.4-ish)
V = 1/6 x pi x 3.83 x 1018 m3 = ~ 29 x 1018 m3 (really about 22)
m = rho V = 4 Mg/m3 x 29 x 1018 m3 = 1.2 x 1023 kg (really about 0.7)
Shouldn’t we have confidence bounds?
Wikipedia says: 7.3477 × 1022 kg
–
–
Is that the exact number?
Shouldn’t they have confidence bounds?
©MapR Technologies - Confidential
28
A Second Conclusion

A single number is a bad way to express uncertain knowledge

A distribution of values might be better
©MapR Technologies - Confidential
29
I Dunno
©MapR Technologies - Confidential
30
5 and 5
©MapR Technologies - Confidential
31
2 and 10
©MapR Technologies - Confidential
32
Bayesian Bandit

Compute distributions based on data

Sample p1 and p2 from these distributions

Put a coin in bandit 1 if p1 > p2

Else, put the coin in bandit 2
©MapR Technologies - Confidential
33
And it works!
0.12
0.11
0.1
0.09
0.08
regret
0.07
0.06
ε- greedy, ε = 0.05
0.05
0.04
Bayesian Bandit with Gam m a- Norm al
0.03
0.02
0.01
0
0
100
200
300
400
600
500
n
©MapR Technologies - Confidential
34
700
800
900
1000
1100
Video Demo
©MapR Technologies - Confidential
35
The Code

Select an alternative
n = dim(k)[1]
p0 = rep(0, length.out=n)
for (i in 1:n) {
p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
}
return (which(p0 == max(p0)))

Select and learn
for (z in 1:steps) {
i = select(k)
j = test(i)
k[i,j] = k[i,j]+1
}
return (k)

But we already know how to count!
©MapR Technologies - Confidential
36
The Basic Idea

We can encode a distribution by sampling

Sampling allows unification of exploration and exploitation

Can be extended to more general response models
©MapR Technologies - Confidential
37

Contact:
–
–

tdunning@maprtech.com
@ted_dunning
Slides and such:
–
http://info.mapr.com/ted-uk-05-2012
©MapR Technologies - Confidential
39
MapR’s Innovations
©MapR Technologies - Confidential
40
Thank You
©MapR Technologies - Confidential
41
Download