Muppet
Scalable MapUpdate data-stream processing
Wang Lam, Lu Liu, STS Prasad, Anand Rajaraman, Zoheb Vacheri, AnHai Doan
@WalmartLabs
1
Road Map
• Motivation
• The MapUpdate framework
• An example data-stream computation
• Muppet implementation
2
The challenge
• Growing numbers of large, fast data streams
– 300+ million Twitter status updates daily
– 5+ million Foursquare checkins daily
– 3+ billion Facebook Likes and comments daily
• Streams never stop
• Growing numbers of applications for data streams
– Computations need to scale with the data
– Applications need to stay up-to-date (“What’s going on now?”)
• Machines fail
3
The wish list
• Deliver low-latency processing
– Application stays near real-time with its input stream
– Computed data can be queried live
• Scale on commodity hardware as computation and stream rate grow
• Easy to program
– Simple model to enable rapid development of many applications
– Ideally resemble widely adopted MapReduce
4
Data-stream computation
• Big data: MapReduce (Hadoop)
– Map and Reduce steps
– Batch process large input (e.g., from HDFS)
– Hadoop distributes computation
• Fast data: MapUpdate (Muppet)
– Map and Update steps
– Continuously process streaming input (e.g., from network)
– Muppet maintains computation and manages memory/storage
5
The MapReduce framework (Hadoop)
• Event
– A <key, value> pair of data
• Map
– A function that performs (stateless) computation on incoming events
• Reduce
– A function that combines all input for a particular key
• Application
– Map -> Reduce
6
The MapUpdate framework (Muppet)
• Event
– A <key, value> pair of data
• Map
– A function that performs (stateless) computation on incoming events
• Update
– A function that updates a slate using incoming events
• Application
– A directed graph of Mappers and Updaters
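As a rough illustration of the model on this slide, the Perl sketch below wires one stateless map function to one update function that keeps a slate per key. It is an in-memory stand-in under stated assumptions: the names map_event, update_event, process, and %slates are hypothetical and are not Muppet's API, and the word-count logic is only a placeholder computation.

```perl
use strict;
use warnings;

# Hypothetical in-memory model of MapUpdate; not Muppet's API.
my %slates;    # slate store: one hash reference per updater key

# Map: stateless; turns one input event into zero or more [key, value] events.
sub map_event {
    my ($key, $value) = @_;
    return map { [ lc($_), 1 ] } grep { length } split /\W+/, $value;
}

# Update: folds one event into the slate that belongs to the event's key.
sub update_event {
    my ($key, $value, $slate) = @_;
    $slate->{count} += $value;
    return $slate;
}

# Driver: route every mapped event to the slate that shares its key.
sub process {
    my ($key, $value) = @_;
    for my $event (map_event($key, $value)) {
        my ($k, $v) = @$event;
        $slates{$k} = update_event($k, $v, $slates{$k} // {});
    }
}

process(1, "fast data");
process(2, "big data");
print "$_: $slates{$_}{count}\n" for sort keys %slates;
```

Unlike a Reduce step, the update function never sees all input for a key at once; it sees one event plus whatever the slate has accumulated so far.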
7
A MapUpdate application
8
An example Muppet application
Checkin counts on Foursquare
• Identify Foursquare checkins at various retailers
• Maintain a live count of retailer checkins
• Enable a display of the current counts at any time
9
An example Muppet application
Checkin counts on Foursquare
• Source: Read Foursquare stream and create key-value-pair events.
• Map: For each checkin event, identify a retailer and publish if found.
• Update: For each retailer checkin, increment appropriate count.
Updater slates hold live retailer checkin counts.
10
An example Muppet application
• Source: Read Foursquare stream and create key-value-pair events.
Input (excerpt):
{
  "checkin": {
    "created": 1288052432,
    "venue": {
      "id": 453407,
      "name": "Walmart Neighborhood Market"
    }
  }
}
11
An example Muppet application
• Source: Read Foursquare stream and create key-value-pair events.
Output:
453407,
{
  "checkin": {
    "created": 1288052432,
    "venue": {
      "id": 453407,
      "name": "Walmart Neighborhood Market"
    }
  }
}
12
An example Muppet application
• Map: For each checkin event, identify a retailer and publish if found.
Input:
453407,
{
  "checkin": {
    "created": 1288052432,
    "venue": {
      "id": 453407,
      "name": "Walmart Neighborhood Market"
    }
  }
}
13
An example Muppet application
• Map: For each checkin event, identify a retailer and publish if found.
Output:
Walmart.1288052100,
{
  "checkin": {
    "created": 1288052432,
    "venue": {
      "id": 453407,
      "name": "Walmart Neighborhood Market"
    }
  },
  "kosmix": {
    "timeslot": 1288052100,
    "interval": 900,
    "retailer": "Walmart"
  }
}
14
An example Muppet application
• Update: For each retailer checkin, increment appropriate count.
Input:
Walmart.1288052100,
{
  "checkin": {
    "created": 1288052432,
    "venue": {
      "id": 453407,
      "name": "Walmart Neighborhood Market"
    }
  },
  "kosmix": {
    "timeslot": 1288052100,
    "interval": 900,
    "retailer": "Walmart"
  }
}
15
An example Muppet application
• Update: For each retailer checkin, increment appropriate count.
Slate:
Walmart.1288052100,
{
  "retailer": "Walmart",
  "timeslot": 1288052100,
  "interval": 900,
  "count": 1
}
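The three steps can be traced end to end on the sample checkin with a short standalone Perl script. This is a sketch that mimics the Source, Map, and Update logic without Muppet itself: the %slates hash stands in for Muppet's slate store, the retailer match is reduced to the Walmart case, and JSON::PP is used only because it ships with Perl.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# The sample checkin from the slides, as it might arrive on the stream.
my $raw = '{"checkin":{"created":1288052432,'
        . '"venue":{"id":453407,"name":"Walmart Neighborhood Market"}}}';
my $event = decode_json($raw);

# Source step: key the event by venue id.
my $source_key = $event->{checkin}{venue}{id};

# Map step: round down to a 15-minute timeslot and look for a retailer.
my $interval = 900;
my $timeslot = int($event->{checkin}{created} / $interval) * $interval;
my $retailer = $event->{checkin}{venue}{name} =~ /wal.*mart/i ? 'Walmart' : '';
my $map_key  = "$retailer.$timeslot";

# Update step: %slates is a hypothetical stand-in for the slate store.
my %slates;
my $slate = $slates{$map_key} //= {};
$slate->{retailer} = $retailer;
$slate->{timeslot} = $timeslot;
$slate->{interval} = $interval;
$slate->{count}   += 1;

print "source key: $source_key\n";
print "map key:    $map_key\n";
print "count:      $slates{$map_key}{count}\n";
```

The timeslot arithmetic is what turns the creation time 1288052432 into the 1288052100 seen in the Map output key.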
16
The Source (stream receiver)
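# Read one checkin per line from the socket and publish it as an event.
while ($checkin = <$sock>) {
    $checkin =~ s/^[^{]*//;    # drop anything before the JSON object
    next if ($checkin eq "");
    $checkin_count++;
    my $event;
    eval { $event = decode_json($checkin); };
    if ($@ or (!defined($event->{checkin}))) {
        $invalid_count++;
    } else {
        # Key the event by venue id. The whole event is published, so it
        # matches the Output shown earlier and the Mapper can still read
        # $event->{checkin}.
        my $venue = $event->{checkin}->{venue}->{id};
        $self->publish("FoursquareCheckin", $event, $venue);
    }
}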
17
The Map (Foursquare::CheckinMapper)
sub map {
    my $self = shift;
    my $event = shift;
    my $checkin = $event->{checkin};

    # Round the checkin time down to a 15-minute (900 s) timeslot.
    my $timeslot = int($checkin->{created} / 900) * 900;
    $event->{kosmix}->{timeslot} = $timeslot;
    $event->{kosmix}->{interval} = 900;

    # Match the venue name against the retailers of interest.
    my $venue_name = $checkin->{venue}->{name};
    my $retailer = 0;
    $retailer = 'ToysRUs' if ($venue_name =~ /toys.*r.*us/i);
    $retailer = 'Walmart' if ($venue_name =~ /wal.*mart/i);
    $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);

    # Publish only recognized retailers, keyed by retailer and timeslot.
    if ($retailer) {
        $event->{kosmix}->{retailer} = $retailer;
        $self->publish("FoursquareRetailerCheckin", $event,
                       $retailer.".".$timeslot);
    }
}
18
The Update (Foursquare::RetailerUpdater)
use Muppet::Updater;
package Foursquare::RetailerUpdater;
@ISA = qw( Muppet::Updater );
use strict;
sub update {
    my $self = shift;
    my $event = shift;
    my $slate = shift;
    my $config = shift;
    my $key = shift;
    $slate->{timeslot} = $event->{kosmix}->{timeslot};
    $slate->{interval} = $event->{kosmix}->{interval};
    $slate->{retailer} = $event->{kosmix}->{retailer};
    $slate->{count}   += 1;
    return $slate;
}
1;
19
The application configuration (flow graph)
{
  "performer" : "foursquare_mapper",
  "type" : "perl",
  "class" : "Foursquare::CheckinMapper",
  "muppet_type" : "Mapper",
  "subscribes_to" : [ "FoursquareCheckin" ],
  "publishes_to" : [ "FoursquareRetailerCheckin" ]
},
{
  "performer" : "foursquare_retailer",
  "type" : "perl",
  "class" : "Foursquare::RetailerUpdater",
  "muppet_type" : "Updater",
  "workers" : 4,
  "slate_cache_max" : 10000,
  "slate_cache_write_after" : 1,
  "subscribes_to" : [ "FoursquareRetailerCheckin" ]
}
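To see how such a configuration describes a flow graph, the Perl sketch below builds a stream-to-subscriber table from the two performer entries and follows the events from the source stream. It is an illustration only: wrapping the entries in a JSON array, the %subscribers table, and the path_from helper are assumptions made for this example, not a description of how Muppet loads or interprets its configuration.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# The two performers from the slide, trimmed to the routing fields and
# wrapped in an array (the enclosing array is an assumption).
my $config = decode_json(<<'JSON');
[
  { "performer": "foursquare_mapper", "muppet_type": "Mapper",
    "subscribes_to": ["FoursquareCheckin"],
    "publishes_to": ["FoursquareRetailerCheckin"] },
  { "performer": "foursquare_retailer", "muppet_type": "Updater",
    "subscribes_to": ["FoursquareRetailerCheckin"] }
]
JSON

# Index performers by the streams they subscribe to.
my %subscribers;
for my $p (@$config) {
    push @{ $subscribers{$_} }, $p->{performer} for @{ $p->{subscribes_to} };
}

# Hypothetical helper: list the performers reached from a stream, in order.
sub path_from {
    my ($stream) = @_;
    my @path;
    for my $name (@{ $subscribers{$stream} // [] }) {
        push @path, $name;
        my ($p) = grep { $_->{performer} eq $name } @$config;
        push @path, path_from($_) for @{ $p->{publishes_to} // [] };
    }
    return @path;
}

print join(" -> ", "FoursquareCheckin", path_from("FoursquareCheckin")), "\n";
```

The stream names are the only coupling between performers, which is what lets a Mapper or Updater be added to the graph without changing the others.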
20
Example results
21
Implementation
22
Implementation
• Slate management
– Slates are cached for performance
– Cache is sharded by key for load distribution across machines
– Slates are written to distributed key-value store for durability
• Event flow
– Event queues buffer transient load spikes within an application
– Host failover remaps load away from an unresponsive machine
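The slate-management bullets can be pictured with a small Perl sketch: a key is hashed to pick a worker, each worker keeps its own slate cache, and every write is copied to a key-value store. Everything here is a hypothetical stand-in (the %kv_store hash, the MD5-based worker_for function, the write-through policy, and the worker count of 4); the slides do not specify Muppet's hashing scheme or write policy.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Four workers, each with its own in-memory slate cache (assumed layout).
my @workers = map { +{ cache => {} } } 1 .. 4;
my %kv_store;    # stand-in for the durable distributed key-value store

# Shard by key: the same key always maps to the same worker.
sub worker_for {
    my ($key) = @_;
    return hex(substr(md5_hex($key), 0, 8)) % scalar(@workers);
}

# Read a slate from the owning worker's cache, falling back to the store.
sub get_slate {
    my ($key) = @_;
    my $cache = $workers[ worker_for($key) ]{cache};
    $cache->{$key} //= $kv_store{$key} ? { %{ $kv_store{$key} } } : {};
    return $cache->{$key};
}

# Write-through: update the cache and copy the slate to the store.
sub put_slate {
    my ($key, $slate) = @_;
    $workers[ worker_for($key) ]{cache}{$key} = $slate;
    $kv_store{$key} = { %$slate };
}

my $key = "Walmart.1288052100";
for (1 .. 3) {
    my $slate = get_slate($key);
    $slate->{count}++;
    put_slate($key, $slate);
}
print "stored count: $kv_store{$key}{count}\n";
```

Because the store holds a copy of each slate, a replacement worker could rebuild its cache from the store after a failure, at the cost of one store write per update in this simplified policy.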
23
Challenges
• Host failover
• Hotspots (uneven load)
• Parallelization
• Slate caching
• Overload stability
24
Hotspots
• Some key distributions are highly nonuniform (e.g., Zipfian)
– Keys based on natural-language word usage
– Keys drawn from a set whose members vary widely in popularity
• Mappers: Run any event anywhere.
• Updaters: Popular keys need access to the same slate.
– Split associative and commutative computations
    • Split computation parallelizes partial results.
    • Propagate partial results to final result.
– Reduce slate serialization/deserialization overhead
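Splitting works for counting because addition is associative and commutative: partial counts can be kept in separate slates and summed later in any order. The Perl sketch below shows the idea under stated assumptions. The "key#shard" naming, the round-robin shard choice, the split factor of 4, and the update_partial and merge_partials helpers are all hypothetical and are not taken from Muppet.

```perl
use strict;
use warnings;

my $splits = 4;    # hypothetical number of partial slates per hot key
my (%partial, %final);

# Spread updates for one hot key over several partial slates.
sub update_partial {
    my ($key, $seq) = @_;
    my $shard = $seq % $splits;
    $partial{"$key#$shard"} += 1;
}

# Propagate partial results to the final slate and reset the partials.
sub merge_partials {
    for my $pkey (keys %partial) {
        (my $key = $pkey) =~ s/#\d+\z//;
        $final{$key} += delete $partial{$pkey};
    }
}

update_partial("the", $_) for 1 .. 10;
my $partial_slates = keys %partial;
merge_partials();
print "partial slates: $partial_slates, final count: $final{the}\n";
```

The final count is the same as if all ten events had updated a single slate, but no single slate had to absorb every update.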
25
Usage
• Time
– Running since mid-2010
• Developers
– More than a dozen developers at WalmartLabs have used Muppet to develop their applications
• Data
– Billions of events, tens of millions of slates processed
26
Related work
• MapReduce: work toward incremental batch runs of MapReduce, rather than continuous event processing in a revised framework (e.g., MapUpdate)
– MapReduce Online (Condie et al.)
– Nova (Olston et al.)
• Event-flow systems: systems that focus on the dispatch of events, leaving application state and storage (cf. MapUpdate slates) as a problem for the application developer
– S4 (Neumeyer et al.)
– Storm (Marz et al.)
• Streaming-query systems: systems that run and optimize queries in a prescribed query language (contrast low-level, general-purpose MapUpdate operators)
– Aurora (StreamBase Systems) (Zdonik et al.)
– SPADE for System S (InfoSphere Streams) (Gedik et al.)
27
Conclusion
Big Data : MapReduce :: Fast Data : MapUpdate
Create soft-real-time applications on a simple programming model.
Distributed stream-processing infrastructure scales computation across cores.
28
Muppet
Scalable data-stream processing
Big Fast Data @WalmartLabs
29