Muppet
Scalable MapUpdate data-stream processing

Wang Lam, Lu Liu, STS Prasad, Anand Rajaraman, Zoheb Vacheri, AnHai Doan
@WalmartLabs

Road Map
• Motivation
• The MapUpdate framework
• An example data-stream computation
• Muppet implementation

The challenge
• Growing numbers of large, fast data streams
  – 300+ million Twitter status updates daily
  – 5+ million Foursquare checkins daily
  – 3+ billion Facebook Likes and comments daily
• Streams never stop
• Growing numbers of applications for data streams
  – Computations need to scale with the data
  – Applications need to stay up-to-date (“What’s going on now?”)
• Machines fail

The wish list
• Deliver low-latency processing
  – Application stays near real-time with its input stream
  – Computed data can be queried live
• Scale up on commodity hardware with computation and stream rate
• Easy to program
  – Simple model to enable rapid development of many applications
  – Ideally resemble widely adopted MapReduce

Data-stream computation
• Big data: MapReduce (Hadoop)
  – Map and Reduce steps
  – Batch process large input (e.g., from HDFS)
  – Hadoop distributes computation
• Fast data: MapUpdate (Muppet)
  – Map and Update steps
  – Continuously process streaming input (e.g., from network)
  – Muppet maintains computation and manages memory/storage

The MapReduce framework (Hadoop)
• Event
  – A <key, value> pair of data
• Map
  – A function that performs (stateless) computation on incoming events
• Reduce
  – A function that combines all input for a particular key
• Application
  – Map -> Reduce

The MapUpdate framework (Muppet)
• Event
  – A <key, value> pair of data
• Map
  – A function that performs (stateless) computation on incoming events
• Update
  – A function that updates a slate using incoming events
• Application
  – A directed graph of Mappers and Updaters

A MapUpdate application
[Diagram not reproduced]

An example Muppet application
Checkin counts on Foursquare
• Identify Foursquare checkins at various retailers
• Maintain a live count of retailer checkins
• Enable a display of the current counts at any time

An example Muppet application
Checkin counts on Foursquare
• Source: Read Foursquare stream and create key-value-pair events.
• Map: For each checkin event, identify a retailer and publish if found.
• Update: For each retailer checkin, increment appropriate count.
Updater slates hold live retailer checkin counts.

An example Muppet application
• Source: Read Foursquare stream and create key-value-pair events.
Input (excerpt):
    { "checkin": {
        "created": 1288052432,
        "venue": { "id": 453407, "name": "Walmart Neighborhood Market" } } }

An example Muppet application
• Source: Read Foursquare stream and create key-value-pair events.
Output:
    453407,
    { "checkin": {
        "created": 1288052432,
        "venue": { "id": 453407, "name": "Walmart Neighborhood Market" } } }

An example Muppet application
• Map: For each checkin event, identify a retailer and publish if found.
Input:
    453407,
    { "checkin": {
        "created": 1288052432,
        "venue": { "id": 453407, "name": "Walmart Neighborhood Market" } } }

An example Muppet application
• Map: For each checkin event, identify a retailer and publish if found.
Output:
    Walmart.1288052100,
    { "checkin": {
        "created": 1288052432,
        "venue": { "id": 453407, "name": "Walmart Neighborhood Market" } },
      "kosmix": { "timeslot": 1288052100, "interval": 900, "retailer": "Walmart" } }

An example Muppet application
• Update: For each retailer checkin, increment appropriate count.
Input:
    Walmart.1288052100,
    { "checkin": {
        "created": 1288052432,
        "venue": { "id": 453407, "name": "Walmart Neighborhood Market" } },
      "kosmix": { "timeslot": 1288052100, "interval": 900, "retailer": "Walmart" } }

An example Muppet application
• Update: For each retailer checkin, increment appropriate count.
Slate:
    Walmart.1288052100,
    { "retailer": "Walmart", "timeslot": 1288052100, "interval": 900, "count": 1 }

The Source (stream receiver)
    use JSON;    # provides decode_json

    # $sock, $self, and the two counters are assumed to be set up elsewhere.
    while ($checkin = <$sock>) {
        $checkin =~ s/^[^{]*//;        # strip anything before the JSON object
        next if ($checkin eq "");      # skip lines that carry no JSON
        $checkin_count++;
        my $event;
        eval { $event = decode_json($checkin); };
        if ($@ or (!defined($event->{checkin}))) {
            $invalid_count++;          # malformed JSON, or not a checkin
        } else {
            # Publish the whole event keyed by venue id. The event stays
            # wrapped in {checkin}, matching the Source output shown earlier
            # and the mapper, which reads $event->{checkin}.
            my $venue = $event->{checkin}->{venue}->{id};
            $self->publish("FoursquareCheckin", $event, $venue);
        }
    }

The Map (Foursquare::CheckinMapper)
    sub map {
        my $self = shift;
        my $event = shift;

        my $checkin = $event->{checkin};
        # Round the checkin time down to the start of its 900-second timeslot.
        my $timeslot = int($checkin->{created} / 900) * 900;
        $event->{kosmix}->{timeslot} = $timeslot;
        $event->{kosmix}->{interval} = 900;

        # Identify the retailer from the venue name; 0 means no match.
        my $venue_name = $checkin->{venue}->{name};
        my $retailer = 0;
        $retailer = 'ToysRUs'  if ($venue_name =~ /toys.*r.*us/i);
        $retailer = 'Walmart'  if ($venue_name =~ /wal.*mart/i);
        $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);

        if ($retailer) {
            $event->{kosmix}->{retailer} = $retailer;
            # Key the event by retailer and timeslot, e.g. Walmart.1288052100.
            $self->publish("FoursquareRetailerCheckin", $event,
                           $retailer.".".$timeslot);
        }
    }

The Update (Foursquare::RetailerUpdater)
    use Muppet::Updater;
    package Foursquare::RetailerUpdater;
    @ISA = qw( Muppet::Updater );
    use strict;

    sub update {
        my $self = shift;
        my $event = shift;
        my $slate = shift;
        my $config = shift;
        my $key = shift;

        # Copy the identifying fields into the slate and bump its count.
        $slate->{timeslot}  = $event->{kosmix}->{timeslot};
        $slate->{interval}  = $event->{kosmix}->{interval};
        $slate->{retailer}  = $event->{kosmix}->{retailer};
        $slate->{count}    += 1;
        return $slate;
    }
    1;

The application configuration (flow graph)
    {
      "performer" : "foursquare_mapper",
      "type" : "perl",
      "class" : "Foursquare::CheckinMapper",
      "muppet_type" : "Mapper",
      "subscribes_to" : [ "FoursquareCheckin" ],
      "publishes_to" : [ "FoursquareRetailerCheckin" ]
    },
    {
      "performer" : "foursquare_retailer",
      "type" : "perl",
      "class" : "Foursquare::RetailerUpdater",
      "muppet_type" : "Updater",
      "workers" : 4,
"slate_cache_max" : 10000, "slate_cache_write_after" : 1, "subscribes_to" : [ "FoursquareRetailerCheckin" ] } 20 Example results 21 Implementation 22 Implementation • Slate management – Slates are cached for performance – Cache is sharded by key for load distribution across machines – Slates are written to distributed key-value store for durability • Event flow – Event queues buffer transient load spikes within an application – Host failover remaps load away from an unresponsive machine 23 Challenges • Host failover • Hotspots (uneven load) • Parallelization • Slate caching • Overload stability 24 Hotspots • Some key distributions are highly nonuniform (e.g., Zipfian) – Keys based on natural-language word usage – Keys based on a set of varying popularity • Mappers: Run any event anywhere. • Updaters: Popular keys need access to the same slate. – Split associative and commutative computations • Split computation parallelizes partial results. • Propagate partial results to final result. – Reduce slate serialization/deserialization overhead 25 Usage • Time – Running since mid-2010 • Developers – More than a dozen developers at WalmartLabs have used Muppet to develop their applications • Data – Billions of events, tens of millions of slates processed 26 Related work • MapReduce work toward incremental batch runs of MapReduce, rather than continuous event processing in a revised framework (e.g., MapUpdate) – MapReduce Online (Condie et al.) – Nova (Olston et al.) • Event-flow systems systems that focus on the dispatch of events, leaving application state and storage (cf. MapUpdate slates) as a problem for the application developer – S4 (Neumeyer et al.) – Storm (Marz et al.) • Streaming-query systems systems that run and optimize queries in a prescribed query language (contrast low-level, general-purpose MapUpdate operators) – Aurora (StreamBase Systems) (Zdonik et al.) – SPADE for System S (InfoSphere Streams) (Gedik et al.) 

Conclusion
Big Data : MapReduce :: Fast Data : MapUpdate
Create soft-real-time applications on a simple programming model.
Distributed stream-processing infrastructure scales computation across cores.

Muppet
Scalable data-stream processing
Big Fast Data @WalmartLabs
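
The MapUpdate model described in this deck (stateless mappers, updaters that own one slate per key, and named streams connecting them) can be imitated in a few lines of plain Perl. The sketch below is a single-process toy under stated assumptions, not the Muppet API: the stream names, the %subscribers table, the flattened event fields, and the functions map_checkin and update_count are hypothetical, and real Muppet distributes this work across workers and persists slates.

```perl
use strict;
use warnings;

# Hypothetical in-memory MapUpdate loop (not the Muppet API).
my %slates;    # updater state: one slate per event key
my @queue;     # pending events, each stored as [stream, key, value]

# Mapper: stateless; republishes only checkins at a recognized retailer.
sub map_checkin {
    my ($key, $event) = @_;
    return unless $event->{name} =~ /wal.*mart/i;
    my $timeslot = int($event->{created} / 900) * 900;
    push @queue, ["RetailerCheckin", "Walmart.$timeslot", $event];
    return;
}

# Updater: folds each event into the slate stored under the event key.
sub update_count {
    my ($key, $event) = @_;
    $slates{$key}{count} += 1;
    return;
}

# Flow graph: each stream name maps to the function subscribed to it.
my %subscribers = (
    Checkin         => \&map_checkin,
    RetailerCheckin => \&update_count,
);

# Source: three sample checkins, keyed by venue id.
push @queue, ["Checkin", 453407, { created => 1288052432, name => "Walmart Neighborhood Market" }];
push @queue, ["Checkin", 17,     { created => 1288052500, name => "Corner Cafe" }];
push @queue, ["Checkin", 453407, { created => 1288052900, name => "Walmart Neighborhood Market" }];

# Dispatch events until the queue drains.
while (my $item = shift @queue) {
    my ($stream, $key, $event) = @$item;
    $subscribers{$stream}->($key, $event);
}

print "$_: $slates{$_}{count}\n" for sort keys %slates;
```

Both Walmart checkins fall in the 900-second timeslot starting at 1288052100, so they accumulate in a single slate, while the unmatched checkin is dropped by the mapper and never reaches the updater.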