Interpreting the Data: Parallel Analysis With Sawzall Steve Hookway 9/15/05

Motivation




Large amounts of (large, dynamic, unwieldy) data
Analyses on data can be expressed quite simply, or mapped to a series of simple calculations
Provide parallel processing without the user being involved
Two-phase process
  Evaluate each record individually
  Aggregate the results
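The two-phase process can be sketched in a few lines of Python. This is only an illustration of the idea, not Sawzall itself: the names evaluate and aggregate, and the word-count task, are assumptions made for the example.

```python
from collections import Counter

def evaluate(record: str) -> list[tuple[str, int]]:
    """Phase 1: look at one record in isolation and emit (key, value) pairs."""
    return [(word, 1) for word in record.split()]

def aggregate(emitted) -> Counter:
    """Phase 2: combine everything that was emitted, here by summing per key."""
    totals = Counter()
    for key, value in emitted:
        totals[key] += value
    return totals

records = ["sawzall runs per record", "aggregators sum per key"]

# Each record is evaluated independently; no record sees another record.
emitted = [pair for record in records for pair in evaluate(record)]
totals = aggregate(emitted)
print(totals["per"])
```

Because evaluate never looks at more than one record, the first phase can be spread over as many machines as there are records.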
Overview

Aggregators are constrained to be commutative and associative, so partial results can be combined in any order
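A small Python check makes the constraint concrete: when the combining operation is commutative and associative, shuffling the values and grouping them into arbitrary chunks gives the same answer as a single sequential pass. The chunk size and the use of addition are assumptions chosen for the illustration.

```python
import random
from functools import reduce
from operator import add

values = list(range(1, 101))

# Sequential aggregation over the original order.
sequential = reduce(add, values)

# Reorder the values (commutativity) and group them into arbitrary
# chunks that are reduced separately (associativity).
shuffled = values[:]
random.Random(0).shuffle(shuffled)
chunks = [shuffled[i:i + 7] for i in range(0, len(shuffled), 7)]
partials = [reduce(add, chunk) for chunk in chunks]
parallel = reduce(add, partials)

print(sequential == parallel)
```

An operation such as subtraction would fail this check, which is why it could not serve as an aggregator under these constraints.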

[Figure: data flows from the Sawzall query phase to the aggregation phase to the result (Query > Aggregation > Result)]
Putting the pieces together

Protocol Buffers
  Format of permanent records on disk
  DDL to generate code for accessing and assembling data
Google File System
  Allows for data to be spread in “chunks” across many machines
MapReduce
  Sawzall is built on top of MapReduce and runs in the map phase
  Output of the map phase is data items for the aggregators
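To show how the pieces fit, here is a minimal Python sketch in which chunks of records are processed independently in a map step and the emitted items are then combined. The chunk contents, the function names map_chunk and reduce_items, and the thread pool standing in for many machines are all assumptions for illustration; none of this is the MapReduce API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for file chunks stored on different machines.
chunks = [
    [b"GET /a", b"GET /b"],
    [b"POST /a"],
    [b"GET /a", b"POST /b", b"GET /c"],
]

def map_chunk(chunk):
    """Map phase: run the per-record logic over one chunk and emit data items."""
    emitted = []
    for record in chunk:
        method = record.split()[0].decode()
        emitted.append((method, 1))
    return emitted

def reduce_items(all_emitted):
    """Aggregation: sum the emitted values for each key across all chunks."""
    total = Counter()
    for emitted in all_emitted:
        for key, value in emitted:
            total[key] += value
    return total

# The thread pool plays the role of the machines that hold the chunks.
with ThreadPoolExecutor() as pool:
    per_chunk = list(pool.map(map_chunk, chunks))

counts = reduce_items(per_chunk)
print(dict(counts))
```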
System Model



Source code parsed at each machine
Output split into a set of files (allows parallel aggregation)
Runs one record at a time
  Arena allocator
  Keyword static – considered part of state for each record
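One common way to split output into a set of files so that aggregation can proceed in parallel is to route each emitted key to a file by hashing it. The sketch below assumes that approach; the slides do not say how the split is done, and NUM_OUTPUT_FILES and partition are hypothetical names.

```python
import zlib

NUM_OUTPUT_FILES = 4  # assumed value; the slides do not give a number

def partition(key: str, num_files: int = NUM_OUTPUT_FILES) -> int:
    """Pick an output file for a key; the same key always lands in the same file."""
    return zlib.crc32(key.encode("utf-8")) % num_files

emitted = [("us", 1), ("de", 1), ("us", 1), ("jp", 1), ("de", 1)]

# In-memory lists stand in for the output files.
files = {i: [] for i in range(NUM_OUTPUT_FILES)}
for key, value in emitted:
    files[partition(key)].append((key, value))

# Each file can be aggregated without looking at any other file,
# because all values for a given key live in exactly one file.
totals = {}
for items in files.values():
    for key, value in items:
        totals[key] = totals.get(key, 0) + value

print(totals)
```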
Sawzall


Typed, with conversions between types
proto imports the DDL, which defines the Sawzall tuple type that describes the input's layout

input: bytes = next_record();  # implicit
proto "some_record.proto"
r: Record = input;  # convert input to Record
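The conversion from raw bytes to a typed record can be mimicked in Python. The layout used here (a 4-byte address followed by a length-prefixed query string) is invented for the example and is not the Protocol Buffers wire format; Record, HEADER and to_record are hypothetical names.

```python
import struct
from dataclasses import dataclass

@dataclass
class Record:
    ip: int
    query: str

# Hypothetical layout: little-endian 4-byte ip, then a 2-byte query length.
HEADER = struct.Struct("<IH")

def to_record(raw: bytes) -> Record:
    """Convert untyped input bytes into a typed Record."""
    ip, length = HEADER.unpack_from(raw, 0)
    start = HEADER.size
    query = raw[start:start + length].decode("utf-8")
    return Record(ip=ip, query=query)

# Build one raw record and convert it, as "r: Record = input;" does in spirit.
raw = HEADER.pack(0x7F000001, 7) + b"sawzall"
r = to_record(raw)
print(r)
```

In Sawzall the layout comes from the imported DDL, so the program never spells out offsets the way this sketch does.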
Sawzall Aggregation





emit: sends data to an external aggregator
Drawing a line between filtering and aggregating enables a high degree of parallelism
Aggregator types: Collection, Sample, Sum, Maximum, Quantile, Top, Unique
Possible to process data as part of the mapping phase (e.g. sum)
Possible to index aggregators
  Creates a distinct aggregator for each unique value of the index
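The aggregator idea, including indexing, can be approximated in Python as below. These classes are simplified stand-ins written for this example: the real aggregators take parameters and some are approximate, whereas this Unique keeps an exact set. The class names and the sample data are assumptions.

```python
from collections import defaultdict

class Sum:
    def __init__(self):
        self.value = 0
    def emit(self, x):
        self.value += x

class Maximum:
    def __init__(self):
        self.value = None
    def emit(self, x):
        if self.value is None or x > self.value:
            self.value = x

class Unique:
    """Counts distinct items exactly (a simplification for the sketch)."""
    def __init__(self):
        self.seen = set()
    def emit(self, x):
        self.seen.add(x)
    @property
    def value(self):
        return len(self.seen)

class Indexed:
    """Creates a distinct aggregator for each unique index value."""
    def __init__(self, factory):
        self.tables = defaultdict(factory)
    def emit(self, index, x):
        self.tables[index].emit(x)

bytes_per_host = Indexed(Sum)
largest = Maximum()
users = Unique()
for host, size, user in [("a.example", 120, "u1"),
                         ("b.example", 40, "u2"),
                         ("a.example", 80, "u1")]:
    bytes_per_host.emit(host, size)
    largest.emit(size)
    users.emit(user)

print({host: table.value for host, table in bytes_per_host.tables.items()})
```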
Sawzall Example
proto "querylog.proto"
queries_per_degree: table sum[lat: int][lon: int] of int;
log_record: QueryLogProto = input;
loc: Location = locationinfo(log_record.ip);
emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;
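For readers who do not know Sawzall, the same computation might look roughly like this in Python. The geolocation table _FAKE_GEO and the dictionary-shaped log records are invented for the example; the real locationinfo function and QueryLogProto type are not shown in the slides. Note that Python's int() truncates toward zero, which may or may not match Sawzall's conversion.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Location:
    lat: float
    lon: float

# Hypothetical lookup table standing in for locationinfo().
_FAKE_GEO = {
    "10.0.0.1": Location(37.4, -122.1),
    "10.0.0.2": Location(37.9, -122.6),
    "10.0.0.3": Location(51.5, -0.1),
}

def locationinfo(ip: str) -> Location:
    return _FAKE_GEO[ip]

# Plays the role of "table sum[lat: int][lon: int] of int".
queries_per_degree = defaultdict(int)

def process(log_record: dict) -> None:
    """Per-record logic: find the location and emit 1 to the indexed sum."""
    loc = locationinfo(log_record["ip"])
    queries_per_degree[(int(loc.lat), int(loc.lon))] += 1

for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
    process({"ip": ip})

print(dict(queries_per_degree))
```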
Sawzall the Language


Statically typed for dependability
when() for logical quantifiers
  when (i: some int; B(a[i])) F(i);
def() to test for undefined values
Captures aggregators in the language (advantage over MapReduce)
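The quantifier and the undefined-value test can be imitated in Python to show the intent. This is a sketch under assumptions: when_some picks the first matching index, although any matching index would satisfy "some"; when_each reflects the each variant as understood here; and UNDEFINED, is_def and safe_div are hypothetical names, not Sawzall functions.

```python
UNDEFINED = object()  # hypothetical marker for an undefined value

def is_def(value) -> bool:
    """Rough analogue of def(): true when the value is defined."""
    return value is not UNDEFINED

def safe_div(x, y):
    """Division that yields an undefined value instead of raising."""
    return x / y if y != 0 else UNDEFINED

def when_some(a, B, F):
    """Run F(i) for one index i where B(a[i]) holds, if there is one."""
    for i, item in enumerate(a):
        if B(item):
            F(i)
            return

def when_each(a, B, F):
    """Run F(i) for every index i where B(a[i]) holds."""
    for i, item in enumerate(a):
        if B(item):
            F(i)

a = [3, 8, 5, 12]
hits_some, hits_each = [], []
when_some(a, lambda x: x > 4, hits_some.append)
when_each(a, lambda x: x > 4, hits_each.append)

ratio = safe_div(10, 0)
print(hits_some, hits_each, is_def(ratio))
```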
Performance

Interpreted Language



Limited by I/O
Still slower than Java
Scales up


At 600 machines, about 3.2 GB/s of raw input in aggregate
Each additional machine adds about 0.98 machine's worth of throughput
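One way to read the 0.98 figure is as a marginal contribution: each machine beyond the first adds 0.98 of a machine's throughput. The sketch below works out what that reading implies; the linear model and the function name relative_throughput are assumptions, not something the slides state.

```python
MARGINAL_EFFICIENCY = 0.98  # figure taken from the slide

def relative_throughput(machines: int,
                        efficiency: float = MARGINAL_EFFICIENCY) -> float:
    """Throughput relative to one machine, assuming every added machine
    contributes `efficiency` of a machine (a simple linear model)."""
    if machines < 1:
        raise ValueError("need at least one machine")
    return 1 + efficiency * (machines - 1)

for n in (1, 50, 600):
    modelled = relative_throughput(n)
    # Compare against perfect scaling, where n machines give n times the work.
    print(n, round(modelled, 2), round(modelled / n, 4))
```

Under this model the shortfall from perfect scaling stays close to 2 percent however many machines are added.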