Interpreting the Data: Parallel Analysis With Sawzall
Steve Hookway
9/15/05

Motivation
- Large amounts of (large, dynamic, unwieldy) data
- Analyses on data can be expressed quite simply, or mapped to a series of simple calculations
- Provide parallel processing without the user being involved
- Two-phase process
  - Evaluate each record individually
  - Aggregate the results

Overview
- Constraints of commutativity and associativity
- Flow: Query (Sawzall) > Aggregation > Result

Putting the pieces together
- Protocol Buffers
  - Format of permanent records on disk
  - DDL to generate code for accessing and assembling data
- Google File System
  - Allows data to be spread in "chunks" across many machines
- MapReduce
  - Sawzall is built on top of MapReduce and runs in the map phase
  - Output of the map phase is data items for the aggregators

System Model
- Source code parsed at each machine
- Output split into a set of files (allows parallel aggregation)
- Runs one record at a time
- Arena allocator
- Keyword static: considered part of the state for each record

Sawzall
- Typed: has conversions between types
- proto imports the DDL, which defines the Sawzall tuple type that describes the input's layout

    input: bytes = next_record();  # implicit
    proto "some_record.proto"
    r: Record = input;             # convert input to Record

Sawzall Aggregation
- emit: sends data to an external aggregator
- Drawing a line between filtering and aggregating enables a high degree of parallelism
- Aggregators: Collection, Sample, Sum, Maximum, Quantile, Top, Unique
- Possible to process data as part of the mapping phase (e.g. sum)
- Possible to index aggregators
  - Creates a distinct aggregator for each unique value of the index

Sawzall Example

    proto "querylog.proto"
    queries_per_degree: table sum[lat: int][lon: int] of int;
    log_record: QueryLogProto = input;
    loc: Location = locationinfo(log_record.ip);
    emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;

Sawzall the Language
- Statically typed for dependability
- when() for logical quantifiers

    when (i: some int; B(a[i])) F(i);

- def() for undefined values
- Captures aggregators in the language (advantage over MapReduce)

Performance
- Interpreted language
- Limited by I/O
- Still slower than Java
- Scales up
  - At 600 machines: 3.2 GB/s of raw input in aggregate
  - Each additional machine adds 0.98 machines' worth of throughput
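
Note: the two-phase model from the Motivation and Sawzall Example slides (evaluate each record on its own, then aggregate) can be sketched roughly in Python. This is an illustration under stated assumptions, not Sawzall and not how the real system is implemented. The names FAKE_LOCATIONS, process_record, run_shard, and merge are hypothetical, and the IP-to-location table is made-up test data standing in for locationinfo(). Python's int() truncates toward zero; the sketch does not claim that Sawzall's int() conversion behaves the same way.

```python
from collections import Counter
from functools import reduce

# Hypothetical stand-in for locationinfo(): maps an IP string to (lat, lon).
# These entries are made-up test data, not real lookups.
FAKE_LOCATIONS = {
    "10.0.0.1": (37.4, -122.1),
    "10.0.0.2": (37.9, -122.6),
    "10.0.0.3": (40.7, -74.0),
}

def process_record(ip):
    """Phase 1: evaluate one record in isolation and yield (index, value) emits."""
    lat, lon = FAKE_LOCATIONS[ip]
    # Mirrors: emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1
    yield (int(lat), int(lon)), 1

def run_shard(records):
    """Run phase 1 over one machine's records, summing the emits locally."""
    partial = Counter()
    for ip in records:
        for key, value in process_record(ip):
            partial[key] += value
    return partial

def merge(a, b):
    """Phase 2: combine two partial sum tables.

    Addition is commutative and associative, so partial tables can be
    merged in any order or grouping and give the same result.
    """
    return a + b

# Three shards, as if the input were spread across three machines.
shards = [
    ["10.0.0.1", "10.0.0.2"],
    ["10.0.0.3", "10.0.0.1"],
    ["10.0.0.2"],
]
partials = [run_shard(s) for s in shards]

# Merge in two different orders to show the order does not matter.
forward = reduce(merge, partials, Counter())
backward = reduce(merge, reversed(partials), Counter())

for (lat, lon), count in sorted(forward.items()):
    print(f"queries_per_degree[{lat}][{lon}] = {count}")
```

Because each record is processed without reference to any other, the shards can run on separate machines. Because the sum table's merge is commutative and associative, the partial results can be combined in whatever order they arrive.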