Interpreting the Data: Parallel Analysis With Sawzall
Steve Hookway
9/15/05

Motivation
- Large amounts of (large, dynamic, unwieldy) data
- Analyses on the data can be expressed quite simply, or mapped to a series of simple calculations
- Provide parallel processing without the user being involved
- Two-phase process
  - Evaluate each record individually
  - Aggregate the results

Overview
- Query: Sawzall
- Aggregation: constraints of commutativity and associativity
- Query > Aggregation > Result

Putting the pieces together
- Protocol Buffers
  - Format of permanent records on disk
  - DDL to generate code for accessing and assembling data
- Google File System
  - Allows data to be spread in "chunks" across many machines
- MapReduce
  - Sawzall is built on top of MapReduce and runs in the map phase
  - Output of the map phase is data items for the aggregators

System Model
- Source code is parsed at each machine
- Output is split into a set of files (allows parallel aggregation)
- Runs one record at a time
  - Arena allocator
  - Keyword static: considered part of the initial state for each record

Sawzall
- Typed, with conversions between types
- proto imports the DDL, which defines the Sawzall tuple type that describes the input's layout

    input: bytes = next_record();  # implicit
    proto "some_record.proto"
    r: Record = input;             # convert input to Record

Sawzall Aggregation
- emit sends data to an external aggregator
- Drawing the line between filtering and aggregating enables a high degree of parallelism
- Aggregators: Collection, Sample, Sum, Maximum, Quantile, Top, Unique
- Possible to process data as part of the mapping phase (e.g. sum)
- Possible to index aggregators
  - Creates a distinct aggregator for each unique value of the index

Sawzall Example

    proto "querylog.proto"
    queries_per_degree: table sum[lat: int][lon: int] of int;
    log_record: QueryLogProto = input;
    loc: Location = locationinfo(log_record.ip);
    emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;

Sawzall the Language
- Statically typed for dependability
- when() for logical quantifiers
  - when(i: some int; B(a[i])) F(i);
- def() for undefined values
- Captures aggregators in the language (advantage over MapReduce)

Performance
- Interpreted language
  - Limited by I/O
  - Still slower than Java
- Scales up
  - At 600 machines: about 3.2 GB/s of raw input in aggregate
  - Each additional machine adds about 0.98 machines' worth of throughput
