Provenance for Generalized Map and Reduce Workflows Robert Ikeda, Hyunjung Park, Jennifer Widom Stanford University Pei Zhang Yue Lu Provenance Where data came from How it was derived, manipulated, combined, processed, … How it has evolved over time Uses: Explanation Debugging and verification Recomputation Robert Ikeda 2 The Panda Environment I1 … In O Data-oriented workflows Graph of processing nodes Data sets on edges Statically-defined; batch execution; acyclic Robert Ikeda 3 Provenance Twitter Posts Movie Sentiments Backward tracing Find the input subsets that contributed to a given output element Forward tracing Determine which output elements were derived from a particular input element Robert Ikeda 4 Provenance Basic idea Capture provenance one node at a time (lazy or eager) Use it for backward and forward tracing Handle processing nodes of all types Robert Ikeda 5 Generalized Map and Reduce Workflows R M R M M What if every node were a Map or Reduce function? Provenance easier to define, capture, and exploit than in the general case Transparent provenance capture in Hadoop Doesn’t interfere with parallelism or fault-tolerance Robert Ikeda 6 Remainder of Talk Defining Map and Reduce provenance Recursive workflow provenance Capturing and tracing provenance System description and performance Future work Robert Ikeda 7 Remainder of Talk Defining Map and Reduce provenance Recursive workflow provenance Surprising theoretical result Capturing and tracing provenance System description and performance Implementation details Future work Robert Ikeda 8 Example Tweets TweetScan Good Movies TM AM Aggregate Diggs DM Bad Movies DiggScan Robert Ikeda Filter 9 Transformation Properties Deterministic Functions. Multiplicity for Map Functions Multiplicity for Reduce Functions Monotonicity Robert Ikeda 10 Map and Reduce Provenance Map functions M(I) = UiI (M({i})) Provenance of oO is iI such that oM({i}) Reduce functions R(I) = U1≤ k ≤ n(R(Ik)) I1,…,In partition I on reduce-key Provenance of oO is Ik I such that oR(Ik) Robert Ikeda 11 Workflow Provenance I*1 I1 … … I*n In R E1 M R M M oO O E2 Intuitive recursive definition Desirable “replay” property o W(I*1,…, I*n) Usually holds, but not always Robert Ikeda 12 Replay Property Example Twitter Posts M R R TweetScan Summarize Count Inferred Movie Ratings Rating Medians #Movies Per Rating “Avatar was great” Movie Rating “I hated Twilight” Avatar 8 “Twilight was pretty bad” Twilight 0 Twilight 2 “I enjoyed Avatar” Avatar 7 “I loved Twilight” Twilight 7 Avatar 4 “Avatar was okay” Robert Ikeda 13 Movie Median Median #Movies Avatar 7 2 1 Twilight 2 7 1 Replay Property Example Twitter Posts M R R TweetScan Summarize Count Inferred Movie Ratings Rating Medians #Movies Per Rating “Avatar was great” Movie Rating “I hated Twilight” Avatar 8 “Twilight was pretty bad” Twilight 0 Twilight 2 “I enjoyed Avatar” Avatar 7 “I loved Twilight” Twilight 7 Avatar 4 “Avatar was okay” Robert Ikeda 14 Movie Median Median #Movies Avatar 7 2 1 Twilight 2 7 1 Replay Property Example Twitter Posts M R R TweetScan Summarize Count Inferred Movie Ratings Rating Medians #Movies Per Rating “Avatar was great” Movie Rating “I hated Twilight” Avatar 8 “Twilight was pretty bad” Twilight 0 Twilight 2 Avatar 7 Twilight 7 Avatar 4 “I enjoyed Avatar And Twilight too” “Avatar was okay” Robert Ikeda 15 Movie Median Median #Movies Avatar 7 2 1 Twilight 2 7 1 Replay Property Example Twitter Posts M R R TweetScan Summarize Count Inferred Movie Ratings Rating Medians #Movies Per Rating “Avatar was great” Movie Rating “I hated Twilight” Avatar 8 “Twilight was pretty bad” Twilight 0 Twilight 2 Avatar 7 Twilight 7 Avatar 4 “I enjoyed Avatar And Twilight too” “Avatar was okay” Robert Ikeda 16 Movie Median Median #Movies Avatar 7 2 1 Twilight 2 7 1 Replay Property Example Twitter Posts M R R TweetScan Summarize Count Inferred Movie Ratings Rating Medians Nonmonotonic Reduce Function Movie Rating “Avatar was great” One-Many “I hated Twilight” Avatar 8 “Twilight was pretty bad” Twilight 0 Twilight 2 Avatar 7 Twilight 7 Avatar 4 “I enjoyed Avatar And Twilight too” “Avatar was okay” Robert Ikeda 17 #Movies Per Rating Nonmonotonic Reduce Movie Median Median #Movies Avatar 7 2 1 Twilight 7 2 7 12 Capturing and Tracing Provenance Map functions Add the input ID to each of the output elements Reduce functions Add the input reduce-key to each of the output elements Tracing Straightforward recursive algorithms Robert Ikeda 18 RAMP System Built as an extension to Hadoop Supports MapReduce Workflows Each node is a MapReduce job Provenance capture is transparent Retaining Hadoop’s parallel execution and fault tolerance Users need not be aware of provenance capture Wrapping is automatic RAMP stores provenance separately from the input and output data Robert Ikeda 19 RAMP System: Provenance Capture Hadoop components Record-reader Mapper Combiner (optional) Reducer Record-writer Robert Ikeda 20 RAMP System: Provenance Capture Input Input Wrapper RecordReader RecordReader (ki, vi) (ki, vi) p (ki, 〈vi, p〉) Wrapper (ki, vi) Mapper Mapper p (km, vm) (km, vm) (km, 〈vm, p〉) Robert Ikeda Map Output 21 Map Output RAMP System: Provenance Capture Map Output Map Output (km, [〈vm1, p1〉,…, 〈vmn, pn〉]) (km, [vm1,…,vmn]) Wrapper (km, [vm1,…,vmn]) Reducer Reducer (ko, vo) (ko, vo) (ko, 〈vo, kmID〉) Wrapper (ko, vo) RecordWriter RecordWriter q (kmID, pj) (q, kmID) Robert Ikeda Output 22 Output Provenance Experiments 51 large EC2 instances (Thank you, Amazon!) Two MapReduce “workflows” Wordcount • Many-one with large fan-in • Input sizes: 100, 300, 500 GB Terasort • One-one • Input sizes: 93, 279, 466 GB Robert Ikeda 23 Results: Wordcount Time Overhead Space (no provenance) 1500 5000 1250 4000 1000 Execution time Map finish time Avg. reduce task time Robert Ikeda 500G 300G 100G 500G 100G 500G 300G 100G 500G 300G 0 100G 0 500G 250 300G 1000 300G 500 100G 2000 Space Overhead 750 500G 3000 300G Size (GB) 6000 100G Time (seconds) Time (no provenance) Map output data Intermediate Output data size size data size 24 Results: Terasort Space (no provenance) Time Overhead 600 1500 500 1200 400 Size (GB) 1800 900 600 300 Space Overhead 300 200 100 0 466G 279G 93G 466G 279G 93G 466G 279G 0 93G Time (seconds) Time (no provenance) Execution time Map finish time Avg. reduce task time Robert Ikeda 25 93G279G 466G 93G279G 466G 93G279G 466G Map output data size Intermediate data size Output data size Summary of Results Overhead of provenance capture Terasort • 20% time overhead, 21% space overhead Wordcount • 76% time overhead, space overhead depends directly on fan-in Backward-tracing Terasort • 1.5 seconds for one element Wordcount • Time directly dependent on fan-in Robert Ikeda 26 Future Work RAMP Selective provenance capture More efficient backward and forward tracing Indexing General Incorporating SQL processing nodes Robert Ikeda 27 PANDA A System for Provenance and Data “stanford panda”