SubZero: A Fine-Grained Lineage System for Scientific Databases

Eugene Wu, Samuel Madden, Michael Stonebraker
CSAIL, MIT
{sirrice, madden, stonebraker}@csail.mit.edu

Abstract: Data lineage is a key component of provenance that helps scientists track and query relationships between input and output data. While current systems readily support lineage relationships at the file or data array level, finer-grained support at an array-cell level is impractical due to the lack of support for user defined operators and the high runtime and storage overhead to store such lineage.

We interviewed scientists in several domains to identify a set of common semantics that can be leveraged to efficiently store fine-grained lineage. We use the insights to define lineage representations that efficiently capture common locality properties in the lineage data, and a set of APIs so operator developers can easily export lineage information from user defined operators. Finally, we introduce two benchmarks derived from astronomy and genomics, and show that our techniques can reduce lineage query costs by up to 10× while incurring substantially less impact on workflow runtime and storage.

I. INTRODUCTION

Many scientific applications are naturally expressed as a workflow that comprises a sequence of operations applied to raw input data to produce an output dataset or visualization. Like database queries, such workflows can be quite complex, consisting of up to hundreds of operations [1] whose parameters or inputs vary from one run to another.

Scientists record and query provenance – metadata that describes the processes, environment and relationships between input and output data arrays – to ascertain data quality, audit and debug workflows, and more generally understand how the output data came to be. A key component of provenance, data lineage, identifies how input data elements are related to output data elements and is integral to debugging workflows. For example, scientists need to be able to work backward from the output to identify the sources of an error given erroneous or suspicious output results. Once the source of the error is identified, the scientist will then often want to identify derived downstream data elements that depend on the erroneous value so he can inspect and possibly correct those outputs.

In this paper, we describe the design of a fine-grained lineage tracking and querying system for array-oriented scientific workflows. We assume a data and execution model similar to SciDB [2]. We chose this because it provides a closed execution environment that can capture all of the lineage information, and because it is specifically designed for scientific data processing (scientists typically use RDBMSes to manage metadata and do data processing outside of the database). The system allows scientists to perform exploratory workflow debugging by executing a series of data lineage queries that walk backward to identify the specific cells in the input arrays on which a given output cell depends and that walk forward to find the output cells that a particular input cell influenced. Such a system must manage input to output relationships at a fine-grained array-cell level.

Prior work in data lineage tracking systems has largely been limited to coarse-grained metadata tracking [3], [4], which stores relationships at the file or relational table level. Fine-grained lineage tracks relationships at the array cell or tuple level. The typical approach, popularized by Trio [5], which we call cell-level lineage, eagerly materializes the identifiers of the input data records (e.g., tuples or array cells) that each output record depends on, and uses them to directly answer backward lineage queries. An alternative, which we call black-box lineage, simply records the input and output datasets and runtime parameters of each operator as it is executed, and materializes the lineage at lineage query time by re-running relevant operators in a tracing mode.

Unfortunately, both techniques are insufficient in scientific applications for two reasons. First, scientific applications make heavy use of user defined functions (UDFs), whose semantics are opaque to the lineage system. Existing approaches conservatively assume that every output cell of a UDF depends on every input cell, which limits the utility of a fine-grained lineage system because it tracks a large amount of information without providing any insight into which inputs actually contributed to a given output. This necessitates proper APIs so that UDF designers can expose fine-grained lineage information and operator semantics to the lineage system.

Second, neither black-box only nor cell-level only techniques are sufficient for many applications. Scientific workflows consume data arrays that regularly contain millions of cells, while generating complex relationships between groups of input and output cells. Storing cell-level lineage can avoid re-running some computationally intensive operators (e.g., an image processing operator that detects a small number of stars in telescope imagery), but needs enormous amounts of storage if every output depends on every input (e.g., a matrix sum operation) – in that case it may be preferable to recompute the lineage at query time. In addition, applications such as LSST¹ are often subject to limitations that only allow them to dedicate a small percentage of storage to lineage operations. Ideally, lineage systems would support a hybrid of the two approaches and take user constraints into account when deciding which operators to store lineage for.

¹ http://lsst.org

This paper seeks to address both challenges. We interviewed scientists from several domains to understand their data processing workflows and lineage needs and used the results to design a science-oriented data lineage system. We introduce Region Lineage, which exploits locality properties prevalent in the scientific operators we encountered. It addresses common relationships between regions of input and output cells by storing grouped or summary information rather than individual pairs of input and output cells. We developed a lineage API that supports black-box lineage as well as Region Lineage, which subsumes cell-level lineage. Programmers can also specify forward/backward Mapping Functions for an operator to directly compute the forward/backward lineage solely from input/output cell coordinates and operator arguments; we implemented these for many common matrix and statistical functions. We also developed a hybrid lineage storage system that allows users to explicitly trade off storage space for lineage query performance using an optimization framework. Finally, we introduce two end-to-end scientific lineage benchmarks.

As mentioned earlier, the system prototype, SubZero, is implemented in the context of the SciDB model. SciDB stores multi-dimensional arrays and executes database queries composed of built-in and user-defined operators (UDFs) that are compiled into workflows. Given a set of user-specified storage constraints, SubZero uses an optimization framework to choose the optimal type of lineage (black box, or one of several new types we propose) for each SciDB operator that minimizes lineage query costs while respecting user storage constraints.

Our contributions include:

1) The notion of region lineage, which SubZero uses to efficiently store and query lineage data from scientific applications. We also introduce several efficient representations and encoding schemes that each have different overhead and query performance trade-offs.
2) A lineage API that operator developers can use to expose lineage from user defined operators, including the specification of mapping functions for many of the built-in SciDB operators.
3) A unified storage model for mapping functions, region and cell-level lineage, and black-box lineage.
4) An optimization framework which picks an optimal mixture of black-box and region lineage to maximize query performance within user defined constraints.
5) A performance evaluation of our approach on end-to-end astronomy and genomics benchmarks. The astronomy benchmark, which is computationally intensive but exhibits high locality, benefits from efficient representations. Compared to cell-level and black-box lineage, SubZero reduces storage overhead by nearly 70× and speeds query performance by almost 255×. The genomics benchmark highlights the need for, and benefits of, using an optimizer to pick the storage layout, which improves query performance by 2–3× while staying within user constraints.

The next section describes our motivating use cases in more detail. It is followed by a high level system architecture and details of the rest of the system.

II. USE CASES

We developed two benchmark applications after discussions with environmental scientists, astronomers, and geneticists. The first is an image processing benchmark developed with scientists at the Large Synoptic Survey Telescope (LSST) project. It is very similar to environmental science requirements, so they are combined together. The second was developed with geneticists at the Broad Institute². Each benchmark consists of a workflow description, a dataset, and lineage queries. We used the benchmarks to design the optimizations described in the paper. This section will briefly describe each benchmark's scientific application, the types of desired lineage queries, and application-specific insights.

² http://www.broadinstitute.org/

A. Astronomy

The Large Synoptic Survey Telescope (LSST) is a wide angle telescope slated to begin operation in Fall 2015. A key challenge in processing telescope images is filtering out high energy particles (cosmic rays) that create abnormally bright pixels in the resulting image, which can be mistaken for stars. The telescope compensates by taking two consecutive pictures of the same piece of the sky and removing the cosmic rays in software. The LSST image processing workflow (Figure 1) takes two images as input and outputs an annotated image that labels each pixel with the celestial body it belongs to. It first cleans and detects cosmic rays in each image separately, then creates a single composite, cosmic-ray-free image that is used to detect celestial bodies. There are 22 SciDB built-in operators (blue solid boxes) that perform common matrix operations, such as convolution, and four UDFs (red dotted boxes labeled A-D). The UDFs A and B output cosmic-ray masks for each of the images. After the images are subsequently merged, C removes cosmic rays from the composite image, and D detects stars from the cleaned image.

Fig. 1. Summary diagram of LSST workflow. Each solid rectangle is a SciDB native operator while the red dotted rectangles are UDFs.

The LSST scientists are interested in three types of queries. The first picks a star in the output image and traces the lineage back to the initial input image to detect bad input pixels. The latter two queries select a region of output (or input) pixels and trace the pixels backward (or forward) through a subset of the workflow to identify a single faulty operator. As an example, suppose the operator that computes the mean brightness of the image generated an anomalously high value due to a few bad pixels, which led to further miscalculations. The astronomer might work backward from those calculations, identify the input pixels that contributed to them, and filter out those pixels that appear excessively bright.

Both the LSST and environmental scientists described workloads where the majority of the data processing code computes output pixels using input pixels within a small distance from the corresponding coordinate of the output pixel. These regions may be constant, pre-defined values, or easily computed from a small amount of additional metadata. For example, a pixel in the mask produced by cosmic ray detection (CRD) is set if the related input pixel is a cosmic ray, and depends on neighboring input cells within 3 pixels. Otherwise, it only depends on the related input pixel. They also felt that it is sufficient for lineage queries to return a superset of the exact lineage. Although we do not take advantage of this insight, this suggests future work in lossy compression techniques.

B. Genomics Prediction

We have also been working with researchers at the Broad Institute on a genomics benchmark related to predicting recurrences of medulloblastoma in patients. Medulloblastoma is a form of cancer that spawns brain tumors that spread through the cerebrospinal fluid. Pablo et al. [6] have identified a set of patient features that help predict relapse in medulloblastoma patients that have been treated. The features include histology, gene expression levels, and the existence of genetic abnormalities. The workflow (Figure 2) is a two-step process that first takes a training patient-feature matrix and outputs a Bayesian model. Then it uses the model to predict relapse in a test patient-feature matrix. The model computes how much each feature value contributes to the likelihood of patient relapse. The ten built-in operators (solid blue boxes) are simple matrix transformations. The remaining UDFs extract a subset of the input arrays (E, G), compute the model (F), and predict the relapse probability (H).

Fig. 2. Simplified diagram of genomics workflow. Each solid rectangle is a SciDB native operator while the red dotted rectangles are UDFs.

The model is designed to be used by clinicians through a visualization that generates lineage queries. The first query picks a relapse prediction and traces its lineage back to the training matrix to find supporting input data. The second query picks a feature from the model and traces it back to the training matrix to find the contributing input values. The third query points at a set of training values and traces them forward to the model, while the last query traces them to the end of the workflow to find the predictions they affected.

The genomics benchmark can devote up-front storage and runtime overhead to ensure fast query execution because it is an interactive visualization. Although this is application specific, it suggests that scientific applications have a wide range of storage and runtime overhead constraints.

III. ARCHITECTURE

SubZero records and stores lineage data at workflow runtime and uses it to efficiently execute lineage queries. The input to SubZero is a workflow specification (the graph in Workflow Executor), constraints on the amount of storage that can be devoted to lineage tracking, and a sample lineage query workload that the user expects to run. SubZero optimally decides the type of lineage that each operator in the workflow will generate (the lineage strategy) in order to maximize the performance of the query workload.

Fig. 3. The SubZero architecture.

Figure 3 shows the system architecture. The solid and dashed arrows indicate the control and data flow, respectively. Users interact with SubZero by defining and executing workflows (Workflow Executor), specifying constraints to the Optimizer, and running lineage queries (Query Executor). The operators in the workflow specify a list of the types of lineage (described in Section V) that each operator can generate, which defines the set of optimization possibilities.

Each operator initially generates black-box lineage (i.e., just records the names of the inputs it processes) but over time changes its strategy through optimization. As operators process data, they send lineage to the Runtime, which uses the Encoder to serialize the lineage before writing it to Operator Specific Datastores. The Runtime may also send lineage and other statistics to the Optimizer, which calculates statistics such as the amount of lineage that each operator generates. SubZero periodically runs the Optimizer, which uses an Integer Programming Solver to compute the new lineage strategy. On the right side, the Query Executor compiles lineage queries into query plans that join the query with lineage data. The Executor requests lineage from the Runtime, which reads and decodes stored lineage, uses the Re-executor to re-run the operators, and sends statistics (e.g., query fanout and fanin) to the optimizer to refine future optimizations.

Given this overview, we now describe the data model and structure of lineage queries (Section IV), the different types of lineage the system can record (Section V), the functionality of the Runtime, Encoder, and Query Executor (Section VI), and finally the optimizer in Section VII.

IV. DATA, LINEAGE AND QUERY MODEL

In this section, we describe the representation and notation of lineage data and queries in SubZero.

SubZero is designed to work with a workflow executor system that applies a fixed sequence of operators to some set of inputs. Each operator operates on one or more input objects (e.g., tables or arrays), and produces a single output object. Formally, we say an operator $P$ takes as input $n$ objects, $I_P^1, \ldots, I_P^n$, and outputs a single object, $O_P$.

Multiple operators are composed together to form a workflow, described by a workflow specification, which is a directed acyclic graph $W = (N, E)$, where $N$ is the set of operators, and $e = (O_P, I_{P'}^i) \in E$ specifies that the output of $P$ forms the $i$'th input to the operator $P'$. An instance of $W$, $W_j$, executes the workflow on a specific dataset. Each operator runs when all of its inputs are available.

The data follows the SciDB data model, which processes multi-dimensional arrays. A combination of values along each dimension, termed a coordinate, uniquely identifies a cell. Each cell in an array has the same schema, and consists of one or more named, typed fields. SciDB is "no overwrite," meaning that intermediate results produced as the output of an operator are always stored persistently, and each update to an object creates a new, persistent version. SubZero stores lineage information with each version to speed up lineage queries.

Our notion of backward lineage is defined as a subset of the inputs that will reproduce the same output value if the operator is re-run on its lineage. For example, the lineage of an output cell of Matrix Multiply consists of all cells of the corresponding row and column in the input arrays – even if some are empty. Forward lineage is defined as a subset, $C$, of the outputs such that the backward lineage of $C$ contains the input cells. The exact semantics for UDFs are ultimately controlled by the developer.

SubZero supports three types of lineage: black-box, cell-level, and region lineage. As a workflow executes, lineage is generated on an operator-by-operator basis, depending on the types of lineage that each operator is instrumented to support and the materialization decisions made by the optimizer. We have instrumented SciDB's built-in operators to generate lineage mappings from inputs to outputs and provide an API for UDF designers to expose these relationships. If the API is not used, then SubZero assumes an all-to-all relationship between the cells of the input arrays and cells of the output array.

a) Black-box lineage: SubZero does not require additional resources to store black-box lineage because, like SciDB, our workflow executor records intermediate results as well as input and output array versions as persistent, named objects. These are sufficient to re-run any previously executed operator from any point in the workflow.

b) Cell-level lineage: Cell-level lineage models the relationships between an output cell and each input cell that generated it³ as a set of pairs of input and output cells:

$\{ (out, in) \mid out \in O_P \wedge in \in \cup_{i \in [1,n]} I_P^i \}$

Here, $out \in O_P$ means that $out$ is a single cell contained in the output array $O_P$, and $in$ refers to a single cell in one of the input arrays.

³ Although we model and refer to lineage as a mapping between input and output cells, in the SubZero implementation we store these mappings as references to physical cell coordinates.

c) Region lineage: Region lineage models lineage as a set of region pairs. Each region pair describes an all-to-all lineage relationship between a set of output cells, $outcells$, and a set of input cells, $incells_i$, in each input array, $I_P^i$:

$\{ (outcells, incells_1, \ldots, incells_n) \mid outcells \subseteq O_P \wedge incells_i \subseteq I_P^i \}$

Region lineage is more than a shorthand; scientific applications often exhibit locality and generate multiple output cells from the same set of input cells, which can be represented by a single region pair. For example, the LSST star detection operator finds clusters of adjacent bright pixels and generates an array that labels each pixel with the star that it belongs to. Every output pixel labeled Star X depends on all of the input pixels in the Star X region. Automatically tracking such relationships at the cell level is particularly expensive, so region lineage is a generalization of cell-level lineage that makes this relationship explicit. For this reason, later sections will exclusively discuss region pairs.

Users execute a lineage query by specifying the coordinates of an initial set of query cells, $C$, in a starting array, and a path of operators ($P_1 \ldots P_m$) to trace through the workflow:

$R = execute\_query(C, ((P_1, idx_1), \ldots, (P_m, idx_m)))$

Here, the indexes ($idx_1 \ldots idx_m$) are used to disambiguate which input of a multi-input operator the query path traverses.

Depending on the order of operators in the query path, SubZero recognizes the query as a forward lineage query or backward lineage query. A forward lineage query defines a path from some ancestor operator $P_1$ to some descendant operator $P_m$. The output of an operator $P_{i-1}$ is the $idx_i$'th input of the next operator, $P_i$. The query cells $C$ are a subset of $P_1$'s $idx_1$'th input array, $C \subseteq I_{P_1}^{idx_1}$.

A backward lineage query reverses this process, defining a path from some descendant operator, $P_1$, that terminates at some ancestor operator, $P_m$. The output of an operator, $P_{i+1}$, is the $idx_i$'th input of the previous operator, $P_i$, and the query cells $C$ are a subset of $P_1$'s output array, $C \subseteq O_{P_1}$. The query results are the coordinates of the cells $R \subseteq O_{P_m}$ or $R \subseteq I_{P_m}^{idx_m}$, for forward and backward queries, respectively.

V. LINEAGE API AND STORAGE MODEL

SubZero allows developers to write operators that efficiently represent and store lineage. This section describes several modes of region lineage, and an API that UDF developers can use to generate lineage from within the operators. We also introduce a mechanism to control the modes of lineage that an operator generates. Finally, we describe how SubZero re-executes black-box operators during a lineage query. Table I summarizes the API calls and operator methods that are introduced in this section.

TABLE I
RUNTIME AND OPERATOR METHODS

API Method                                      Description

System API Calls
lwrite(outcells, incells_1, ..., incells_n)     API to store lineage relationship.
lwrite(outcells, payload)                       API to store small binary payload instead of input cells. Called by payload operators.

Operator Methods
run(input-1, ..., input-n, cur_modes)           Execute the operator, generating lineage types in cur_modes ⊆ {Full, Map, Pay, Comp, Blackbox}.
map_b(outcell, i)                               Computes the input cells in input_i that contribute to outcell.
map_f(incell, i)                                Computes the output cells that depend on incell ∈ input_i.
map_p(outcell, payload, i)                      Computes the input cells in input_i that contribute to outcell; has access to payload.
supported_modes()                               Returns the lineage modes C ⊆ {Full, Map, Pay, Comp, Blackbox} that the operator can generate.

Before describing the different lineage storage methods, we illustrate the basic structure of an operator:

    class OpName:
        def run(input-1, ..., input-n, cur_modes):
            /* Process the inputs, emit the output */
            /* Record lineage modes specified in cur_modes */
        def supported_modes():
            /* Return the lineage modes the operator supports */

Each operator implements a run() method, which is called when inputs are available to be processed. It is passed a list of lineage modes it should output in the cur_modes argument; it writes out lineage data using the lwrite() method described below. The developer specifies the modes that the operator supports (and that the runtime will consider) by overriding the supported_modes() method. If the developer does not override supported_modes(), SubZero assumes an all-to-all relationship between the inputs and outputs. Otherwise, the operator automatically supports black-box lineage.

For ease of explanation, this section is described in the context of the LSST operator CRD (cosmic ray detection, depicted as A and B in Figure 1) that finds pixels containing cosmic rays in a single image, and outputs an array of the same size. If a pixel contains a cosmic ray, the corresponding cell in the output is set to 1, and the output cell depends on the 49 neighboring pixels within a 3 pixel radius. Otherwise the output cell is set to 0, and only depends on the corresponding input pixel. A region pair is denoted (outcells, incells).

A. Lineage Modes

SubZero supports four modes of region lineage (Full, Map, Pay, Comp), and one mode of black-box lineage (Blackbox). cur_modes is set to Blackbox when the operator does not need to generate any pairs (because black-box lineage is always in use). Full lineage explicitly stores all region pairs, and the other lineage modes reduce the amount of lineage that is stored by partially computing lineage at query time using developer defined mapping functions. The following sections describe the modes in more detail.

1) Full Lineage: Full lineage (Full) explicitly represents and stores all region pairs. It is straightforward to instrument any operator to generate full lineage. The developer simply writes code that generates region pairs and uses lwrite() to store the pairs. For example, in the following CRD pseudocode, if cur_modes contains Full, the code iterates through each cell in the output, calculates the lineage, and calls lwrite() with lists of cell coordinates. Note that if Full is not specified, the operator can avoid running the lineage related code.

    def run(image, cur_modes):
        ...
        if Full ∈ cur_modes:
            for each cell in output:
                if cell == 1:
                    neighs = get_neighbor_coords(cell)
                    lwrite([cell.coord], neighs)
                else:
                    lwrite([cell.coord], [cell.coord])

Although this lineage mode accurately records the lineage data, it is potentially very expensive to both generate and store. We have identified several widely applicable operator properties that allow the operators to generate more efficient modes of lineage, which we describe next.

2) Mapping Lineage: Mapping lineage (Map) compactly represents an operator's lineage using a pair of mapping functions. Many operators such as matrix transpose exhibit a fixed execution structure that does not depend on the input cell values. These operators, called mapping operators, can compute forward and backward lineage from a cell's coordinates and metadata (e.g., input and output array sizes) and do not need to access array data values. This is a valuable property because mapping operators do not incur runtime and storage overhead. For example, one-to-one operators, such as matrix addition, are mapping operators because an output cell only depends on the input cell at the same coordinate, regardless of the value. Developers implement a pair of mapping functions, map_f(cell, i)/map_b(cell, i), that calculate the forward/backward lineage of an input/output cell's coordinates, with respect to the i'th input array. For example, a 2D transpose operator would implement the following functions:

    def map_b((x,y), i):
        return [(y,x)]
    def map_f((x,y), i):
        return [(y,x)]

Most SciDB operators (e.g., matrix multiply, join, transpose, convolution) are mapping operators, and we have implemented their forward and backward mapping functions. Mapping operators in the astronomy and genomics benchmarks are depicted as solid boxes (Figures 1 and 2).

3) Payload Lineage: Rather than storing the input cells in each region pair, payload lineage (Pay) stores a small amount of data (a payload), and recomputes the lineage using a payload-aware mapping function (map_p()). Unlike mapping lineage, the mapping function has access to the user-stored binary payload. This mode is particularly useful when the operator has high fanin and the payload is very small. For example, suppose that the radius of neighboring pixels that a cosmic ray pixel depends on increases with brightness; then payload lineage only stores the brightness instead of the input cell coordinates. Payload operators call lwrite(outcells, payload) to pass in a list of output cell coordinates and a binary blob, and define a payload function, map_p(outcell, payload, i), that directly computes the backward lineage of outcell ∈ outcells from the outcell coordinate and the payload. The results are input cells in the i'th input array. As with mapping functions, payload lineage does not need to access array data values. The following pseudocode stores radius values instead of input cells:

    def run(image, cur_modes):
        ...
        if Pay ∈ cur_modes:
            for each cell in output:
                if cell == 1:
                    lwrite([cell.coord], '3')
                else:
                    lwrite([cell.coord], '0')
    def map_p((x,y), payload, i):
        return get_neighbors((x,y), int(payload))

In the above implementation, each region pair stores the output cells and an additional argument that represents the radius, as opposed to the neighboring input cells. When a backward lineage query is executed, SubZero retrieves the (outcells, payload) pairs that intersect with the query and executes map_p on each pair. This approach is particularly powerful because the payload can store arbitrary data – anything from array data values to lineage predicates [7]. Operators D to G in the two benchmarks (Figures 1 and 2) are payload operators.

Note that payload functions are designed to optimize execution of backward lineage queries. While SubZero can index the input cells in full lineage, the payload is a binary blob that cannot be easily indexed. A forward query must iterate through each (outcells, payload) pair and compute the input cells using map_p before they can be compared to the query coordinates.

4) Composite Lineage: Composite lineage (Comp) combines mapping and payload lineage. The mapping function defines the default relationship between input and output cells, and results of the payload function overwrite the default lineage if specified. For example, CRD can represent the default relationship – each output cell depends on the corresponding input cell in the same coordinate – using a mapping function, and write payload lineage for the cosmic ray pixels:

    def run(image, cur_modes):
        ...
        if Comp ∈ cur_modes:
            for each cell in output:
                if cell == 1:
                    lwrite([cell.coord], 3)
                // else map_b defines default behavior
    def map_p((x,y), radius, i):
        return get_neighbors((x,y), radius)
    def map_b((x,y), i):
        return [(x,y)]

Composite operators can avoid storing lineage for a significant fraction of the output cells. Although composite lineage is similar to payload lineage in that the payload cannot be indexed to optimize forward queries, the amount of payload lineage that is stored may be small enough that iterating through the small number of (outcells, payload) pairs is efficient. Operators A, B and C in the astronomy benchmark (Figure 1) are composite operators.

B. Supporting Operator Re-execution

An operator stores black-box lineage when cur_modes equals Blackbox. When SubZero executes a lineage query on an operator that stored black-box lineage, the operator is re-executed in tracing mode. When the operator is re-run at lineage query time, SubZero passes cur_modes = Full, which causes the operator to perform lwrite() calls. The arguments to these calls are sent to the query executor.

Rather than re-executing the operator on the full input arrays, SubZero could also reduce the size of the inputs by applying bounding box predicates prior to re-execution. The predicates would reduce both the amount of lineage that needs to be stored and the amount of data that the operator needs to re-process. Although we extended both mapping and full operators to compute and store bounding box predicates, we did not find it to be a widely useful optimization. During query execution, SubZero must retrieve the bounding boxes for every query cell, and either re-execute the operator for each box, or merge the bounding boxes and re-run the operator using the merged predicate. Unfortunately, the former approach incurs an overhead on each execution (to read the input arrays and apply the predicates) that quickly becomes a significant cost. In the latter approach, the merged bounding box quickly expands to encompass the full input array, which is equivalent to completely re-executing the operator, but incurs the additional cost to retrieve the predicates. For these reasons, we do not further consider them here.

VI. IMPLEMENTATION

This section describes the Runtime, Encoder, and Query Executor components in greater detail.

A. Runtime

In SciDB (and our prototype), we automatically store black-box lineage by using write-ahead logging, which guarantees that black-box lineage is written before the array data, and is "no overwrite" on updates. Region lineage is stored in a collection of BerkeleyDB hashtable instances. We use BerkeleyDB to store region lineage to avoid the client-server communication overhead of interacting with traditional DBMSes. We turn off fsync, logging and concurrency control to avoid recovery and locking overhead. This is safe because the region lineage is treated as a cache, and can always be recovered by re-running operators.

The runtime allocates a new BerkeleyDB database for each operator instance that stores region lineage. Blocks of region pairs are buffered in memory, and bulk encoded using the Encoder. The data in each region pair is stored as a unit (SubZero does not optimize across region pairs), and the output and input cells use separate encoding schemes. The layout can be optimized for backward or forward queries by respectively storing the output or input cells as the hash key. On a key collision, the runtime decodes, merges, and re-encodes the two hash values. The next subsection describes how the Encoder serializes the region pairs.

B. Encoder

While Section V presented efficient ways to represent region lineage, SubZero still needs to store cell coordinates, which can easily be larger than the original data arrays. The Encoder stores the input and output cells of a region pair (generated by calls to lwrite()) into one or more hash table entries, specified by an encoding strategy. We say the encoding strategy is backward optimized if the output cells are stored in the hash key, and forward optimized if the hash key contains input cells.

We found that four basic strategies work well for the operators we encountered – FullOne and FullMany are the two strategies to encode full lineage, and PayOne and PayMany encode payload lineage⁴.

... hash entry. The hash value stores a reference to a single entry containing the input cells (Figure 4.2). This implementation does not need to compute and store bounding box information and does not need the spatial index because each input cell is stored separately, so queries execute using direct hash lookups.

For payload lineage, PayMany stores the lineage in a similar manner as FullMany, but stores the payload as the hash value (Figure 4.3). PayOne creates a hash entry for every output cell and stores a duplicate of the payload in each hash value (Figure 4.4).

The Optimizer picks a lineage strategy that spans the entire workflow. It picks one or more storage strategies for each operator. Each storage strategy is fully specified by a lineage mode (Full, Map, Payload, Composite, or Black-box), encoding strategy, and whether it is forward or backward optimized (→ or ←). SubZero can use multiple storage strategies to optimize for different query types.

C. Query Execution

The Query Executor iteratively executes each step in the lineage query path by joining the lineage with the coordinates of the query cells, or the intermediate cells generated from the previous step.
The output at each step is a set of cell Hash%Value% Hash%Key% Hash%Value% Hash%Key% coordinates that is compactly stored in an in-memory boolean #1234& (0,1)& array with the same dimensions as the input (backw ard query) Index& #1234& (2,3)& or output (forw ard query) array . A bit is set if the intermediate (4,5),(6,7)& (0,1),&(2,3)& (4,5),(6,7)& #1234& result contains the corresponding cell. F or e xample, suppose 1. FullMany strategy! 2. FullOne strategy! we ha v ean operator P that tak esas input a 1 × 4 array . Considera backw ardsquery asking for the lineage of some Index& payload& (0,1)& (0,1),&(2,3)& output cell C of P . If the result of the query is 1001, this (2,3)& payload& payload& means thatC depends on the first and fourth cell Pin’ s input. 3. PayMany strategy! 4. PayOne strategy! W e chose the in-memory array becauseman yoperators Fig. 4. F our e xamples of encoding strate gies ha v elar gef anin or f anout,and can easily generatese v eral times more results (due to duplicates)than are unique. Deduplication a v oids w asting storage and sa v es w ork. Similarly , Figure 4 depicts ho w the backw ard-optimiedimplementhe e x ecutor can close an operatorearly if it detects that all tation of these strate giesencode tw o output cells with coof the possible cells ha v e been generated. ordinates @, 1) and B , 3) that depend on input cells with W e also implement an to speed up entir e arr ay optimization coordinatesD , 5) and F , 7). F ul l M an yusesa single hash queries where all of the bits in the boolean array are set. F or entry with the set of serialized output cells as the k e y and the e xample, this can happen if a backw ard query tra v erses se v eral set of input cells as the v alue (Figure 4.1). Each coordinate is high-f anin operators or an all-to-all operator such as matrix bitpack ed into a single inte ger if the array is small enough. W e in v ersion. 
In these cases, calculating the lineage of e v ery query also create an R T ree on the cells in the hash k e y to quickly cell is v ery e xpensi v e and often unnecessary . Man y operators find the entries that intersect with the query . This inde x uses (e.g., matrix multiply or in v erse) can safely assumethat the the dimensions of the array as its k e ys and identifies the hash forw ard (backw ard) lineage of an entire input (output) array table entries that contain cells in particular re gions. The figure is the entire output (input) array . This optimization is v aluable sho wsthe unserializedv ersionsof the cells for simplicity . when it can be applied – it impro v ed the query performance F ul l M an yis most appropriatewhen the lineage has high of a forw ard query in the astronomy benchmark that tra v erses f anout because it only needs to store the output cells once. ×. an all-to-all-operator by 83 If the f anoutis lo w ,F ul l O nemore ef ficientlyserializes In general, it is dif ficult to automatically identify when and stores each output cell as the hash k e yof a separate the optimization’ sassumptionshold. Considera concatenate 4 W etried a lar genumber of possiblestrate giesand found that comple x operator that tak es tw o 2D arrays A, B with shapes A, n) and encodings(e.g., computeand store the bounding box of a set of cells, C , A, m), and produces an A, n+m) output by concatenating B to along with cells in the bounding box b ut not C in) incur high encoding costs A. The opt imization w ould produce dif ferent results, because without noticeably reduced storage costs. Man y are also readily implemented as payload or composite lineage A’ s forw ard lineage is only a subset of the output. W e currently Strategy rely on the programmer to manually annotate operators where the optimization can be applied. 
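
One backward step of the query executor described above can be sketched as follows. This is a minimal illustration under stated assumptions, not SubZero's implementation: lineage is held in a plain Python dict in the backward-optimized FullOne layout (output cell mapped to its input cells) instead of BerkeleyDB, and the names `backward_step` and `lineage_P` are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def backward_step(lineage: Dict[Cell, List[Cell]],
                  query_cells: List[Cell],
                  input_shape: Tuple[int, int]) -> List[List[bool]]:
    """Join query cells with backward-optimized lineage via direct hash lookups.

    The result is a boolean array shaped like the operator's input; a bit is
    set if the corresponding input cell is in the lineage of any query cell.
    """
    rows, cols = input_shape
    bits = [[False] * cols for _ in range(rows)]
    remaining = rows * cols  # number of bits that are still unset
    for out_cell in query_cells:
        for r, c in lineage.get(out_cell, []):
            if not bits[r][c]:      # duplicates are absorbed by the bitmap
                bits[r][c] = True
                remaining -= 1
        if remaining == 0:          # every possible cell was generated: stop early
            break
    return bits


# Operator P takes a 1 x 4 input; output cell C = (0, 0) depends on the
# first and fourth input cells (hypothetical lineage for illustration).
lineage_P = {(0, 0): [(0, 0), (0, 3)]}
result = backward_step(lineage_P, [(0, 0)], (1, 4))
bitstring = "".join("1" if b else "0" for b in result[0])
print(bitstring)  # 1001
```

The boolean array makes deduplication free (setting an already-set bit is a no-op), and the `remaining` counter gives the early-exit check a constant cost per step.
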

TABLE II
LINEAGE STRATEGIES FOR EXPERIMENTS.

    Astronomy Benchmark
    Strategy      Description
    BlackBox      All operators store black-box lineage
    BlackBoxOpt   Like BlackBox, uses mapping lineage for built-in operators.
    FullOne       Like BlackBoxOpt, but uses FullOne for UDFs.
    FullMany      Like FullOne, but uses FullMany for UDFs.
    SubZero       Like FullOne, but stores composite lineage using PayOne for UDFs.

    Genomics Benchmark
    Strategy      Description
    BlackBox      UDFs store black-box lineage
    FullOne       UDFs store backward optimized FullOne
    FullMany      UDFs store backward optimized FullMany
    FullForw      UDFs store forward optimized FullOne
    FullBoth      UDFs store FullForw and FullOne
    PayOne        UDFs store PayOne
    PayMany       UDFs store PayMany
    PayBoth       UDFs store PayOne and FullForw

VII. LINEAGE STRATEGY OPTIMIZER

Having described the basic storage strategies implemented in SubZero, we now describe our lineage storage optimizer. The optimizer's objective is to choose a set of storage strategies that minimize the cost of executing the workflow while keeping storage overhead within user-defined constraints. We formulate the task as an integer programming problem, where the inputs are a list of operators, strategy pairs, disk overheads, query cost estimates, and a sample workload that is used to derive the frequency with which each operator is invoked in the lineage workload. Additionally, users can manually specify operator-specific strategies prior to running the optimizer. The formal problem description is stated as:

\[
\begin{aligned}
\min_{x} \quad & \sum_{i} p_i \cdot \Big( \min_{j \,\mid\, x_{ij}=1} q_{ij} \Big) + \epsilon \cdot \sum_{ij} \left( disk_{ij} + \beta \cdot run_{ij} \right) \cdot x_{ij} \\
\text{s.t.} \quad & \sum_{ij} disk_{ij} \cdot x_{ij} \le MaxDISK \\
& \sum_{ij} run_{ij} \cdot x_{ij} \le MaxRUNTIME \\
& \forall i \quad \Big( \sum_{0 \le j < M} x_{ij} \Big) \ge 1 \\
& \forall i,j \quad x_{ij} \in \{0, 1\} \\
& \forall x_{ij} \in U \quad x_{ij} = 1 \qquad \text{(user specified strategies)}
\end{aligned}
\]

Here, $x_{ij} = 1$ if operator $i$ stores lineage using strategy $j$, and 0 otherwise. $MaxDISK$ is the maximum storage overhead specified by the user, and $MaxRUNTIME$ is the corresponding bound on runtime overhead; $q_{ij}$, $run_{ij}$, and $disk_{ij}$ are the average query cost, runtime overhead, and storage overhead costs for operator $i$ using strategy $j$ as computed by the cost model. $p_i$ is the probability that a lineage query in the workload accesses operator $i$, and is computed from the sample workload. A single operator may store its lineage data using multiple strategies.

The goal of the objective function is to minimize the cost of executing the lineage workload, preferring strategies that use less storage. When an operator uses multiple strategies to store its lineage, the query processor picks the strategy that minimizes the query cost. The min statement in the left hand term picks the best query performance from the strategies that have been picked ($j \mid x_{ij} = 1$). The right hand term penalizes strategies that take excessive disk space or cause runtime slowdown. $\beta$ weights runtime against disk overhead, and $\epsilon$ is set to a very small value to break ties. A large $\epsilon$ is similar to reducing $MaxDISK$ or $MaxRUNTIME$.

We heuristically remove configurations that are clearly non-optimal, such as strategies that exceed user constraints, or are not properly indexed for any of the queries in the workload (e.g., forward optimized when the workload only contains backward queries). The optimizer also picks mapping functions over all other classes of lineage.

We solve the ILP problem using the simplex method in the GNU Linear Programming Kit. The solver takes about 1ms to solve the problem for the benchmarks.

A. Query-time Optimizer

While the lineage strategy optimizer picks the optimal lineage strategy, the executor must still pick between accessing the lineage stored by one of the lineage strategies, or re-running the operator. The query-time optimizer consults the cost model using statistics gathered during query execution and the size of the query result so far to pick the best execution method. In addition, the optimizer monitors the time to access the materialized lineage. If it exceeds the cost of re-executing the operator, SubZero dynamically switches to re-running the operator. This bounds the worst case performance to 2× the black-box approach.

VIII. EXPERIMENTS

In the following subsections, we first describe how SubZero optimizes the storage strategies for the real-world benchmarks described in Section II, then compare several of our lineage storage techniques with black-box level only techniques. The astronomy benchmark shows how our region lineage techniques improve over cell-level and black-box strategies on an image processing workflow. The genomics benchmark illustrates the complexity in determining an optimal lineage strategy and that the optimizer is able to choose an effective strategy within user constraints.

Overall, our findings are that:
• An optimal strategy heavily relies on operator properties such as fanin and fanout, the specific lineage queries, and query execution-time optimizations. The difference between a sub-optimal and optimal strategy can be so large that an optimizer-based approach is crucial.
• Payload, composite, and mapping lineage are extremely effective and low overhead techniques that greatly improve query performance, and are applicable across a number of scientific domains.
• SubZero can improve the LSST benchmark queries by up to 10× compared to naively storing the region lineage (similar to what cell-level approaches would do) and up to 255× compared to black-box lineage. The runtime and storage overhead of the optimal scheme is up to 30 and 70× lower than cell-level lineage, respectively, and 1.49 and 1.95× higher than executing the workflow.
• Even though the genomics benchmark executes operators very quickly, SubZero can find the optimal mix of black-box and region lineage that scales to the amount of available storage. SubZero uses a black-box only strategy when the available storage is small, and switches from space-efficient to query-optimized encodings with looser constraints. When the storage constraints are unbounded, SubZero improves forward queries by over 500× and backward queries by 2-3×.

The current prototype is written in Python and uses BerkeleyDB for the persistent store, and libspatialindex for the spatial index. The microbenchmarks are run on a 2.3 GHz Linux server with 24 GB of RAM, running Ubuntu 2.6.38-13-server. The benchmarks are run on a 2.3 GHz MacBook Pro with 8 GB of RAM, a 5400 RPM hard disk, running OS X 10.7.2.

A. Astronomy Benchmark

In this experiment, we run the Astronomy workflow with five backward queries and one forward query as described in Section II-A. The 22 built-in operators are all expressed as mapping operators and the UDFs consist of one payload operator that detects celestial bodies and three composite operators that detect and remove cosmic rays. This workflow exhibits considerable locality (stars only depend on neighboring pixels) and sparsity (stars are rare and small), and the queries are primarily backward queries. Each workflow execution consumes two 512×2000 pixel (8MB) images (provided by LSST) as input, and we compare the strategies in Table II.

[Fig. 5. Astronomy Benchmark. (a) Disk (MB) and runtime (sec) overhead per storage strategy. (b) Query costs (sec) per storage strategy for queries BQ0 to BQ4, FQ0 and FQ0Slow; Y-axes are log scale. Strategies shown: BlackBox, BlackBoxOpt, FullMany, FullOne, SubZero.]

Figure 5(a) plots the disk and runtime overhead for each of the strategies. BlackBox and BlackBoxOpt show the base cost to execute the workflow and the size of the input arrays only; the goal is to be as close to these bars as possible. FullOne and FullMany both require considerable storage space (6×, 53×) because the three cosmic ray operators generate a region pair for every input and output pixel at the same coordinates. Similarly, both approaches incur 6× and 44× runtime overhead to serialize and store them. FullMany must also construct the spatial index on the output cells. The SubZero optimizer instead picks composite lineage that only stores payload lineage for the small number of cosmic rays and stars. This reduces the runtime and disk overheads to 1.49× and 1.95× the workflow inputs. By comparison, this storage overhead is negligible compared to the cost of storing the intermediate and final results (which amount to 11.5× the input size).

Figure 5(b) compares lineage query execution costs. BQx and FQx respectively stand for backward and forward query x. All of the queries use the entire array optimization described in Section VI-C whereas FQ0Slow does not. BlackBox must re-run each operator and takes up to 100 secs per query. BlackBoxOpt can avoid rerunning the mapping operators, but still re-runs the computationally intensive UDFs. Storing region lineage reduces the cost of executing the backward queries by 34× (FullMany) and 45× (FullOne) on average. SubZero benefits from executing mapping functions and reading a small amount of lineage data and executes 255× faster on average. FQ0Slow illustrates how the all-to-all optimization improves the query performance by 83× by avoiding fine-grained lineage altogether.

B. Genomics Benchmark

In this experiment, we run the genomics workflow and execute a lineage workload with an equal mix of forward and backward lineage queries (Section II-B). There are 10 built-in mapping operators, and the 4 UDFs are all payload operators. In contrast to the astronomy workflow, these UDFs do not exhibit significant locality, and perform data shuffling and extraction operations that are not amenable to mapping functions. In addition, the operators perform simple calculations and execute quickly, so there is a less pronounced trade-off between re-executing the workflow and accessing region lineage. In fact, there are cases where storing lineage actually degrades the query performance. We were provided a 100×56 matrix of 96 patients and 55 health and genetic features. Although the dataset is small, future datasets are expected to come from a larger group of patients, so we constructed larger datasets by replicating the patient data. The query performance and overheads scaled linearly with the size of the dataset and so we report results for the dataset scaled by 100×.

We first show the high variability between different static strategies (Table II) and how the query-time optimizer (Section VII-A) avoids sub-optimal query execution. We then show how the SubZero cost based optimizer can identify the optimal strategy within varying user constraints.

1) Query-Time Optimizer: This experiment compares the strategies in Table II with and without the query-time optimization described in Section VII-A.
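
The query-time optimization compared in this experiment (Section VII-A) can be sketched as below. This is a hedged illustration rather than SubZero's code: `rerun_cost` stands in for the cost model's estimate of re-executing the operator, `lookup_costs` is a hypothetical stream of per-lookup times for reading materialized lineage, and the function name is made up. Stored lineage is abandoned once the time spent on it exceeds the re-execution estimate, so the total is at most twice that estimate plus one lookup.

```python
from typing import Iterable, Tuple


def run_with_fallback(lookup_costs: Iterable[float],
                      rerun_cost: float) -> Tuple[str, float]:
    """Return the method that finished the query and the total time spent.

    Materialized lineage is read one lookup at a time. If the accumulated
    time exceeds the estimated cost of re-running the operator, the executor
    gives up on the stored lineage and re-runs the operator instead.
    """
    spent = 0.0
    for cost in lookup_costs:
        if spent > rerun_cost:
            # Stored lineage already costs more than the black-box approach.
            return "rerun", spent + rerun_cost
        spent += cost
    return "lineage", spent


# Well-matched index: lookups are cheap, so the stored lineage is used.
print(run_with_fallback([0.1] * 10, rerun_cost=5.0))
# Mismatched index: lookups are slow, so the executor switches to re-running.
print(run_with_fallback([1.0] * 100, rerun_cost=5.0))
```
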
Each operator uses mapping lineage if possible, and otherwise stores lineage using the specified strategy. The majority of the UDFs generate region pairs that contain a single output cell. As mentioned in previous experiments, payload lineage stores very little binary data, and incurs less overhead than the full lineage approaches (Figure 6(a)). Storing both forward and backward-optimized lineage (PayBoth and FullBoth) requires significantly more overhead: 8 and 18.5× more space than the input arrays, and 2.8 and 26× runtime slowdown.

[Fig. 6. Genomics benchmark. Queries run with (dynamic) and without (static) the query-time optimizer described in Section VII-A. (a) Disk (MB) and runtime (sec) overhead per storage strategy. (b) Query costs (static); Y-axes are log scale. (c) Query costs (dynamic); Y-axes are log scale. Strategies shown: BlackBox, FullBoth, FullForw, FullMany, FullOne, PayBoth, PayMany, PayOne. Queries shown: BQ0, BQ1, FQ0, FQ1.]

Figure 6(b) highlights how query performance can degrade if the executor blindly joins queries with mismatched indexed lineage (e.g., backward-optimized lineage with forward queries). For example, FullForw degraded backward query performance by 520×.⁵ Interestingly, BQ1 ran slower because the query path contains several operators with very large fanins. This generates so many intermediate results that performing index lookups on each one is slower than re-running the operators. Note, however, that the forward optimized strategies improved the performance of FQ0 and FQ1 because the fanout is low.

⁵ All comparisons are relative to BlackBox.

Figure 6(c) shows that the query-time optimizer executes the queries as fast as, or faster than, BlackBox. Although this in general requires accurate statistics and cost estimation, the optimizer limits the query performance degradation to 2× by dynamically switching to the BlackBox strategy. Overall, the backward and forward queries improved by up to 2 and 25×, respectively.

2) Lineage Strategy Optimizer: The previous section compared many strategies, each with different performance characteristics depending on the operator and query. We now evaluate the SubZero strategy optimizer on the genomics benchmark.

[Fig. 7. Genomics benchmark. SubZeroX has a storage constraint of X MB. (a) Disk (MB) and runtime (sec) overhead per storage strategy. (b) Query costs; Y-axes are log scale. Strategies shown: BlackBox, SubZero1, SubZero10, SubZero20, SubZero50, SubZero100. Queries shown: BQ0, BQ1, FQ0, FQ1.]

Figure 7 illustrates that when the user increases storage constraints from 1 to 100MB (with unbounded runtime constraint), the optimizer picks more storage intensive strategies that are predicted to improve the benchmark queries. SubZero chooses BlackBox when the constraint is too small, and stores forward and backward-optimized lineage that benefits all of the queries when the minimum amount of storage is available (20MB). Materializing further lineage has diminishing storage-to-query benefits. SubZero100 uses 50MB to forward-optimize the UDFs using (MANY, ONE), which reduces the forward query costs to sub-second costs. This is because the UDFs have low fanout, so each join in the query path is a small number of hash lookups. Due to space constraints, we simply mention that specifying and varying the runtime overhead constraints achieves similar results.

C. Microbenchmark

The previous experiments compared several end-to-end strategies; however, it can be difficult to distinguish the sources of the benefits. This subsection summarizes the key differences between the prevailing strategies in terms of overhead and query performance. The comparisons use an operator that generates synthetic lineage data with tunable parameters. Due to space constraints we show results from varying the fanin, fanout and payload size (for payload lineage).

Each experiment processes and outputs a 1000x1000 array, and generates lineage for 10% of the output cells. The results scaled close to linearly as the number of output cells with lineage varies. A region pair is randomly generated by selecting a cluster of output cells with a radius defined by fanout, and selecting fanin cells in the same area from the input array. We generate region pairs until the total number of output cells is equal to 10% of the output array. The payload strategy uses a payload size of fanin×4 bytes (the payload is expected to be very small). We compare several backward optimized strategies (← FullMany, ← FullOne, ← PayMany, ← PayOne), one forward lineage strategy (→ FullOne), and black-box (BlackBox).
We first discuss the overhead to store and index the lineage, then comment on the query costs.

[Fig. 8. Disk (MB) and runtime (sec) overhead as a function of fanin, for fanout 1 and fanout 100. Strategies shown: ← PayMany, ← PayOne, ← FullMany, ← FullOne, → FullOne, BlackBox.]

Figure 8 compares the runtime and disk overhead of the different strategies. For reference, the size of the input array is 3.8MB. The best full lineage strategy differs based on the operator fanout. FullOne is superior when fanout ≤ 5 because it doesn't need to create and store the spatial index. The crossover point to FullMany occurs when the cost of duplicating hash entries for each output cell in a region pair exceeds that of the spatial index. The overhead of both approaches increases with fanin. In contrast, payload lineage has a much lower overhead than the full lineage approaches and is independent of the fanin because the payload is typically small and does not need to be encoded. When the fanout increases to 50 or 100, PayMany and FullMany require less than 3MB and 1 second of overhead. The forward optimized FullOne is comparable to the other approaches when the fanin is low. However, when the fanin increases it can require up to fanin× more hash entries because it creates an entry for every distinct input cell in the lineage. It converges to the backward optimized FullOne when the fanout and fanin are high. Finally, BlackBox has nearly no overhead.

[Fig. 9. Backward Lineage Queries, only backward-optimized strategies. Query cost (sec) as a function of fanin, for fanout 1 and fanout 100. Strategies shown: ← PayMany, ← PayOne, ← FullMany, ← FullOne.]

Figure 9 shows the query performance for queries that access the backward/forward lineage of 1000 output/input cells. The performance scales mostly linearly with the query size. There is a clear difference between FullMany or PayMany, and FullOne or PayOne, due to the additional cost of accessing the spatial index (Figure 9). Payload lineage performs similarly to, but not significantly faster than, full provenance, although the query performance remains constant as the fanin increases. In comparison (not shown), BlackBox takes between 2 and 20 seconds to execute a query where fanin=1 and around 0.7 seconds when fanin=100. Using a mismatched index (e.g., using forward-optimized lineage for backward queries) takes up to two orders of magnitude longer than BlackBox to execute the same queries. The forward queries using → FullOne execute similarly to FullOne in Figure 9 so we do not include the plots.

D. Discussion

The experiments show that the best strategy is tied to the operator's lineage properties, and that there are orders of magnitude differences between different lineage strategies. Science-oriented lineage systems should seek to identify and exploit operator fanin, fanout, and redundancy properties.

Many scientific applications, particularly sensor-based or image processing applications like environmental monitoring or astronomy, exhibit substantial locality (e.g., average temperature readings within an area) that can be used to define payload, mapping or composite operators. As the experiments show, SubZero can record their lineage with less overhead than from operators that only support full lineage. When locality is not present, as in the genomics benchmark, the optimizer may still be able to find opportunities to record lineage if the constraints are relaxed. A very promising alternative is to simplify the process of writing payload and mapping functions by supporting variable granularities of lineage. This lets developers define coarser relationships between inputs and outputs (e.g., specify lineage as a bounding box that may contain inputs that didn't contribute to the output). This also allows the lineage system to perform lossy compression.

IX. RELATED WORK

There is a long history of provenance and lineage research both in database systems and in more general workflow systems. There are several excellent surveys that characterize provenance in databases [8] and scientific workflows [9], [10]. As noted in the introduction, the primary differences from prior work are that SubZero uses a mix of black-box and region provenance, exploits the semantics of scientific operators (making use of mapping functions) and uses a number of provenance encodings.

Most workflow systems support custom operators containing user-designed code that is opaque to the runtime. This presents a difficulty when trying to manage cell-level (e.g., array cells or database tuples) provenance. Some systems [4], [11] model operators as black-boxes where all outputs depend on all inputs, and track the dependencies between input and output datasets. Efficient methods to expose, store and query cell-level provenance are an area of on-going research.

Several projects exploit workflow systems that use high level programming constructs with well defined semantics. RAMP [12] extends MapReduce to automatically generate lineage capturing wrappers around Map and Reduce operators. Similarly, Amsterdamer et al. [13] instrument the PIG [14] framework to track the lineage of PIG operators. However, user defined operators are treated as black-boxes, which limits their ability to track lineage.

Other workflow systems (e.g., Taverna [3] and Kepler [15]) process nested collections of data, where data items may be images or DNA sequences. Operators process data items in a collection, and these systems automatically track which subsets of the collections were modified, added, or removed [16], [17]. Chapman et al. [18] attach to each data item a provenance tree of the transformations resulting in the data item, and propose efficient compression methods to reduce the tree size. However, these systems model operators as black-boxes and data items are typically files, not records.

Database systems execute queries that process structured tuples using well defined relational operators, and are a natural target for a lineage system. Cui et al. [19] identified efficient tracing procedures for a number of operator properties. These procedures are then used to execute backward lineage queries. However, the model does not allow arbitrary operators to generate lineage, and models them as black-boxes. Section V describes several mechanisms (e.g., payload functions) that can implement many of these procedures.

Trio [5] was the first database implementation of cell-level lineage, and unified uncertainty and provenance under a single data and query model. Trio explicitly stores relationships between input and output tuples, and is analogous to the full provenance approach described in Section V.

The SubZero runtime API is inspired by the PASS [20], [21] provenance API. PASS is a file system that automatically stores provenance information of files and processes. Applications can use the libpass library to create abstract provenance objects and relationships between them, analogous to producing cell-level lineage. SubZero extends this API to support the semantics of common scientific provenance relationships.

X. CONCLUSION

This paper introduced SubZero, a scientific-oriented lineage storage and query system that stores a mix of black-box and fine-grained lineage. SubZero uses an optimization framework that picks the lineage representation on a per-operator basis that maximizes lineage query performance while staying within user constraints. In addition, we presented region lineage, which explicitly represents lineage relationships between sets of input and output data elements, along with a number of efficient encoding schemes. SubZero is heavily optimized for operators that can deterministically compute lineage from array cell coordinates and small amounts of operator-generated metadata. UDF developers expose lineage relationships and semantics by calling the runtime API and/or implementing mapping functions.

Our experiments show that many scientific operators can use our techniques to dramatically reduce the amount of redundant lineage that is generated and stored to improve query performance by up to 10× while using up to 70× less storage space as compared to existing cell-based strategies. The optimizer successfully scales the amount of lineage stored based on application constraints, and can improve the query performance of the genomics benchmark, which is amenable to black-box only strategies. In conclusion, SubZero is an important initial step to make interactively querying fine-grained lineage a reality for scientific applications.

REFERENCES

[1] Z. Ivezić, J. Tyson, E. Acosta, R. Allsman, S. Anderson, et al., "LSST: From science drivers to reference design and anticipated data products." [Online]. Available: http://lsst.org/files/docs/overviewV2.0.pdf
[2] M. Stonebraker, J. Becla, D. J. DeWitt, K.-T. Lim, D. Maier, O. Ratzesberger, and S. B. Zdonik, "Requirements for science data bases and SciDB," in CIDR, 2009.
[3] T. Oinn, M. Greenwood, M. Addis, N. Alpdemir, J. Ferris, K. Glover, C. Goble, A. Goderis, D. Hull, D. Marvin, P. Li, P. Lord, M. Pocock, M. Senger, R. Stevens, A. Wipat, and C. Wroe, "Taverna: lessons in creating a workflow environment for the life sciences," in Concurrency and Computation: Practice and Experience, 2006.
[4] H. Kuehn, A. Liberzon, M. Reich, and J. P. Mesirov, "Using genepattern for gene expression analysis," Curr. Protoc. Bioinform., Jun 2008.
[5] J. Widom, "Trio: A system for integrated management of data, accuracy, and lineage," Tech. Rep., 2004.
[6] P. Tamayo, Y.-J. Cho, A. Tsherniak, H. Greulich, et al., "Predicting relapse in patients with medulloblastoma by integrating evidence from clinical and genomic features," Journal of Clinical Oncology, p. 29:1415-1423, 2011.
[7] R. Ikeda and J. Widom, "Panda: A system for provenance and data," in IEEE Data Engineering Bulletin, 2010. [Online]. Available: http://ilpubs.stanford.edu:8090/972/
[8] J. Cheney, L. Chiticariu, and W. C. Tan, "Provenance in databases: Why, how, and where," in Foundations and Trends in Databases, 2009.
[9] S. Davidson, S. Cohen-Boulakia, A. Eyal, B. Ludäscher, T. McPhillips, S. Bowers, M. K. Anand, and J. Freire, "Provenance in scientific workflow systems."
[10] R. Bose and J. Frew, "Lineage retrieval for scientific data processing: A survey," in ACM Computing Surveys, 2005.
[11] J. Goecks, A. Nekrutenko, J. Taylor, and T. G. Team, "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," in Genome Biology, 2010.
[12] R. Ikeda, H. Park, and J. Widom, "Provenance for generalized map and reduce workflows," in CIDR, 2011. [Online]. Available: http://ilpubs.stanford.edu:8090/985/
[13] Y. Amsterdamer, S. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen, "Putting lipstick on pig: Enabling database-style workflow provenance," in PVLDB, 2012.
[14] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig latin: A not-so-foreign language for data processing," in SIGMOD, 2008.
[15] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, and S. Mock, "Kepler: an extensible system for design and execution of scientific workflows," in SSDM, 2004.
[16] M. K. Anand, S. Bowers, T. McPhillips, and B. Ludäscher, "Efficient provenance storage over nested data collections," in EDBT, 2009.
[17] P. Missier, N. Paton, and K. Belhajjame, "Fine-grained and efficient lineage querying of collection-based workflow provenance," in EDBT, 2010.
[18] A. P. Chapman, H. Jagadish, and P. Ramanan, "Efficient provenance storage," in SIGMOD, 2008.
[19] Y. Cui, J. Widom, and J. L. Wiener, "Tracing the lineage of view data in a warehousing environment," in ACM Transactions on Database Systems, 1997.
[20] K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. Seltzer, "Provenance-aware storage systems," in NetDB, 2005.
[21] K.-K. Muniswamy-Reddy, J. Barillariy, U. Braun, D. A. Holland, D. Maclean, M. Seltzer, and S. D. Holland, "Layering in provenance-aware storage systems," Tech. Rep., 2008.