A Fine-Grained Lineage System for Scientific Databases

advertisement
SubZero: A Fine-Grained Lineage System for
Scientific Databases
Eugene W u, Samuel Madden, Michael Stonebrak er
CSAIL, MIT
{ sirrice,
madden, stonebraker
} @csail.mit.edu
Abstract— Data lineage is a k ey component of pr o v enance queries that w alk backw ardto identify the specific cells in
that helps scientists track and query r elationships between input the input arrays on which a gi v en output cell depends and that
and output data. While curr ent systems r eadily support lineage w alk forw ardto find the output ce llsthat a particular input
r elationships at the file or data array le v el, finer -grained support
at an array-cell le v elis impractical due to the lack of support cell influenced.Such a s ystemmust manageinput to output
f or user defined operators and the high runtime and storage relationships at afine-gr ainedarray-cell le v el.
o v erhead to stor e such lineage.
Prior w ork in data lineage tracking systems has lar gely been
W e inter viewed scientists in se v eral domains to identify a set limited to coarse-grainedmetadatatracking [3], [4], which
of common semantics that can be le v eragedto efficiently stor e
storesrelationshipsat the file or relational table le v el.F inefine-grained lineage. W e use the insights to define lineage r epr erelationships at the array cell or tuple
e
sentations that efficiently captur e common locality pr operties in gr ained linea gtracks
the lineage data, and a set of APIs so operator de v elopers
can le v el.The typical approach,popularizedby T rio [5], which
easily export lineage inf ormation fr om user defined operators. we call cell-le vellinea g,eeagerly materializes the identifiers
Finally , we intr oduce tw o benchmarks deri v edfr om astr onomy of the input data records (e.g., tuples or array cells) that
and genomics, and sho w that our techniques can r educe lineage
each output record depends on, and use s it to directly answer
query costs by up to 10× while incuring substantially less impact
backward lineage queries. An alternati v e, which we
callkblac
on w orkflo w runtime and storage.
box linea g, esimply records the input and output datasets and
I . I NTRODUCTION
runtime parametersof each operator as it is e x ecuted,
and
Man y scientific applications are naturally e xpressedas a materializesthe lineage at lineage query time by re-running
w orkflo w that comprises a sequence of operations applied rele
to v ant operators in a tracing mode.
ra w input data to produce an output dataset or visualization. Unfortunately , both techniques are insuf ficient in scientific
Lik e database queries, such w orkflo ws can be quite comple
x,
applications
for tw o reasons. First, scientific applications mak e
consisting up to hundreds of operations [1] whose parameters
hea vy use of user defined functions (UDFs), whose semantics
or inputs v ary from one run to another .
are opaqueto the lineage system.Existing approachesconScientists record and query pro v enance – metadata thatserv
de- ati v ely
assumethat e v eryoutput cell of a UDF depends
scribes the processes, en vironment and relationships between
on e v ery input cell, which limits the utility of a fine-grained
input and output data arrays – to ascertain data quality , audit
lineage system because it tracks a lar ge amount of information
and deb ug w orkflo ws, and more generally understand ho w
the pro viding an y insight into which inputs actually conwithout
output data came to be. A k e y component of pro v enance,
data trib uted to a gi v en output. This necessitates proper APIs so that
linea g,eidentifies ho w input data elements are related to output
UDF designerscan e xposefine-grainedlineage information
data elementsand is inte gralto deb uggingw orkflo ws.F or and operator semantics to the lineage system.
e xample,scientistsneed to be able to w ork backw ardfrom
Second, neither black-box only nor cell-le v elonly techthe output to identify the sources of an error gi v en erroneous
niques are suf ficientfor man yapplications.Scientific w orkor suspiciousoutput results. Once the sourceof the error is flo wsconsumedata arrays that re gularlycontain millions of
identified, the scientist will then often w ant to identify deri v cells,
ed while generating comple x relationships between groups
do wnstream data elements that depend on the erroneous vofalue
input and output cells. Storing cell-le v el lineage can a v oid
so he can inspect and possibly correct those outputs.
re-running some computationally intensi v e operators (e.g., an
In this paper ,we describe the design of a fine-grained image processing operator that detects a small number of stars
lineage tracking and querying system for array-orientedsci- in telescope imagery), b ut needs enormous amounts of storage
entific w orkflo ws.W eassumea data and e x ecutionmodel if e v eryoutput dependson e v eryinput (e.g., a matrix sum
similar to SciDB [2]. W e chose this becauseit pro vides operation) – it may be preferableto recomputethe lineage
a closed e x ecutionen vironmentthat can capture all of the at query time. In addition, applications such as LSST1 are
lineage information, and because it is specifically designed for
often subject to limitations that only allo w them to dedicate
scientific data processing(scientiststypically use RDBMSes a small percentageof storageto lineage operations.Ideally ,
to managemetadataand do data processingoutside of the lineage systems w ould support a h ybrid of the tw o approaches
database). The system allo ws scientists to perform e xploratory
w orkflo wdeb uggingby e x ecutinga series of data linea g e 1http://lsst.or g
and tak euser constraintsinto accountwhen deciding which
The ne xt section describes our moti v ating use cases in more
operators to store lineage for .
detail. It is follo wed by a high le v el system architecture and
This paper seeks to address both challenges. W e interviedetails
wed of the rest of the system.
scientists from se v eral domains to understand their data proI I .U SEC ASES
cessing w orkflo ws and lineage needs and used the results to
design a science-orienteddata lineage system.W eintroduce
W e de v eloped tw o benchmark applications after discussions
e xploits locality properties pre v alent with
in en vironmentalscientists, astronomists,and geneticists.
Re gion Linea ,gwhich
e
the scientific operators we e n c ountered. It addresses common
The first is an image processingbenchmarkde v eloped
with
relationshipsbetween re gionsof input and output cells by scientists at the Lar ge Synoptic Surv e yT elescope(LSST)
storing grouped or summary information rather than indi vidual
project. It is v ery similar to en vironmentalsciencerequirepairs of input and output cells. W e de v eloped a lineage API
ments, so the y are combined together . The second w as de v el2
that supports black-box lineage as well as Re gionLinea g, e oped with geneticists at the Broad Institute
. Each benchmark
which subsumescell-le v ellineage. Programmerscan also consists of a w orkflo wdescription, a dataset,and lineage
specify forw ard/backw ard
Mapping Functionsfor an operator queries. W e used the benchmarks to design the optimizations
to directly compute the forw ard/backw ard lineage solely from
described in the paper . This section will briefly describe each
input/output cell coordinates and operator ar guments; we imbenchmark’ s scientific application, the types of desired lineage
plemented these for man y common matrix and statistical funcqueries, and application-specific insights.
tions. W e also de v eloped a h ybrid lineage storage system that
allo wsusersto e xplicitly trade-of fstoragespacefor lineage A. Astr onomy
query performance using an optimization frame w ork. Finally ,The Lar geSynaptic Surv e yT elescope(LSST) is a wide
we introduce tw o end-to-end scientific lineage benchmarks.angle telescope slated to be gin operation in F all 2015. A k e y
As mentioned earlier ,the system prototype, SubZero, is challenge in processing telescope images is filtering out high
implemented in the conte xt of the SciDB model. SciDB ener gyparticles (cosmic rays) that create abnormally bright
stores multi-dimensional arrays and e x ecutes database queries
pix els in the resulting image, which can be mistak en for stars.
composed of b uilt-in and user -defined operators (UDFs) that
The telescope compensates by taking tw o consecuti v e pictures
are compiled into w orkflo ws.Gi v ena set of user -specified of the samepiece of the sk yand remo vingthe cosmic rays
storage constraints, SubZero uses an optimization frame w in
orksoftw are. The LSST image processing w orkflo w (Figure 1)
to choosethe optimal type of lineage (black box, or one of tak estw o images as input and outputs an annotatedimage
se v eral ne w types we propose) for each SciDB operator that
that labels each pix el with the celestial body it belongs to. It
minimizes lineage query costs while respectinguser storage first cleans and detects cosmic rays in each image separately ,
constraints.
then createsa single composite,cosmic-ray-fr ee,image that
A summary of our contrib utions include:
is used to detect celestial bodies.There are 22 SciDB b uilt1) The notion of r e gionlinea g, ewhich SubZero uses to in operators(blue solid box es)that perform common matrix
ef ficiently store and query lineage data from scientific operations,such as con v olution,and four UDFs (red dotted
applications.W ealso introduce se v eralef ficientrepre- box eslabeled A-D). The UDFs A and B output cosmic-ray
sentations and encoding schemes that each ha v e dif ferent
masks for each of the images.After the images are subseo v erhead and query performance trade of fs.
quently mer ged,C remo v es
cosmic-raysfrom the composite
2) A linea g e API
that operator de v elopers can use to e xpose
image, and D detects stars from the cleaned image.
lineage from user defined operators, including the spec- The LSST scientists are interested in three types of queries.
ification of mapping functions for man yof the b uilt in The first picks a star in the output image and traces the lineage
SciDB operators.
back to the initial input image to detect bad input pix els. The
3) A unified stor a g model
for mapping functions, re gion latter tw o queries select a re gion of output (or input) pix els and
e
and cell-le v el lineage, and black-box lineage.
trace the pix els backw ard (or forw ard) through a subset of the
4) An optimization fr ame work
which picks an optimal mix- w orkflo w to identify a single f aulty operator . As an e xample,
ture of black-box and re gion lineage to maximize querysuppose the operator that computes t h e mean brightness of the
performance within user defined constraints.
image generated an anomalously high v alue due to a fe w bad
5) A performancee v aluationof our approachon end-to- pix el, which led to further mis-calculations.The astronomer
end astronomy and genomics benchmarks. The astronomy
might w ork backw ardfr om those calculations,identify the
benchmark,which is computationallyintensi v eb ut e x- input pix els that contrib uted to them, and filter out those pix els
hibits high locality , benefits from ef ficient representations.
that appear e xcessi v ely bright.
Comparedto cell-le v eland black-box lineage, SubZero
Both the LSST and en vironmental scientists described w ork× and
reduces storage o v erhead by nearly
70 speeds query loads where the majority of the data processing code computes
× . The genomics benchmark output pix els using input pix els within a small distance from
performance by almost 255
highlights the need for , and benefits of, using an optimizer
the corresponding coordinate of the output pix el. These re gions
to pick the storage layout, which impro v es query perfor 2
mance by 2³ × while staying within user constraints.
http://www.broadinstitute.org/
Test#
G
may be constant, pre-defined v alues, or easily computed from
Matrix#
a small amount of additional metadata. F or e xample, a pix el in
E
C R(D) is set if the
Training#
the mask produced by cosmic ray detection
F#
H
Matrix#
related input pix el is a cosmic ray , and depends on neighboring
input cells within 3 pix els. Otherwise, it only depends on the
Modeling#phase#
Tes6ng#phase#
related input pix el. The y also felt that it is suf ficient for lineage
queries to return a superset of the e xact lineage. Although we
Fig. 2. Simplified diagram of genomics w orkflo w . Each solid rectangle is a
do not tak e adv antage of this insight, this suggests future wSciDB
ork nati v e operator while the red dotted rectangles are UDFs.
in lossy compression techniques.
C
D
A
Array'
Array'
Constraints'
Workflow'Executor'
Op3 mizer'
IP'Solver'
A
Query'
Executor'
2'
Run3me'
Encoder'
B. Genomics Pr ediction
1'
Cells'
Sta3s3cs'
Collector'
C
B
Fig. 1. Summary diagram of LSST w orkflo wEach
.
solid rectangleis a
SciDB nati v e operator while the red dotted rectangles are UDFs.
D
Queries'
Operator'
Specific'
Datastore'
A'
ReCexecutor'
Decoder'
C'
D'
W e ha v e also been w orking with researchers at the Broad
Institute on a genomics benchmark related to predicting recur Fig. 3. The SubZero architecture.
rences of medulloblastoma in patients. Medulloblastoma is a
Executor
),
constraints
on the amount of storage that can
form of cancer that spa wns brain tumors that spread through
be
de
v
oted
to
lineage
tracking,
and a sample lineage query
the cerebrospinal fluid. P ablo et. al [6] ha v e identified a set of
w
orkload
that
the
user
e
xpects
to run. SubZero optimally
patient features that helppredict relapsein medulloblastoma
decides
the
type
of
lineage
that
each operator in the w orkflo w
patients that ha v e been treated. The features include histology ,
will
generate
(
the
) in
linea
g
e
str
ate
gyorder to maximize the
gene e xpression le v els, and the e xistence of genetic abnormalperformance
of
the
query
w
orkload
performance.
ities. The w orkflo w (Figure 2) is a tw o-step proces s that first
Figure
3
sho
ws
the
system
architecture.
The solid and
tak es a training patient-feature matrix and outputs a Bayesian
dashed
arro
ws
indicate
the
control
and
data
flo w ,respecmodel. Then it uses the model to predict relapse in a test
ti
v
ely
.
Users
interact
with
SubZero
by
defining
and e x ecuting
patient-featurematrix. The model computesho wmuch each
w
orkflo
ws
(
),
specifying
constraints
to the
Workflow
Executor
feature v alue contrib utes to the lik elihood of patient relapse.
Optimizer
Query
Executor
,
and
running
lineage
queries
(
).
The
The ten b uilt-in operators (solid blue box es) are simple matrix
operators
in
the
w
orkflo
w
specify
a
list
of
the
types
of
lineage
transformations. The remaining UDFs e xtract a subset of the
(described
in
Section
V)
that
each
operator
can
generate,
input arrays (E,G), compute the model (F), and predict the
which defines the set of optimization possibilities.
relapse probability (H).
The model is designedto be used by clinicians through a
Each operator initially generates black-box lineage (i.e., just
visualization that generateslineage queries. The first query records the namesof the inputs it processes)b ut o v ertime
picks a relapseprediction and traces its lineage back to the changes its strate gy through optimization. As operators process
training matrix to find supporting input data. The second query
data, the y send lineage to Runtime
the
, which uses the
Encoder
picks a feature from the model and traces it back to the training
to serialize the lineage before writing it toOper ator Specific
. The Runtime may also send lineage and other
matrix to find the contrib uting inputv alues. The thirdquery Datastor es
points at a set of training v alues and traces them forw ard tostatistics to theOptimizer, which calculates statistics such as
the model, while the last query traces them to the end of thethe amount of lineage that each operator generates. SubZero
w orkflo w to find the predictions the y af fected.
periodically runs the Optimizer, which uses an Inte g erPr o.
gr ammingSolver to compute the ne wlineage strate gyOn
The genomics benchmark can de v ote up-front storage and
runtime o v erhead
to ensuref astquery e x ecution
becauseit the right side, the Query Executor compiles lineage querie s
is an interacti v evisualization. Although this is application into query plans that join the query with lineage data. The
Runtime, which reads and
specific, it suggeststhat scientific applicationsha v ea wide Executorrequests lineage from the
range of storage and runtime o v erhead constraints.
decodesstored lineage, uses the Re-e xecutorto re-run the
operators,and sendsstatistics (e.g., query f anoutand f anin)
I I I .A R C H I T E C T U R E
to the optimizer to refine future optimizations.
SubZero records and stores lineage data at w orkflo w runtime
Gi v en this o v ervie w , we no w describe the data model and
and uses it to ef ficiently e x ecute lineage queries. The inputstructure
to
of lineage queries (Section IV), the dif ferent types of
SubZero is a w orkflo wspecification(the graph in Workflow lineage the system can record (Section V), the functionality of
the Runtime, Encoder, and Query Executor(Section VI), and
b) Cell-le vellinea g e:Cell-le v ellineage models the relationships between an output cell and each input cell that
finally the optimizer in Section VII.
generated it3 as a set of pairs of input and output cells:
I V . D A T A , L I N E A G E A N D Q U E RY M O D E L
{ ( out, in ) |out 2 OP ^ in 2 [
i
i 2 [1 ,n ] I P
}
Here, out 2 OP means thatout is a single cell contained in
In this section, we describe the representation and notation
the output arrayOP . in refers to a single cell in one of the
of lineage data and queries in SubZero.
input arrays.
c) Re gionlinea g e:Re gionlineage models lineage as a
SubZero is designed to w ork with a w orkflo we x ecutor
of
r e gionpair s. Each re gionpair describesan all-to-all
system that applies a fix ed sequence of operators to some set
set
of
lineage relationship betweena set of output cells, outcel l ,s
inputs. Each operatoroperateson one or more input objects and a set of input cells,incel l si , in each input array I, i :
P
(e.g., tables or arrays), and producesa single output object.
{ ( outce l l s, in cel 1l ,s..., inc el l ns) | outc el l s_ OP ^ in cel l is _ I Pi }
F ormally we
, say an operator P tak esas input n objects,
I P1 , ..., IPn , and outputs a single object,
OP .
Re gion lineage is more than a short hand; scientific applicaMultiple operators are composed together to form a w ork-tions often e xhibit locality and generate multiple output cells
flo w , described by a w orkflo w specification, which is a directed
from the same set of input cells, which can be represented
ac yclic graphW 0=( N , E) , where N is the set of operators, by a single re gion pair . F or e xampl e, the LSST star detection
and e =( OP , I iP ) 2 E specifies that the output ofP forms operator finds clusters of adjacent bright pix els and generates
the i’ th input to the operator P 0. An instance of W , Wj , an array that labels each pix el with the star that it belongs
e x ecutes
the w orkflo won a specific dataset.Each operator to. Ev eryoutput pix el labeled Star X dependson all of the
runs when all of its inputs are a v ailable.
input pix els in theStar X re gion. Automatically tracking such
The data follo ws the SciDB data model, which processes relationships at the cell le v elis particularly e xpensi v so
e,
multi-dimensional arrays. A combination of v alues al ong each
re gionlineage is a generalizationof cell-le v ellineage that
dimension, termed a coor dinate
, uniquely identifies a cell. mak es this relationship e xplicit. F or this reason, later sections
Each cell in an array has the sameschema,and consistsof will e xclusi v ely discuss re gion pairs.
one or more named,typed fields. SciDB is “no o v erwrite, ” Users e x ecute a lineage query by specifying the coordinates
of an initial set of query cells, C , in a starting array ,and a
meaning that intermediate results produced as the output ofpath
an of operators( P . . . P ) to trace through the w orkflo w:
1
m
operator are al w ays stored persistently , and each update to an
R = execute q uer y( C ,(( P1 , idx 1 ) , ..., ( Pm , idx m )))
object creates a ne w , persistent v ersion. SubZero stores lineage
information with each v ersion to speed up lineage queries. Here, the inde x es
(idx 1 . . . idxm ) are used to disambiguate
Our notion of backw ard lineage is defined as a subset of which
the input of a multi-input operator that the query path
inputs that will reproduce the same output v alue if the operator
tra v erses.
is re-run on its lineage. F or e xample, the lineage of an outputDepending on the order of operatorsin the query path,
cell of Matrix Multiply are all cells of the corresponding ro w SubZero recognizesthe query as a forwar d linea g equery
and column in the input a rrays– e v enif some are empty . or bac kwar d linea g e query
. A forwar d linea g query
defines
e
C , of the outputs such a path from some ancestoroperator P1 to some descendent
F orw ard lineage is defined as a subset,
C contains the input cells. The operator Pm . The output of an operator Pi 1 is the idx i ’ th
that the backw ard lineage of
e xactsemanticsfor UDFs are ulitmately controlled by the input of the ne xt operatorPi,. The query cellsC are a subset
idx
de v eloper .
of P1 ’ s idx 1 ’ th input array C
, _ I P1 1 .
SubZero supportsthree types of lineage: blac kbox, cellA bac kwar dlinea g equery re v erses
this process,defining
le vel, and r e gionlineage. As a w orkflo w e x ecutes, lineageaispath from some descendent operator
P1, that terminates at
generated on an operator -by-operator basis, depending on some
the ancestor operator
Pm, . The output of an operatorP,i +1
types of lineage that each operator is instrumented to support
is the idx i ’ th input of the pre vious operator
Pi , ,and the query
and the materialization decisions made by the optimizer . cells C are a subsetof P1 ’ s output array ,C _ OP 1 . The
W e ha v e ins trumented SciDB’ s b uilt-in operators to generate
query results are the coordinatesof the cells R _ OP m or
lineage mappings from inputs to outputs and pro vide an APIR _ I Pidxm m , for forw ard and backw ard queries, respecti v ely .
for UDF designersto e xposethese relationships.If the API
V.L INEAGEAPIANDSTORAGEM ODEL
is not used, then SubZero assumesan all-to-all relationship
SubZero
allo ws de v elopers to write operators that ef ficiently
between the cells of the input arrays and cells of the output
represent
and
store lineage. This section describesse v eral
array .
modes
of
re
gion
lineage, and an API that UDF de v elopers
a) Blac k-boxlinea g e:SubZero does not require adcan
use
to
generate
lineage from within the operators.W e
ditional resourcesto store black-box lineage because,lik e
also
introduce
a
mechanism
to control the modesof lineage
SciDB, our w orkflo w e x ecutor records intermediate results as
well as input and output array v ersionsas peristent,named
3
Although we model and refer to lineage as a mapping between input
objects. These are suf ficient to re-run an y pre viously e x ecuted
and output cells, in the SubZero implementation we store these mappings as
references to ph ysical cell coordinates.
operator from an y point in the w orkflo w .
API Method
Description
System API Calls
l write(outcells,incells 1 , ...,inc ellsn )
API to store lineage relationship.
l write(outcells, payload)
API to store small binary payload
instead of input cells. Called by
payload operators.
Operator Methods
run(input-1,...,input-n,cur modes)
Ex ecute the operator , generating
_ { F ul l ,
lineage types in cur modes
} ox
M ap, P ay , C om p, B l ack b
ma pb (outcell, i)
Computes the input cells ininput i
that contrib ute toou tcell.
ma pf (incell, i)
Computes the output cells that depend
on incell 2 input i .
ma pp (outcell, payload, i)
Computes the input cells ininput i
that contrib ute toou tcell, has access
to pa yload.
C _ { F ul l ,
supported modes()
Returns the lineage modes
} ox
M ap, P ay , C om p, B l ack b
that the operator can generate.
other lineage modes reduce the amount of lineage that is stored
by partially computing lineage at query time using de v eloper
defined mapping functions. The follo wing sectionsdescribe
the modes in more detail.
1) Full Linea g e:Full lineage (Full ) e xplicitly represents
and stores all re gion pairs. It is straightforw ard to instrument
an yoperator to generatefull lineage. The de v eloper
simply
() to
writes code that generatesre gionpairs and usesl w r i te
store the pairs. F or e xample,in the follo wing CRD pseudocode, ifcur modes containsFull , the code iterates through
each cell in the output, calculates the lineage, and calls
l w r ite() with lists of cell coordinates. Note that ifFull is not
specified,the operatorcan a v oidrunning the lineage related
code.
def run(image,
cur_modes):
...
if F ul l 2 cur_modes:
for each cell in output:
if cell == 1:
that an operator generates. Finally , we describe ho w SubZero
neighs = get_neighbor_coords(cell)
re-e x ecutes black-box operators during a lineage query . T able
lwrite([cell.coord],
neighs)
I summarizesthe API calls and operator methods that are
else:
lwrite([cell.coord],
[cell.coord])
introduced in this section.
T ABLE I
RUNTIMEANDOPERATORMETHODS
Before describing the dif ferent lineage storage methods, weAlthough this lineage mode accurately records the lineage
illustrate the basic structure of an operator:
data, it is potentially v ery e xpensi vto
e both generateand
class OpName:
def run(input-1,...,input-n,cur_modes):
/ * Process the inputs,
emit the output
/ * Record lineage
modes specified
in cur_modes * /
def supported_modes():
/ * Return the lineage
modes the
operator
supports
*/
store. W eha v eidentified se v eral
widely applicable operator
properties
that
allo
w
the
operators
to generate more ef ficient
*/
modes of lineage, which we describe ne xt.
2) Mapping Linea g e:Mapping lineage (Map) compactly
representsan operator’ slineage using a pair of mapping
functions. Man y operatorssuch as matrix transposee xhibit
a fix ed e x ecution structure that does not depend on the input
Each operator implements run()
a
method, which is called cell v alues.These operators,called mapping oper ator, scan
when inputs are a v ailable to be processed. It is passed a list
compute forw ard and backw ard lineage from a cell’ s coordiof lineage modes it should output in the
cur modesar gument; nates and metadata(e.g., input and output array sizes) and
it writes out lineage data using thelwrite() method described do not need to accessarray data v alues.This is a v aluable
belo w The
.
de v eloper
specifiesthe modes that the operator property because mapping operators do not incur runtime and
supports (and that the runtime will consider) by o v erriding storage o v erhead.
F or e xample,one-to-one operators,such
the supported modes()method. If the de v eloperdoes not as matrix addition, are mapping operators because an output
o v erridesupported modes()
, SubZero assumesan all-to-all cell only dependson the input cell at the same coordinate,
relationship betweenthe inputs and outputs. Otherwise, the re g ardless
of the v alue. De v elopersimplement a pair of
operator automatically supports black-box lineage.
mapping functions,mapf ( cel l ,)i/map b( ce l l ), ,i that calculate
F or ease of e xplanation,this section is describedin the the forw ard/backw ard lineage of an input/output cell’ s coordiconte xtof the LSST operator C R D (cosmic ray detection, nates, with respect to thei ’ th input array . F or e xample, a 2D
depicted as A and B in Figure 1) that finds pix els containingtranspose operator w ould implement the follo wing functions:
cosmic rays in a single image, and outputs an array of the
def map_b((x,y),
i):
def map_f((x,y),
i):
same size. If a pix el contains a cosmic ray , the corresponding return [(y,x)]
return
[(y,x)]
1, and the output cell depends on the
cell in the output is set to
49 neighboring pix els within a 3 pix el radius. Otherwise the Most SciDB operators (e.g., matrix multiply , join, transpose,
output cell is set to0, and only depends on the correspondingcon v olution) are mapping operators, and we ha v e implemented
their forw ard and backw ard mapping functions. Mapping oper l ,s incel l s).
input pix el. A re gion pair is denotedoutcel
(
ators in the astronomy and genomics benchmarks are depicted
A. Linea g e Modes
as solid box es (Figures 1 and 2).
SubZero supports four modes of re gion lineage
Full,( Map,
3) P ayloadLinea g e:Rather than storing the input cells
P ay , Comp
kbox
), and one mode of black-box lineageBlac
(
). in each re gion pair , payload lineage (P ay
) stores a small
cur modesis set toBlac kboxwhen the operator does not needamount of data (a payload), and recomputesthe lineage
to generatean ypairs (becauseblack box lineage is al w ays using a payload-a w are
mapping function (mapp () ). Unlik e
in use). Full lineage e xpli citly stores all re gion pairs, and the
mapping lineage, the mapping function has accessto the
user -storedbinary payload. This mode is particularly useful
return
[(x,y)]
when the operator has high f anin and the payload is v ery
Compositeoper ator scan a v oidstoring lineage for a sigsmall. F or e xample,supposethat the radius of neighboring
nificant fraction of the output cells. Although it is similar
pix els that a cosmic ray pix el dependson increaseswith
to payload lineage in that the payload cannot be inde x edto
brightness,then payload lineage only stores the brightness
optimize forw ard queries, the amount of payload lineage that
insteall of the input cell coordinates.(P ayloadoper ator) s
is stored may be small enough that iterating through the small
( outcel l s, pay l oad
) to pass in a list of output
call l w r i te
number of (outcells, payload) pairs is ef ficient. Operators A,B
cell coordinatesand a binary blob, and define a payload
and C in the astronomy benchmark (Figure 1) are composite
i
directly computes
function, mapp ( outcel l , pay l oad) , , that
operators.
outcel l 2 outcel l sfrom the outcel l
the backw ard lineage of
coordinateand the payload.The result are input cells in the B. Supporting Oper ator Re-e xecution
i ’ th input array . As with mapping functions, payload lineage An operator stores black-box lineage when cur modes
does not need to accessarray data v alues.The follo wing equals B l a ck box
. When SubZero e x ecutes
a lineage query
pseudocode stores radius v alues instead of input cells:
on an operator that stored black-box lineage, the operator
is re-e x ecuted
in tracing mode. When the operatoris re-run
at lineage query time, SubZero passescur modes = F ul l,
which causesthe operator to perform l w r ite() calls. The
ar guments to these calls are sent to the query e x ecutor .
’3’)
Rather than re-e x ecuting
the operator on the full input
arrays,
SubZero
could
also
reduce
the size of the inputs by
’0’)
applying bounding box predicatesprior to re-e x ecution.
The
def map_p((x,y),
payload,
i):
predicates w ould reduce both the amount of lineage that needs
return
get_neighbors((x,y),
int(payload))
to be stored and the amount of data that the operatorneeds
In the abo v eimplementation,each re gionpair stores the to re-process.Although we e xtendedboth mapping and full
output cells and an additional ar gumentthat representsthe operators to compute and store bounding box predicates, we
radius, as opposed to the neighboring input cells. When a backdid not find it to be a widely useful optimization. During query
e x ecution, SubZero must retrie v e the bounding box es for e v ery
w ard lineage query is e x ecuted, SubZero retrie v es the (outcells,
mapp query cell, and either re-e x ecute the operator for each box, or
payload) pairs that intersect with the query and e x ecutes
mer ge the bounding box es and re-run the operator using the
on each pair . This approach is particularly po werful because
mer gedpredicate Unfortunately
.
the
, former approachincurs
the payload can store arbitrary data – an ything from array data
v alues to lineage predicates [7]. Operators D to G in the tw an
o o v erhead on each e x ecution (to read the input arrays and
apply the predicates) that quickly becomes a significant cost.
benchmarks (Figures 1 and 2) are payload operators.
In the latter approach, the mer ged bounding box quickly e xNote that payload functions are designed to optimize e x epands
to encompass the full input array , which is equi v alent to
cution of backw ard lineage queries. While SubZero can inde
x
completely re-e x ecuting the operator , b ut incurs the additional
the input cells in full lineage, the payload is a binary blob that
cost to retrie v ethe predicates.F orthesereasons,we do not
cannot be e asily inde x ed. A forw ard query must iterate through
further consider them here.
each (outcells, payload) pair and compute the input cells using
mapp before it can be compared to the query coordinates.
V I .I MP LE ME NTA TION
4) CompositeLinea g e:Composite lineage (Comp) comThis section describesthe Runtime, Encoder, and Query
bines mapping and payload lineage. The mapping function
Executorcomponents in greater detail.
defines the def ault relationship between input and output cells,
def run(image,cur_modes):
...
if P AY 2 cur_modes:
for each cell in output:
if cell == 1:
lwrite([cell.coord],
else:
lwrite([cell.coord],
and results of the payload functiono verwritethe def ault lin- A. Runtime
eage if specified. F or e xample, CRD can represent the def ault
In SciDB (and our prototype), we automatically store blackrelationship – each output cell depends on the corresponding
box lineage by using write-aheadlogging, which guarantees
input cell in the same coordinate – using a mapping function,
that black-box lineage is written before the array data, and
and write payload lineage for the cosmic ray pix els:
is “no o v erwrite”on updates.Re gionlineage is stored in a
def run(image,cur_modes):
collection of Berk ele yDB hashtable instances. W e use Berk e...
le yDB to store re gion lineage to a v oid the client-serv er comif C O M P 2 cur_modes):
for each cell in output:
munication o v erhead of interacting with traditional DBMSes.
if cell == 1:
W e turn of f fsync, logging and concurrenc y control to a v oid
lwrite([cell.coord],
3)
reco v ery and locking o v erhead. This is safe because the re gion
// else map_b defines
default
behavior
lineage is treated as a cache, and can al w ays be reco v ered by
def map_p((x,y),
radius,
i):
re-running operators.
return
get_neighbors((x,y),
radius)
The runtime allocates a ne w Berk ele yDB database for each
def map_b((x,y),
i):
operator instance that stores re gion lineage. Blocks of re gion
pairs are b uf feredin memory ,and b ulk encodedusing the hash entry . The hash v alue stores a reference to a single entry
Encoder. The data in each re gionpair is stored as a unit containing the input cells (Figure 4.2). This implementation
(SubZero does not optimize across re gion pairs), and the doesn’ t need to compute and store bounding box information
output and input cells use separateencoding schemes.The and doesn’ t need the spatial inde x because each input cell is
layout can be optimized for backw ard or forw ard queries bystored separately , so queries e x ecute using direct hash lookups.
respecti v ely storing the output or input cells as the hash k e yF. or payload lineage, P ay M an ystores the lineage in a
On a k e ycollision, the runtime decodes,mer ges,and re- similar manner asF ul l M an ,yb ut stores the payload as the
encodesthe tw o hash v alues.The ne xtsubsectiondescribes hash v alue(Figure 4.3). P ay O necreatesa hash entry for
ho w theEncoderserializes the re gion pairs.
e v ery output cell and stores a duplicate of the payload in each
hash v alue (Figure 4.4).
B. Encoder
The Optimizerpicks a lineage strate gy that spans the entire
While Section V presented ef ficient w ays to represent rewgion
orkflo wIt. picks one or more stor a g str
for each
e ate gies
lineage, SubZero still needsto store cell coordinates,which operator . Each storage strate gy is fully specified by a lineage
can easily be lar ger than the original data arrays.Encoder
The
mode (Full, Map, P ayload, Composite, or Black-box), encodstores the input and output cells of a re gion pair (generateding
by strate gy , and whether it is forw ard or backw ard optimized
calls tol w r ite() ) into one or more hash table entries, specified
(→ or ). SubZero can use multiple storage strate giesto
by an encoding str ate gy
. W e say the encoding strate gyis optimize for dif ferent query types.
if the output cells are stored in the hash
bac kwar d optimized
k e y , and
if the hash k e y contains input cells.
forwar d optimized
C. Query Execution
W e found that four basic strate giesw ork well for the
The Query Executor iterati v elye x ecutes
each step in the
operatorswe encountered.– F ul l O neand F ul l M a nyare
the tw o strate giesto encodefull lineage, and P ay O neand lineage query path by joining the lineage with the coordinates
4
P ay M an yencode payload lineage
of the query cells, or the intermediatecells generatedfrom
.
the pre viousstep. The output at each step is a set of cell
Hash%Value%
Hash%Key% Hash%Value%
Hash%Key%
coordinates that is compactly stored in an in-memory boolean
#1234&
(0,1)&
array with the same dimensions as the input (backw ard query)
Index&
#1234&
(2,3)&
or output (forw ard query) array . A bit is set if the intermediate
(4,5),(6,7)&
(0,1),&(2,3)& (4,5),(6,7)&
#1234&
result contains the corresponding
cell. F or e xample, suppose
1. FullMany strategy!
2. FullOne strategy!
we ha v ean operator P that tak esas input a 1 × 4 array .
Considera backw ardsquery asking for the lineage of some
Index&
payload&
(0,1)&
(0,1),&(2,3)&
output cell C of P . If the result of the query is 1001, this
(2,3)&
payload&
payload&
means thatC depends on the first and fourth cell Pin’ s input.
3. PayMany strategy!
4. PayOne strategy!
W e chose the in-memory array becauseman yoperators
Fig. 4. F our e xamples of encoding strate gies
ha v elar gef anin or f anout,and can easily generatese v eral
times more results (due to duplicates)than are unique. Deduplication a v oids w asting storage and sa v es w ork. Similarly ,
Figure 4 depicts ho w the backw ard-optimiedimplementhe e x ecutor
can close an operatorearly if it detects that all
tation of these strate giesencode tw o output cells with coof the possible cells ha v e been generated.
ordinates @, 1) and B , 3) that depend on input cells with
W e also implement an
to speed up
entir e arr ay optimization
coordinatesD , 5) and F , 7). F ul l M an yusesa single hash
queries
where
all
of
the
bits
in
the
boolean
array
are set. F or
entry with the set of serialized output cells as the k e y and the
e
xample,
this
can
happen
if
a
backw
ard
query
tra
v erses se v eral
set of input cells as the v alue (Figure 4.1). Each coordinate is
high-f
anin
operators
or
an
all-to-all
operator
such
as
matrix
bitpack ed into a single inte ger if the array is small enough. W e
in v ersion. In these cases, calculating the lineage of e v ery query
also create an R T ree on the cells in the hash k e y to quickly
cell is v ery e xpensi v e and often unnecessary . Man y operators
find the entries that intersect with the query . This inde x uses
(e.g., matrix multiply or in v erse)
can safely assumethat the
the dimensions of the array as its k e ys and identifies the hash
forw
ard
(backw
ard)
lineage
of
an
entire input (output) array
table entries that contain cells in particular re gions. The figure
is
the
entire
output
(input)
array
.
This optimization is v aluable
sho wsthe unserializedv ersionsof the cells for simplicity .
when
it
can
be
applied
–
it
impro
v
ed the query performance
F ul l M an yis most appropriatewhen the lineage has high
of
a
forw
ard
query
in
the
astronomy
benchmark that tra v erses
f anout because it only needs to store the output cells once.
×.
an
all-to-all-operator
by
83
If the f anoutis lo w ,F ul l O nemore ef ficientlyserializes
In general, it is dif ficult to automatically identify when
and stores each output cell as the hash k e yof a separate
the optimization’ sassumptionshold. Considera concatenate
4
W etried a lar genumber of possiblestrate giesand found that comple x operator that tak es tw o 2D arrays A, B with shapes A, n) and
encodings(e.g., computeand store the bounding box of a set of cells, C , A, m), and produces an A, n+m) output by concatenating B to
along with cells in the bounding box b ut not C
in) incur high encoding costs
A. The opt imization w ould produce dif ferent results, because
without noticeably reduced storage costs. Man y are also readily implemented
as payload or composite lineage
A’ s forw ard lineage is only a subset of the output. W e currently
Strategy
rely on the programmer to manually annotate operators where
the optimization can be applied.
BlackBox
V I I .L I N E A G E S T R A T E G Y O P T I M I Z E R
BlackBoxOpt
FullOne
FullMan y
Subzero
Description
Astr onomy Benchmark
All operators store black-box lineage
Lik e BlackBox, uses mapping lineage for b uilt-in-operators.
Lik e BlackBoxOpt, b ut uses FullOne for UDFs.
Lik e FullOne, b ut uses FullMan y for UDFs.
Lik e FullOne, b ut stores composite lineage
using P ayOne for UDFs.
Genomics Benchmark
UDFs store black-box lineage
UDFs store backw ard optimized FullOne
UDFs store backw ard optimized FullMan y
UDFs store forw ard optimized FullOne
UDFs store FullF orw and FullOne
UDFs store P ayOne
UDFs store P ayMan y
UDFs store P ayOne and FullF orw
Ha ving described the basic storage strate gies implemented
in SubZero,we no wdescribeour lineage storageoptimizer .
The optimizer’ s objecti v e is to choose a set
of a g e str ate- BlackBox
stor
FullOne
gies that minimize the cost of e x ecuting the w orkflo w while FullMan y
k eeping storage o v erhead within user -defined constraints. FullF
W e orw
FullBoth
formulate the task as an inte ger programming problem, where
P ayOne
the inputs are a list of operators, strate gy pairs, disk o v erheads,
P ayMan y
query cost estimates,and a samplew orkloadthat is used to
P ayBoth
deri v e the frequenc y with which each operator is in v ok ed in
TABLE II
the lineage w orkload. Additionally , users can manually specify
L INEAGES TRATEGIESFOREXPERIMENTS.
operator specific strate gies prior to running the optimizer .
The formal problem description is stated as:
A. Query-time Optimizer
While the lineage strate gyoptimizer picks the optimal
min x P i pi _ “ min j | x ij =1 qij ” + _ _ P ij ( disk ij + β _ r un ij ) _ x ij
lineage
strate gy , the e x ecutor must still pick between accessing
P
≤ M axD I S K
s.t.
ij disk ij _ x ij
P ij r unij _ x ij
≤ M axR U N T I M E
the lineage stored by one of the lineage strate gies,or rerunning the operator The
.
query-time optimizer consults the
8i “ P 0 ≤ j<M x ij ”
≥ 1
cost model using statistics g atheredduring query e x ecution
8i,j x ij
2 { 0, 1}
and the size of the query result so f ar to pick the best e x ecution
use r sp ecifie d stra tegie s
method. In addition, the optimizer monitors the time to access
8i,j x ij 2 U
x ij = 1
the materialized lineage. If it e xceeds the cost of re-e x ecuting
the operator , SubZero dynamically switches to re-running the
Here, x ij = 1 if operator i stores lineage using strate gy
operator . This bounds the w orst case performance×tothe
2
j , and 0 otherwise. M axD I S K is the maximum storage
black-box approach.
qij , r unij , and disk ij , are the
o v erhead specified by the user;
V I I I .E X P E R I M E N T S
a v erage
query cost, runtime o v erhead,
and storageo v erhead
costs for operator i using strate gyj as computed by the
In the follo wing subsections, we first describe ho w SubZero
cost model. pij is the probability that a lineage query in optimizes the storage strate gies for the real-w orld benchmarks
the w orkloadaccessesoperat ori , and is computedfrom the described in Section II, then compare se v eralof our linsample w orkload. A single operator may store its lineage data
eage storage techniques with black-box le v el only techniques.
using multiple strate gies.
The astronomy benchmark sho wsho w our re gion lineage
The goal of the objecti v e function is to minimize the costtechniquesimpro v eo v ercell-le v eland black-box strate gies
of e x ecuting
the lineage w orkload,preferring strate giesthat on an image processing w orkflo w . The genomics benchmark
use less storage. When an operator uses multiple strate gies
to
illustrates
the comple xityin determining an optimal lineage
store its lineage, the query processorpicks the strate gythat strate gy and that the the optimizer is able to choose an ef fecti v e
min statement in the left hand strate gy within user constraints.
minimizes the query cost. The
term picks the best query performance from the strate gies that
Ov erall, our findings are that:
j |x(ij =1 ). The right hand term penalizes • An optimal strate gy hea vily relies on operator properties
ha v e been pick ed
strate giesthat tak e e xcessi vdisk
e space or cause runtime
such as f anin, and f anout,the specific lineage queries,
slo wdo wn.β weights runtime ag ainstdisk o v erhead,
and _
and query e x ecution-time
optimizations.The dif ference
is set to a v ery small v alue to break ties. A lar_ge
is similar
betweena sub-optimal and optimal strate gycan be so
to reducingM axD I S K or M ax R U N T I M.E
lar ge that an optimizer -based approach is crucial.
• P ayload, composite, and mapping lineage are e xtremely
W e heuristically remo v econfigurations that are clearly
non-optimal, such as strate giesthat e xceeduser constraints,
ef fecti v and
e lo w o v erhead
techniquesthat greatly imor are not properly inde x edfor an y of the queries in the
pro v equery performance,and are applicable across a
w orkload(e.g., forw ardoptimized when the w orkloadonly
number of scientific domains.
contains backw ard queries). The optimizer also picks mapping• SubZero can impro v ethe LSST benchmarkqueries by
functions o v er all other classes of lineage.
up to 10× compared to nai v ely storing the re gion lineage
W esolv ethe ILP problem using the simple xmethod in
(similar to what cell-le v el approaches w ould do) and up
GNU Linear Programming Kit. The solv er tak es about 1ms to to 255× f aster than black-box lineage. The runtime and
solv e the problem for the benchmarks.
storage o v erhead of the optimal scheme is up to 30 and
Di
sk
C
os
t
70× lo wer than cell-le v el lineage, respecti v ely , and arrays
only – the go al is to be as close to these bars as possible.
1.49 and 1.95× higher than e x ecuting the w orkflo w . F ul l O neand F ul l M an yboth require considerablestorage
• Ev en though the genomics benchmark e x ecutes operators
space 6 × , 53× ) becausethe three cosmic ray operators
v ery quickly , SubZero can find the optimal mix of blackgenerate a re gion pair for e v ery input and output pix el at the
box and re gion lineage that scales to the amount of same coordinates.Similarly , both approachesincur 6× and
F ul l M an y
a v ailable storage. SubZero uses a black-box only strate
44×gyruntime o v erhead to serialize and store them.
when the a v ailable
storageis small, and switches from must also construct the spatial inde x on the output cells. The
space-ef ficient to query-optimized encodings with looser
SubZero optimizer instead picks composite lineage that only
constraints. When the storage constraints are unbounded,
stores payload lineage for the small number of cosmic rays
SubZero impro v esforw ard queries by o v er500× and and stars. This reducesthe runtime and disk o v erheads
to
×
×
×
backw ard queries by 2-3.
1.49 and 1.95 the w orkflo winputs. By comparison,this
storage o v erhead is ne gligible compared to the cost of storing
The current prototype is written in Python and uses Berk e× the
le yDB for the persistent store, and libspatialinde xfor the the intermediate and final results (which amount to 11.5
input
size).
spatial inde x.The microbenchmarksare run on a 2.3 GHz
BQx
linux serv er with 24 GB of RAM, running Ub untu 2.6.38-13- Figure 5(b) compares lineage query e x ecution costs.
and
F
Q
x
respecti
v
ely
stand
for
backw
ard
and
forw
ard query
serv er . The benchmarks are run on a 2.3 GHz MacBook Pro
x.
All
of
the
queries
use
the
entire
array
optimization
described
with 8 GB of RAM, a 5400 RPM hard disk, running OS X
F Q0S l owdoes not.B l a ck B ox
in Section VI-C whereas
must
10.7.2.
re-run each operator and tak esup to 100 secs per query .
A. Astr onomy Benc hmark
B l ack B oxO pt
can a v oidrerunning the mapping operators,
b ut still re-runs the computationally intensi v e UDFs. Storing
1500
15
15
847
1051
30
re gionlineage reducesthe cost of e x ecutingthe backw ard
1000
500
× (F ul l M an )yand 45× (F ul l O ne
queries by 34
) on a v erage.
0
1500
37
37
1666
1030
55
SubZero
benefits
from
e
x
ecuting
mapping
functi
ons and read1000
500
× f aster
ing a small amount of lineage data and e x ecutes
255 on
0
BlackBox
BlackBoxOpt
FullMany
FullOne
SubZero
F
Q
0
S
l
ow
a
v
erage.
illustrates
ho
w
the
all-to-all
optimi
zation
Storage Strategies
×
impro
v
es
the
query
performance
by
83
by
a
v
oiding
fineBlackBox
FullMany SubZero
Strategy BlackBoxOpt FullOne
grained lineage all-together .
R
un
(s
tiec
m
)
e(M
Di
B)
sk
R
un
ti
m
e
(a) Disk and runtime o v erhead
100
B. Genomics Benc hmark
Q y
(s C
,
ue
lo
r ec os
tg)
In this e xperiment,we run the genomics w orkflo wand
e
x
ecute
a lineage w orkloadwith an equal mix of forw ard
10
and backw ardlineage queries (Section II-B). There are 10
b uilt-in mapping operators,and the 4 UDFs are all payload
1
operators. In contrast to the astronomy w orkflo w , these UDFs
BlackBox
BlackBoxOpt
FullMany
FullOne
SubZero
do not e xhibit significant locality , and perform data shuf fling
Storage Strategies
and e xtractionoperationsthat are not amenableto mapping
BQ 0 BQ 2 BQ 4 FQ 0 Slow
Query
BQ 1 BQ 3 FQ 0
functions. In addition, the operatorsperform simple calculations, and e x ecute quickly , so there is a less pronounced trade
(b) Query costs. Y-ax es are log scale
of f betweenre-e x ecuting
the w orkflo wand accessingre gion
lineage. In f act, there are cases where storing lineage actually
Fig. 5. Astronomy Benchmark
the query performance. W e were pro vided×a100
56
de gr ades
In this e xperiment,we run the Astronomy w orkflo wwith matrix of 96 patients and 55 health and genetic features.
fi v ebackw ardqueries and one forw ard query as described Although the dataset is small, future datasets are e xpected to
in Section II-A. The 22 b uilt-in operatorsare all e xpressed come from a lar ger group of patients, so we constructed lar ger
as mapping operatorsand the UDFs consist of one payload datasets by replicating the patient data. The query performance
operator that detects celestial bodies and three composite and o v erheads scaled linearly with the size of the dataset and
×.
sowwe report results for the dataset scaled by 100
operators that detect and remo v e cosmic rays. This w orkflo
e xhibits considerable locality (stars only depend on neighbor -W efirst sho wthe high v ariability betweendif ferentstatic
strate gies (T able II)
and ho w thequery-time optimizer (Secing pix els), sparsity (stars are rare and small), and the queries
are primarily backw ardqueries. Each w orkflo we x ecution tion VII-A) a v oids sub-optimal query e x ecution. W e then sho w
consumestw o 512× 2000 pix el HMB) images (pro videdby ho w the SubZero cost based optimizer can identify the optimal
LSST) as input, and we compare the strate gies in T able II.strate gy within v arying user constraints.
1) Query-T imeOptimizer: This e xperimentcomparesthe
Figure 5(a) plots the disk and runtime o v erhead
for each
of the strate gies.B l ack B oxand B l ack B oxO pt
sho wthe strate giesin T ableII with and without the query-time opbase cost to e x ecute
the w orkflo wand the size of the input timization described in Section VII-A. Each operator uses
73
60
2
54
27
45
31
32
16
FullBoth
FullForw
FullMany
FullOne
PayBoth
PayMany
BlackBox
18
5
PayOne
80
60
40
20
0
80
60
40
20
0
8
8
8
12
28
2
2
2
6
16
42
SubZero1
SubZero10
SubZero20
SubZero50
R
un
(s
tiec
m
)
e(M
Di
B)
sk
89
R
un
(s
tiec
m
)
e(M
Di
B)
sk
73
BlackBox
Storage Strategies
Strategy
BlackBox
FullBoth
FullForw
FullMany
77
R
un
ti
m
e
64
Di
sk
C
os
t
161
R
un
ti
m
e
8
Di
sk
C
os
t
150
100
50
0
150
100
50
0
SubZero100
Storage Strategies
FullOne
PayBoth
PayMany
PayOne
Strategy
(a) Disk and runtime o v erhead
BlackBox
SubZero1
SubZero10
SubZero20
SubZero50
SubZero100
(a) Disk and runtime o v erhead
10.0
Q y
(s C
,
ue
lo
r ec os
tg)
Q y
(s C
,
ue
lo
r ec os
tg)
1e+02
0.1
1e+00
1e−02
BlackBox
FullBoth
FullForw
FullMany
FullOne
PayBoth
PayMany
PayOne
BlackBox
SubZero1
Storage Strategies
Query
BQ 0
BQ 1
FQ 0
SubZero100
SubZero20
SubZero50
Storage Strategies
FQ 1
(b) Query costs (static) Y-ax es are log scale.
10.0
SubZero10
Query
BQ 0
BQ 1
FQ 0
FQ 1
(b) Query costs. Y-ax es are log scale.
Fig. 7. Genomics benchmark. SubZero
X has a storage constraint of
X MB
Q y
(s C
,
ue
lo
r ec os
tg)
dynamically switching to theB l ack B oxstrate gy . Ov erall, the
× , 25
backward and forw ard queries impro v ed by up to 2 and
respecti v ely .
The pre vious section com2) Linea g e Str ate gy Optimizer:
BlackBox FullBoth FullForw FullMany FullOne
PayBoth PayMany PayOne
Storage Strategies
pared man y strate gies, each with dif ferent performance characQuery
BQ 0 BQ 1 FQ 0 FQ 1
teristics depending on the operator and query . W e no w e v aluate
the SubZerostrate gy optimizeron the genomicsbenchmark.
(c) Query costs (dynamic) Y-ax es are log scale.
Figure 7 illustrates that when the user increases storage constraints from 1 to 100MB (with unbounded runtime constraint),
Fig. 6. Genomics benchmark. Querys run with (dynamic) and without (static)
the query-time optimizer described in Section VII-A.
the optimizer picks more storage intensi v estrate giesthat
are predicted to impro v ethe benchmark queries. SubZero
mapping lineage if possible, and otherwise stores lineage using
the specified strate gyThe
.
majority of the UDFs generate choosesB l ack B oxwhen the constraint is too small, and
forw ardand backw ard-optimized
lineage that benefits
re gion pairs that contain a single output cell. As mentioned stores
in
all
of
the
queries
when
the
minimum
amount
of storageis
pre vious e xperiments, payload lineage stores v ery little binary
a
v
ailable
ˇMB).
Materializing
further
lineage
has
dimindata, and incurs less o v erhead than the full lineage approaches
S
ubZ
er
100
o
ishing
storage-t
o-query
benefits.
uses
50MB to
(Figure 6(a)). Storing both forw ardand backward-optimized
(
M
AN
Y
,
O
N
)
E
forw
ard-optimize
the
UDFs
using
, which
lineage (P ay B othand F ul l B oth
) requires signifi cantly more
reduces
the
forw
ard
query
costs
to
sub-second
costs.
This
×
o v erhead – 8 and 18.5
more space than the input arrays, and
is
because
the
UDFs
ha
v
e
lo
w
f
anout,
so
each
join
in
the
×
2.8 and 26 runtime slo wdo wn.
query
path
is
a
small
number
of
hash
lookups.
Due
to
space
Figure 6(b) highlights ho w query performance can
de gr ade
if the e x ecutorblindly joins queries with mismatchedin- constraints, we simply mention that specifying and v arying the
o v erhead constraints achie v es similar results.
de x ed lineage (e.g., backw ard-optimized lineage with forwruntime
ard
5
F ul l F or wde graded backw ard query
queries). F or e xample,
C. Micr obenc hmark
performance by 520× . Interestingly ,the BQ1 ran slo wer
The pre vious e xperimentscompared se v eralend-to-end
becausethe query path contains se v eral
operatorswith v ery
strate
lar ge f anins. This generates so man y intermediate results that gies, ho we v er it can be dif ficult to distinguish the sources
performing inde x lookups on each one is slo werthan re- of the benefits.This subsectionssummarizesthe k e ydif fer running the operators.Note ho we v er
that
, the forw ardopti- encesbetween the pre v ailing strate gies in terms of o v erhead
and query performance. The comparisons use an operator that
mized strate gies impro v ed the performance of FQ0 and FQ2
generates synthetic lineage data with tunable parameters. Due
because the f anout is lo w .
Figure 6(c) sho wsthat the query-time optimizer e x ecutes to space constraints we sho w results from v arying the f anin,
the queries as f ast as, or f aster than,
B l ack B ox
. In general, f anout and payload size (for payload lineage).
Each e xperiment processes and outputs a 1000x1000 array ,
this requires accuratestatistics and cost estimation, the opand
generateslineage for 10% of the output cells. The re×
timizer limits the query performancede gradationto 2 by
sults scaled close to linearly as the number of output cells
5
All comparisons are relati v e B
tol ack B o x
with lineage v aries.A re gionpair is randomly generatedby
0.1
Fanout: 1
Fanout: 100
selecting a cluster of output cells with a radius defined by
30
f anout , and selectingf anin cells in the same area from the
20
input array .W egeneratere gionpairs until the total number
10
of output cells is equal to 10% of the output array .The
0
payload strate gyuses a payload size of fanin× 4 bytes (the
30
payload is e xpectedto be v erysmall). W ecomparese v eral
20
F ul l O ne
backw ard optimized strate gies (F ul l M a ny
,
,
10
P ay M an ,y
P ay O ne
), one forw ardlineage strate gy
0
0
20
40
60
80
100 0
20
40
60
80
100
Fanin
(→ F ul l O ne
), and black-box (B l ack B ox
). W efirst discuss
<− PayMany <− FullMany −> FullOne
the o v erhead to store and inde x the lineage, then comm ent on
Strategy <− PayOne
<− FullOne
BlackBox
the query costs.
Figure 8 comparesthe runtime and disk o v erhead
of the
Fig. 8. Disk and runtime o v erhead
Fanout: 1
Fanout: 100
dif ferentstrate gies.F orreferenc,the size of the input array
0.100
is 3.8MB. The best full lineage strate gydif fers based on
0.075
the operator f anout.F ul l O neis superior whenf anout ≤ 5
0.050
because it doesn’ t need to create and s tore the spatial inde x.0.025
0
20
40
60
80
100 0
20
40
60
80
100
The crosso v erpoint to F ul l M an yoccurs when the cost
Fanin
of duplicating hash entries for each output cell in a re gion
Strategy <− PayMany <− PayOne <− FullMany <− FullOne
pair e xceedsthat of the spatial inde x.The o v erhead
of both
approaches increases with f anin. In contrast, payload lineage
Fig. 9. Backward Lineage Queries, only backw ard-optimized strate gies
has a muchlo wer o v erhead
than the full lineage approaches
or astronomy – e xhibit substantial locality (e.g., a v erage temand is independent of the f anin because the payload is typically
small and does not need to be encoded.When the f anout peraturereadingswithin an area) that can be used to define
increases to 50 or 100,
P ay M an yandF ul l M a ny
require less payload, mapping or composite operators. As the e xperiments
sho w , SubZero can record their lineage with less o v erhead than
than 3MB and 1 second of o v erhead. The forw ard optimized
F ul l O neis comparableto the other approacheswhen the from operatorsthat only support full lineage. When locality
is not present,as in the genomicsbenchmark,the optimizer
f anin is lo w . Ho we v er , when the f anin increases it can require
up to f anin × more hash entries because it creates an entrymay still be able to find opportunities to record lineage
for e v ery distinct input cell in the lineage. It con v er ges to ifthethe constraintsare relax ed.A v ery promising alternati v e
F ul l O newhen the f anout and f anin areis to simplify the processof writing payload and mapping
backw ard optimized
functions by supporting v ariable granularities of lineage. This
high. Finally ,B l ack B oxhas nearly no o v erhead.
Figure 9 sho wsthat the query performancefor queries lets de v elopers define coarser relationships between input and
outputs (e.g., specify lineage as a bounding box that may
that access the backw ard/forw ard lineage of 1000 output/input
cells. The performance scales mostly linearly with the querycontain inputs that didn’ t contrib ute to the output). This also
size. There is a clear dif ferencebetween F ul l M an yor allo ws the lineage system perform lossy compression.
P ay M an ,yand F ul l O neor P a y O ne
, due to the additional
I X .R E LA TE DW ORK
cost of accessing the spatial inde x (Figure 9). P ayload lineage
There is a long history of pro v enance and lineage research
performs similar to, b ut not significantly f aster than full
both in databasesystems and in more general w orkflo w
pro v enance, although the query performance remains constant
l ack B ox systems. There are se v eral e xcellent surv e ys that characterize
as the f anin increases. In comparison (not shoBwn),
in databases[8] and scientific w orkflo ws[9],
tak esbetween 2 to 20 secondsto e x ecutea query where pro v enance
fanin=1 and around 0.7 secondswhen fanin=100. Usi n ga [10]. As noted in the introduction, the primary dif ferences
from prior w ork are that SubZero uses a mix of black-box
mis-matched inde x (e.g, using forw ard-optimized lineage for
and re gionpro v enance,
e xploitsthe semanticsof scientific
backw ard queries) tak es up to tw o orders of magnitude longer
than B l ack B oxto e x ecute
the same queries. The forw ard operators(making using of mapping functions) and uses a
queries using→ F ul l O nee x ecute similarly to F ul l O ne number of pro v enance encodings.
Most w orkflo w systems support custom operators containin Figure 9 so we do not include the plots.
ing user -designed
code that is opaque to the runtime. This
D. Discussion
presentsa dif ficulty when trying to managecell-le v el(e.g.,
The e xperimentssho w that the best strate gyis tied to array cells or database tuples) pro v enance. Some systems [4],
the operator’ slineage properties, and that there are orders [11] model operators as black-box es where all outputs depend
on all inputs, and track the dependenciesbetweeninput and
of magnitude dif ferences between dif ferent lineage strate gies.
output datasets. Ef ficient methods to e xpose, store and query
Science-oriented lineage systems should seek to identify and
e xploit operator f anin, f anout, and redundanc y properties.cell-le v el pro v enance is an area of on-going research.
Man y scientific applications – particularly sensor -based or Se v eralprojects e xploit w orkflo wsystemsthat use high
image processing applications lik e en vironmental monitoring
le v elprogramming constructswith well defined semantics.
llllll
llllll
Di
sk
llllll
llllll
llllll
llllll
llllll
llllll
llllll
llllll
R
un
ti
m
e
llllll
llllll
llllll
llllll
llllll
R
un
ti
m
e
(s
ec
)
Di
sk
(M
B)
llllll
l
l
l
l
Q
ue
r
y
C
os
t
(s
ec
)
l
l
l
l
RAMP [12] e xtendsMapReduceto automatically generate redundantlineage that is generatedand stored to impro v e
× while using up to 70× less
lineage capturing wrappers around Map and Reduce operators.
query performance by up to 10
Similarly , Amsterdameret al [13] instrument the PIG [14] storagespaceas comparedto e xistingcell-basedstrate gies.
frame w orkto track the lineage of PIG operators.Ho we v er The
,
optimizer successfully scales the amount of lineage stored
user defined operators are treated as black-box es, which limits
basedon application constraints,and can impro v ethe query
their ability to track lineage.
performance of the genomics benchmark, which is amenable
black-box only strate gies..In conclusion, SubZero is an
Other w orkflo w systems (e.g., T a v erna [3] and K epler to
[15]),
processnestedcollections of data, where data itemsmay be important initial step to mak einteracti v elyqueryi ngfinelineage a reality for scientific applications.
imagees or DN A sequences. Operators process data itemsgrained
in a
collection, and these systems automatically track which subR EFERENCES
sets of the collections were modified, added, or remo v ed [16],
[1] Z. Iv ezi, J. T yson, E. Acosta, R. Allsman, S. Anderson,
et al., “LSST:
[17]. Chapman et. al [18] attach to each data item a pro v enance
From science dri v ers to reference design and anticipated data products. ”
tree of the transformationsresulting in the data item, and
[Online]. A v ailable: http://lsst.org/files/docs/ov ervie wV2.0.pdf
[2]
propose ef ficient compression methods to reduce the tree size. M. Stonebraker , J. Becla, D. J. DeW itt, K.-T . Lim, D. Maier , O. Ratzesber ger and
, S. B. Zdonik, “Requirementsfor sciencedata basesand
Ho we v er
these
, systemsmodel operatorsas blac k-box es
and
SciDB, ” inCIDR, 2009.
data items are typically files, not records.
[3] T . Oinn, M. Greenwood, M. Addis, N. Alpdemir , J. Ferris, K. Glo v er ,
C. Goble, A. Goderis, D. Hull, D. Marvin, P. Li, P. Lord, M. Pocock,
Databasesystemse x ecute
queries that processstructured
M. Senger R.
, Ste v ens,
A. W ipat,and C. Wroe, “T a v erna:
lessonsin
tuples using well defined relational operators, and are a natural creating a w orkflo w en vironment for the life sciences,Concurr
” in
ency
tar get for a lineage system. Cui et. al [19] identified ef ficient and Computation: Pr actice and Experience
, 2006.
[4] H. K uehn, A. Liberzon, M. Reich, and J. P. Mesiro v , “Using genepattern
tracing procedures for a number of operator properties. These
for gene e xpression analysis,
Curr.
” Pr otoc. Bioinform.
, Jun 2008.
procedures are then used to e x ecute backw ard lineage queries.
[5] J. W idom, “T rio: A system for inte grated management of data, accurac y ,
and lineage, ” T ech. Rep., 2004.
Ho we v erthe
, model does not allo w arbitrary operators to
generate lineage, and models them as black-box es. Section[6]
V P. T amayo,Y.-J. Cho, A. Tsherniak, H. Greulich, et al., “Predicting relapsein patients with medulloblastomaby inte gratinge vidence
describesse v eralmechanisms(e.g., payload functions) that
from clinical and genomic features. ”J ournalof Clinical Oncology, p.
can implement man y of these procedures.
29:14151423, 2011.
T rio [5] w as the first database implementation of cell-le v [7]
el R. Ik eda and J. W idom, “P anda: A system for pro v enanceand
data, ”in IEEE Data Engineering Bulletin, 2010. [Online]. A v ailable:
lineage, and unified uncertainty and pro v enance under a singlehttp://ilpubs.stanford.edu:8090/972/
data and query model. T rio e xplicitly stores relationships [8] J. Chene y , L. Chiticariu, and W. C. T an., “Pro v enance in databases: Wh y ,
ho w , and where, ”Foundations
in
and T r ends in Databases
, 2009.
between input and output tuples, and is analogous to the full
[9] S. Da vidson, S. Cohen-Boulakia, A. Eyal, B. Ludscher , T . McPhillips,
pro v enance approach described in Section V.
S. Bo wers,M. K. Anand, and J. Freire, “Pro v enance
in scientific
w orkflo w systems. ”
The SubZero runtime API is inspired by the PASS [20],
[21] pro v enance
API. PASS is a file system that automat- [10] R. BOSE and J. FREW , “Lineage retrie v al for scientific data processing:
A surv e y , ” AinCM Computing Surve, ys
2005.
ically stores pro v enance
information of files and processes. [11] J. Goecks, A. Nekrutenko, J. T aylor ,and T . G. T eam,“Galaxy: a
comprehensi vapproach
e
for supporting accessible,reproducible,and
Applications can use the libpass library to create abstract
transparentcomputationalresearchin the life sciences .in
” Genome
pro v enance objects and relationships between them, analagous
Biolo gy, 2010.
to producing cell-le v ellineage. SubZero e xtendsthis API [12] R. Ikeda, H. P ark, and J. W idom, “Pro v enancefor generalized
map and reduce w orkflo ws, in
” CIDR, 2011. [Online]. A v ailable:
to support the semanticsof common scientific pro v enance
http://ilpubs.stanford.edu:8090/985/
relationships.
[13] Y. Amsterdamer , S. Da vidson, D. Deutch, T . Milo, J. Sto yano vich, and
V. T annen,“Putting lipstick on pig: Enabling database-stylew orkflo w
pro v enance, ”PVLDB
in
, 2012.
[14] C. Olston, B. Reed, U. Sri v asta v a, R. K umar , and A. T omkins, “Pig latin:
This paper introduced SubZero, a scientific-oriented lineage
A not-so-foreign language for data processing, ”SIGMOD
in
, 2008.
storage and query system that stores a mix of black-box and[15] I. Altintas, C. Berkle y , E. Jae ger , M. Jones, B. Ludascher , and S. Mock,
fine-grained lineage. SubZero uses an optimization frame w ork “K epler: an e xtensiblesystem for design and e x ecutionof scientific
w orkflo ws, ” in
SSDM, 2004.
that picks the lineage representationon a per -operatorba[16] M. K. Anand, S. Bo wers,T . McPhillips, and B. Ludscher “Ef
, ficient
sis that maximizes lineage query performancewhile staying
pro v enance storage o v er nested data collections,
” in
, 2009.
EDBT
within user constraints. In addition, we presented
r e gionlin- [17] P. Missier ,N. P aton,and K. Belhajjame, “Fine-grained and ef ficient
ea g, ewhich e xplicitly represents lineage relationships between lineage querying of collection-basedw orkflo wpro v enance,in” EDBT,
2010.
sets of input and output data elements, along with a number[18] A. P. Chapman,H. Jagadish,and P. Ramanan,“Ef ficient pro v enance
of ef ficient encoding schemes. SubZero
is hea vily optimized
storage, ” inSIGMOD, 2008.
for operators that can deterministically compute lineage from[19] Y. Cui, J. W idom, and J. L. V iener , “T racing the lineage of vie w data in a
w arehousing en vironment, A
” in
,
CM T r ansactions on Database Systems
array cell coordinates and small amounts of operator -generated1997.
metadata.UDF de v elopers
e xposelineage relationshipsand [20] K.-K. Muniswamy-Reddy D.
, A. Holland, U. Braun, and M. Seltzer ,
“Pro v enance-a w are storage systems,
” in, 2005.
NetDB
semanticsby calling the runtime API and/or implementing
[21] K.-K. Muniswamy-Reddy J.
, Barillariy , U. Braun, D. A. Holland,
mapping functions.
D. Maclean,M. Seltzer and
,
S. D. Holland, “Layering in pro v enanceOur e xperimentssho wthat man yscientific operatorscan
a w are storage systems, ” T ech. Rep., 2008.
X . C ONCLUSION
use our techniques to dramatically reduce the amount of
Download