>> Kaushik Chakrabarti: All right. Let's get started here. Good morning everyone. It is my
pleasure to introduce today's speaker, Alexander Kalinin. Alexander is a PhD student at the
Computer Science Department at Brown University; in his thesis work he is focusing on developing data exploration techniques for large scientific data sets. Specifically, he has developed a couple of data exploration frameworks, one called Searchlight and another called Semantic
Windows. I’m sure he is going to talk more about those today. So without any further delay I’ll
hand it to Alexander so he can start his talk.
>> Alexander Kalinin: Right. So I'll be talking about integrated search and exploration over large multidimensional data, which is, as Kaushik said, basically my main dissertation work. And this is joint work with my advisors at Brown, Professor Ugur Cetintemel and Professor Stan Zdonik.
Basically, what we deal with is searching for the interesting within big data, and we assume exploratory analysis tends to be very ad hoc and repetitive, right? Because the question is not well defined and users have a hard time identifying what is interesting, they tend to ask some queries, and then after we find something we go and find something else, because users change their preferences quite frequently, they change their constraints quite frequently.
The interesting is also not only hard to find, because we have a lot of different objects of interest, a lot to choose from, but also hard to compute, because objects of interest tend to [indiscernible] of data under the hood, and I will talk about this more.
At the same time, as I just said, we want fast online results, because we don't want the users to wait for hours for queries to finish, right? We want to identify [inaudible] some results really quickly. And to give a more concrete example of what we are looking for, let's say this is the Sloan Digital Sky Survey, SDSS, data set. It contains information about different stellar objects, objects in the sky, and users might ask, as an example, three types of queries. The first type we call first-order queries: we might want to find celestial regions with some brightness constraints.
The idea here is that we can specify ranges for the shape of the regions, since the user might not be sure about the correct shape, and we specify constraints such as brightness. We might add more constraints as well. And the problem here is a very large search space, because the regions can be anywhere; the user doesn't specify where to look for them. So we have to look here and here, because the regions can overlap, they can have different shapes, they can be anywhere in the data set. This creates a really large search space of possible choices.
Or we might move to something we call higher-order queries, where we look at, let's say, pairs of regions of interest, regions that are tied or linked by some constraints, like maybe they have similar brightness or similar spectral characteristics or something like that. But again, the user specifies what they mean by this [indiscernible].
Or we might go to optimization queries, where basically we are looking for something that maximizes some objective function, like maybe we are looking for the maximally bright region. This is one example, and we believe this is a very general framing. For example, we can go to a more common problem like subsequence matching, which can also, in some sense, be seen as a constraint-based problem, right? Again, this is an example from the [indiscernible] data set, which among other things contains recordings for ICU patients, so we might do something like query by drawing, where we are looking for similar subsequences. It's a well-researched area, of course; what I would emphasize is that it's basically the same kind of query in nature, right? We're not told where to look, so we just have some constraints. Here it might be the distance from the query sequence to the result subsequence, and maybe some other constraints as well; we might add constraints on the amplitude of the signal, or maybe constraints on the neighborhood of the found results, and so on.
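As a toy illustration of reading subsequence matching as a constraint problem, here is a minimal sketch; the Euclidean distance and the extra amplitude constraint are assumptions for the example, not the specific constraints of the data set mentioned above.

```python
# One decision variable (the start position of the match) plus user-defined
# constraints on the matched window: distance to the query sequence and the
# amplitude of the signal inside the window.

def is_match(series, query, start, max_dist=2.0, max_amp=5.0):
    window = series[start:start + len(query)]
    if len(window) < len(query):
        return False
    dist = sum((a - b) ** 2 for a, b in zip(window, query)) ** 0.5   # distance constraint
    amplitude = max(window) - min(window)                            # extra constraint
    return dist <= max_dist and amplitude <= max_amp

# The search space is all start positions; a solver would prune it rather than
# enumerate it as this brute-force check does:
# matches = [s for s in range(len(series)) if is_match(series, query, s)]
```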
So we believe that there are actually two sides to such data exploration problems. The first side is search complexity: the search space is really large, so enumeration is not feasible; we cannot just enumerate all possible regions and check them one by one. Constraints might also be more elaborate. The one before, for example, was average brightness, so we might have aggregates, but we might also have something else, similarities, things more complex than that. Again, we might not only decide to study the region itself
but also maybe the neighborhood around it and maybe compute the brightness of the
neighborhood, the brightness of the region itself, compare them somehow, and we assume
that it might be quite complex logic where the user actually defines it as a part of the query. So
it's not necessarily a predefined range of constraints.
On the other hand, we have what we call data complexity because we still have a large data set
and we have to perform a lot of out of core computation and this might be expensive. What I
mean here is that to check the brightness of a region, for instance, we need to go and retrieve the corresponding objects. And the region might include a lot of stars, planets, whatever, so we have to go and compute all these functions, all these aggregates that the user references. So we might incur a lot of computation overhead.
And what I believe is that these two sides of data exploration are covered really well by two different techniques. Search complexity is covered really well by constraint programming, which is a very general technique that deals with exploring large search spaces efficiently, and it is very customizable; I will talk about this a little bit later. At the same time, data complexity, this kind of out-of-core computation, is handled really well by traditional DBMSs.
So I guess the next question is: can they actually be used? Since we have large data sets, can we use traditional DBMSs for this kind of data exploration? The problem with traditional DBMSs is that they don't have nice support for exploratory constructs. In some sense, when we look at regions, it more resembles power sets. If you think about, for example, [indiscernible] arrays, we are not looking at every single point in the array; we're looking at a subset of points. It resembles a power set, but databases don't have support for power-set constructs, and they have limited support for user-defined logic. For example, if I'm looking for the brightest region, if I'm solving an optimization problem, they basically don't allow you to define an objective function. They also don't allow you to easily define complex constraints, and [indiscernible] are treated as black boxes. And they don't allow you, for example, to perform customized search, like maybe steering the search in a required direction, which is common, right? Defining search heuristics is really common in search-based problems.
Another problem is that they support interactivity poorly, which is kind of a no-no for interactive exploration. Still, it's possible to express some of the simple queries in SQL. We can take a simple query and basically do [indiscernible] in this kind of batch processing: you can still enumerate all possible regions by dividing data into cells, for instance, then join the cells, combine them together by using something like recursive [indiscernible], for instance, and then perform filtering. [indiscernible] common way to do such processing in database systems.
Such queries are really hard to maintain and really hard to optimize, and this is a very simple query. If we had more constraints and tried to express more queries, the query would grow really fast. At the same time, such queries don't allow any interactivity, because almost surely we'll have blocking operations. Here we have a blocking group-by, which corresponds to the first point here: dividing data into cells requires a group-by.
On the other hand, expressibility-wise, such exploration queries are very easy to express in constraint programming. Constraint programming specifies these queries by using a bunch of decision variables. Here, for instance, objects of interest can be easily expressed by four variables defining the leftmost corner of the region and the possible ranges of the lengths, and then the user defines a bunch of constraints between these variables. Here we have only one constraint, which is average brightness, so I just put it as avg_br, for instance. But we can of course add more constraints; we can define what it means to have a bright region. If we want to express, for example, the neighborhood of a region, we can add more decision variables and, again, tie them together using constraints, and so on.
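To make that formulation concrete, here is a minimal sketch of such a region query written against Google's Or-Tools CP solver, which the Searchlight work discussed later builds on. The toy brightness grid, the threshold of 0.7, and the habit of checking the content constraint outside the solver are all illustrative assumptions; in the real system, content constraints are custom, data-aware constraints.

```python
# A sketch only: four decision variables describe a rectangular region
# (corner x, y and lengths lx, ly). Shape constraints live in the CP model;
# the content constraint (average brightness) is checked against a toy
# in-memory grid after the solver binds the variables.
import random
from ortools.constraint_solver import pywrapcp

random.seed(42)
N = 50
BR = [[random.random() for _ in range(N)] for _ in range(N)]   # toy brightness grid

def avg_brightness(x, y, lx, ly):
    cells = [BR[i][j] for i in range(x, x + lx) for j in range(y, y + ly)]
    return sum(cells) / len(cells)

solver = pywrapcp.Solver("region_query")
x = solver.IntVar(0, N - 1, "x")          # leftmost corner of the region
y = solver.IntVar(0, N - 1, "y")
lx = solver.IntVar(2, 8, "lx")            # user-given ranges for the lengths
ly = solver.IntVar(2, 8, "ly")
solver.Add(x + lx <= N)                   # shape constraints: stay inside the array
solver.Add(y + ly <= N)

# Variable-value heuristic: pick the first unbound variable, try its minimum value.
db = solver.Phase([x, y, lx, ly], solver.CHOOSE_FIRST_UNBOUND, solver.ASSIGN_MIN_VALUE)

solver.NewSearch(db)
found = 0
while solver.NextSolution() and found < 5:
    region = (x.Value(), y.Value(), lx.Value(), ly.Value())
    if avg_brightness(*region) >= 0.7:    # content constraint: avg_br >= threshold
        print("bright region:", region)
        found += 1
solver.EndSearch()
```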
When you define such a problem and feed it to a CP solver, the solver is actually very good at dynamically building the search space and exploring it in very efficient ways. CP solvers have a large number of methods for exploring large search spaces, a large number of heuristics already defined, they support a variety of constraints out of the box, and they are also highly extensible, which is really important for ad hoc exploration if the user wants to define their own extensions. For example, users can easily define new functions or new constraints. They can easily define new search heuristics, and that's important because in constraint programming it is often the case that for a specific problem we want to define a different search heuristic that will allow us to find results much quicker than, say, a general heuristic. This is supported in all CP solvers in a very natural way.
To give a quick refresher on how constraint programming works, this is the traditional backtracking CP solver, which builds a search tree starting from the initial values of the variables. Here, for example, we might have two variables which are not yet bound, because they still have a range of choices, like [indiscernible] and declination in SDSS, for instance. Then the traditional CP solver goes with the search heuristic, which is defined by the user or might be one of the standard heuristics. Here this is a very simple, what is called variable-value heuristic, where first we choose a variable which is not bound yet and then pick some value or values from its domain. So, for example, the first step might be [indiscernible] ascension, and this choice is highly customizable, right? We might decide based on, for example, sampling, or on just random choice, or whatever. It depends on the task at hand and it's really easy to define.
Then, after picking the variable, the solver, here, divides the domain into three parts and picks the middle part, again based on some heuristic, maybe on some estimations. And it continues to do so until we reach a leaf of the tree, where all the variables are bound, and that's where we have a solution. Or, if we are lucky enough, and that's what this is all about, we might get something like this here, where we explore the right branch of the tree. So [indiscernible] two branches, the left branch and the right branch, which are disjoint but cover the whole search space. If we are lucky, at this point the solver might prove that there is no way the constraints can be satisfied for these ranges, so [indiscernible] can prune the entire subtree from the search space, and this is really important for constraint programming. That's what CP solvers excel at; sometimes it's called a search fail. And we believe that for efficient exploration we need to combine these two technologies to make them work together as a single system. And that's basically the motto of our work: CP plus DBMS for data-intensive exploration.
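As a tiny self-contained sketch of the backtracking scheme just described, the following keeps interval domains, picks an unbound variable, splits its domain, and prunes a whole subtree when the interval bounds prove the constraint can never hold; the constraint and the domains here are made up for illustration, and this is not the Or-Tools or Searchlight search code.

```python
# Toy backtracking CP search: domains are [lo, hi] intervals, a variable-value
# heuristic splits a domain, and subtrees are pruned when the bounds prove the
# constraint cannot be satisfied anywhere below ("search fail").

def search(domains, feasible, on_solution):
    # feasible(domains) -> False if the constraint provably fails on these
    # intervals, True otherwise (which may still be a false positive).
    if not feasible(domains):
        return                                  # prune the entire subtree
    unbound = [v for v, (lo, hi) in domains.items() if lo < hi]
    if not unbound:                             # all variables bound: a leaf
        on_solution({v: lo for v, (lo, _) in domains.items()})
        return
    v = unbound[0]                              # variable heuristic: first unbound
    lo, hi = domains[v]
    mid = (lo + hi) // 2                        # value heuristic: split the domain
    for part in ((lo, mid), (mid + 1, hi)):     # left branch, then right branch
        child = dict(domains)
        child[v] = part
        search(child, feasible, on_solution)

# Constraint: x + 2*y >= 40. With interval domains the left-hand side can be
# bounded from above; if even the maximum cannot reach 40, the subtree fails.
def feasible(domains):
    (_, xhi), (_, yhi) = domains["x"], domains["y"]
    return xhi + 2 * yhi >= 40

search({"x": (0, 15), "y": (0, 15)}, feasible, print)
```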
But first I want to talk about our first framework, which we call Semantic Windows. Semantic Windows was our first step towards constraint-based exploration: it already allowed us to define queries based on constraints, but it was a very custom, [indiscernible] solution for a specific problem. It used a custom solver, not a general solver, a solver written by us, which used utility-based search, which I'll talk about, and a kind of sampling-based exploration that allows us to steer the search in promising directions. So it wasn't general; as I mentioned, it supported only first-order queries, and by first-order queries I mean queries with just simple constraints, like aggregate-based constraints. It was not general, it was very specific, and it was hard to extend, but I will talk about lessons learned later.
So again, going back to this example, this is basically an example of a query: we go and search for all the regions satisfying the constraints, and Semantic Windows supports two types of constraints. The first type we actually call conditions in that work, these shape-based conditions or shape-based constraints, which just specify, for example, the shape of the region; and it is important to distinguish them from the content-based conditions, the content-based constraints, because shape-based constraints don't reference data, they just define the shape of the region, they define what we are looking at without looking at the data. Content-based constraints, on the other hand, go to the data; they are more semantic in nature. And we call all multidimensional regions satisfying such constraints Semantic Windows.
And again, we didn't want to go with the SQL approach, which I outlined before with these huge queries, although we of course performed an experimental evaluation comparing ourselves with that; we decided to make the process more search-like, basically as a solver would do it. So we dynamically enumerate all possible windows, subject to pruning of course. We couldn't use extensive pruning, because we strived for the exact result and sampling doesn't allow you to perform 100 percent safe pruning, but we tried to at least take some information, like from the shape-based constraints, for instance, to perform at least some pruning. And the gist here was that we wanted to focus on online results. We don't want to just enumerate windows and check them one by one; we want to go to the most promising areas of the search space. So we basically sorted the windows, the enumerated windows, yeah?
>>: Is this sampling [inaudible] you don't want all answers?
>> Alexander Kalinin: Say that again?
>>: What about completeness? You don't want all answers?
>> Alexander Kalinin: What I meant is actually the other way around. We want all answers; we want the exact result. And the sampling wouldn't allow us to provably prune regions from consideration, because you would be pruning based on some estimation which is not 100 percent correct. So in some sense we guarantee [indiscernible] for provable pruning of regions. That's why we didn't perform extensive pruning in this work.
So we enumerated all windows and we sorted them in order of utility, which I'll talk about on the next slide, but it basically defines how promising the window is to be output to the user. This way we can quickly move to the promising parts of the search space and identify and output results to the user. And this is kind of a solver, you can say; it also builds some kind of a search tree, and it is driven by this utility. Utility is basically a combination of benefit and cost. Benefit measures how close the window is to satisfying the conditions. For example, if we have a brightness constraint, we might estimate the brightness of [indiscernible] window by using sampling and make some guess at how close it is to the user-defined constraints.
And at the same time, since we perform out-of-core computations, we also measure the cost of the window, because again, we don't want to treat the search problem and the data problem separately; we want to work them closer together. So we have to measure how expensive it would be to read the window from disk, to read the corresponding data from disk. We measure it in cells, but basically it is measured in the data we have to read. And we made some adjustments for skewed data, because you have some skewness in data, of course. We also made some provisions for caching, for instance, because we have a lot of overlapping windows; one window brings some data with it, we cache it, and we have to make sure that we count that in the cost function so as not to overestimate,
>>: [inaudible] constraint using [inaudible]?
>> Alexander Kalinin: So here we don't. I will talk about this later when we actually use a constraint [indiscernible]. Here we wrote it ourselves; it's a custom solver in C++, a kind of additional thing on top of PostgreSQL as the DBMS. So it's a completely custom thing. I think we didn't use anything off-the-shelf except for PostgreSQL here.
And so when we measure benefit and cost, what we try to do is go for the beneficial windows which are hopefully more or less cheap. You can say that our search was basically best-first search, when you think about it. We have a kind of priority queue of regions ordered by utility, so we sort them in the order of their utilities. For example, here we will look at region three, and if it satisfies the constraints we output it to the user; and we do need to check whether it actually satisfies the constraints, because utilities are sampling-based and we haven't read the data yet. Then we can generate new windows from that, by, for example, expanding regions more and more, put them in the queue, and [indiscernible] until all regions have been explored. I want to emphasize that we have to keep going, right, because since we want the exact result we cannot just stop at some point and say we are done before we have explored everything; sampling would not allow us to say for sure, so a utility of 0.98, for example, doesn't mean that a region can be safely pruned. We have to go and check. So we have to go through the whole data set, through the whole search space.
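A rough sketch of this utility-driven, best-first loop, in one dimension, with made-up data, thresholds, and a crude sample standing in for the estimation; this is not the actual Semantic Windows code.

```python
import heapq
import random

random.seed(0)
DATA = [random.random() for _ in range(10_000)]   # stand-in for the stored array
SAMPLE = DATA[::100]                              # crude 1% sample used for estimation
THRESHOLD = 0.55                                  # "bright enough" average (assumed)
STEP, MAX_LEN = 50, 200                           # window granularity and maximum length

def estimated_benefit(start, length):
    # How close the sampled average is to satisfying the brightness condition.
    vals = [v for i, v in enumerate(SAMPLE) if start <= i * 100 < start + length]
    avg = sum(vals) / len(vals) if vals else 0.0
    return min(avg / THRESHOLD, 1.0)

def verify(start, length):
    # The expensive, out-of-core part: read the real data and check exactly.
    window = DATA[start:start + length]
    return sum(window) / len(window) >= THRESHOLD

queue = []                                        # max-heap on utility (heapq is a min-heap)
for start in range(0, len(DATA) - STEP + 1, STEP):
    cost = STEP                                   # cells to read; caching would refine this
    heapq.heappush(queue, (-estimated_benefit(start, STEP) / cost, start, STEP))

while queue:                                      # exact answer: explore the whole space
    _, start, length = heapq.heappop(queue)
    if verify(start, length):
        print("result window:", start, length)    # output as soon as it is verified
    longer = length + STEP                        # expand into a larger candidate window
    if longer <= MAX_LEN and start + longer <= len(DATA):
        heapq.heappush(queue, (-estimated_benefit(start, longer) / longer, start, longer))
```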
One more system-level thing, about this kind of CP-DBMS optimization, is that we actually encountered a problem here: we have a lot of small reads dispersed around the data file, so we go and read something here, here, here, here, and this creates a problem with seeks and a problem with thrashing. The seeks are understandable, right, because we sort windows in the order of their utilities, and we didn't take seek costs into consideration here, because it would be too prohibitive, I guess. So when we go to promising regions, to bright regions, what might happen is that we jump between interesting parts of the search space, so we cannot force sequential reads here. At the same time, window locality doesn't imply disk locality, because here we are talking about a relational system, yeah?
>>: So my question [inaudible]? Like, what's the basic data layout, what's the basic [inaudible] data, and how does it [inaudible]? Because you're looking at multidimensional data and there are a bunch of choices for how to organize it on disk, and none of this discussion is very specific to how you [inaudible].
>> Alexander Kalinin: Right. So that's actually what I was going to say about window-locality versus disk-locality. If we look, for example, at something like SDSS, this is basically a two-dimensional array with two coordinates. When we performed the evaluation we tried different schemes. The underlying system is a relational system, so you have some choices for how to put the data there. We tried different things, like maybe ordering it by one coordinate, or maybe doing something like [indiscernible] ordering, for instance. You try to force some locality, because you're still looking at contiguous regions, right? You want to force some locality, but since tuples are laid out linearly you cannot enforce [indiscernible] locality, basically, right? So in all experiments we tried different layouts for the relational system.
>>: So you have a [inaudible]? What's the architecture of your system? It looks like you have a middle layer on top and underneath you have Postgres. And the queries may have certain constraints in them, but they also have just regular predicates that the [inaudible] system can very well evaluate, right? So when you're given a query, what do you do? Do you send some part of the query down to the [inaudible] and then do the rest of it on top, or how do you [inaudible]?
>> Alexander Kalinin: So in this system it's not a SQL query; it's a kind of custom query where you specify, I want these shape-based constraints for the shape of the region, and I want these content-based constraints, like average brightness, for instance. So it's not SQL-based. We don't push the whole query into the database, but we do issue some SQL. What happens is that this custom solver sits on top of a database system, not inside it, just on top. Of course, to compute something, for example to verify a window, we need to compute the average brightness. So to verify constraints we need access to data, and we do this access via SQL queries. That's how it is done. If you specify some additional constraints, some of them, like average brightness, are used for steering the search, while others are basically just pushed down, without further consideration, into the corresponding SQL queries.
>>: [inaudible] writing the database query processing functionality, why use a database? Couldn't you have used a file and just used [inaudible], if all you're doing is pulling the data out and doing your processing [inaudible]?
>> Alexander Kalinin: Well, we believe pulling the data out is actually very expensive, because we envision that the data is still stored in the DBMS, right?
>>: [inaudible]? Are you pulling data differently, and coming up with a priority order for pulling the windows? Given a window, are you pulling the data out of the database and processing [inaudible]?
>> Alexander Kalinin: No. We basically generate a corresponding query. For instance, to verify the constraints we generate the SQL query, push it into the database, and just get the result. So we try not to move the data around.
>>: Some of these data sets are mostly [inaudible]. So currently the architecture seems to suggest you're doing [inaudible] and query processing together in an online fashion. But think of an architecture where you build something like views, and these views could be way more complicated because they don't need to be updated. So if you have aggregate constraints, and you think some part of the data is interesting, you could compute aggregates on that part of the data exhaustively and run queries using those views, if you constrain that the other data [inaudible]?
>> Alexander Kalinin: Right. So the problem we actually saw is that you would have to enumerate everything. If you create such an index, let's say we assume that the users might ask about different functions; here I'm asking for average brightness, but think about SDSS: it has, I think, 500 different attributes for its objects, so users might ask queries based on any of these attributes. We assume that we want to be general. If I know that I'm going to ask a lot of queries based on brightness, for instance (and even in our case brightness is actually a function of multiple attributes, so it becomes a little more tricky than that), let's say we just touch one attribute; we might create an inverted index, but this would require us to enumerate all possible regions, since, as you saw, windows can overlap and can be [indiscernible] arbitrarily, because the user can change the shape of the region via the shape-based constraints. And it's not only about bright regions. We cannot just go to the bright part of the data and enumerate everything there, because some other user might actually need less bright values, or maybe it's all about a range of spectral characteristics for these regions. We also played with velocity: maybe you want to go and see a region with high-velocity objects, for instance, so one user is interested in high-velocity regions and another might be interested in slower regions, I guess.
So in some sense we assume that we cannot identify the interesting part up front; that is kind of a different problem. If we often found the results in a particular part of the data, a particular part of the search space, we might be able to do that. We wanted to be general in the sense that the user's constraints, the results for the user's queries, might touch the whole data set, and that's why we didn't do any indexing for this.
>>: And [inaudible] did you mention that you have to touch the entire data set in order to answer [inaudible] that you have [inaudible]?
>> Alexander Kalinin: Yes.
>>: And on the middle [inaudible] trying to decide [inaudible] for access to the data. What happens if you just do a sequential scan of the data and [inaudible] the predicates as part of doing the sequential scan? Because anyway you have to scan all the data once, it seems.
>> Alexander Kalinin: Right, good question. A sequential scan is actually pretty good. If you have a small search space and a small number of possible windows, you can go and enumerate everything, which was basically the SQL query I showed at the beginning [indiscernible]. The problem with such a query is that it blocks, basically. You express it as a SQL query, you put it inside the database, and then you have to wait until the results pop up. The sequential scan goes in a specific order; if the results are at the end, you have to wait an hour for the query to finish.
>>: But you are exposing some [inaudible] results to the user, and the user can come and, if it is not interesting stuff, maybe stop the query [inaudible]?
>> Alexander Kalinin: She can, yeah. So this was the idea, the interactive results, right? That's why we try to steer the search in the required direction, so we can quickly start outputting results to the user. And then the user might interrupt the query if she wants to. So the idea was to do it as fast as possible.
>>: Do you still need to do a complete scan before you can output even one result, right?
>> Alexander Kalinin: No. As I showed with the [indiscernible], we have this kind of best-first search: when we take the currently best region we check it immediately, and if it satisfies the constraints we immediately output it to the user.
>>: [inaudible] there could be certain constraints where you may need to look [inaudible]?
>> Alexander Kalinin: So if your constraint touches the whole data set, yeah, you're screwed, because you can't do anything about it. We assume that the constraints are tailored to the window, so hopefully they won't touch a lot of data. If they do touch a lot of data, we cannot do much about it, because if you have to compute it, you have to compute it. So here, since the window is more or less small, I guess, compared to the whole data set, the whole sky for instance, you only need to read the data corresponding to the window itself in order to verify it, and that's why we can output results quickly.
>>: So it seems like you use the shape-based conditions first to do some filtering, and then you do the other conditions in your middle layer?
>> Alexander Kalinin: Right. You can say that we use shape-based conditions to generate all the candidate windows, in some sense. We generate them from the shape-based conditions because the user tells us, for instance, that on the X axis the window must be from 3 to 5 degrees in right [indiscernible]; so we know what we have to generate. We use shape-based conditions to generate candidate windows, but then for the content-based constraints, that's when we assess them by using sampling, and that's when we go to the data to verify them.
This is the main idea here. Again, interactivity: we want to identify and output results very quickly. Possibly the user might want to stop the query at some point and not go for the final result, but if she doesn't, eventually we have to get everything. And the problem, as I was saying, is that we not only have small reads dispersed around the data file, we also have this window-locality versus disk-locality problem: windows are semantic constructs, logical constructs, and it doesn't mean that the corresponding data pages are going to be close to each other in the database files, so we have a thrashing problem. We measured this extensively, ran a lot of probes in PostgreSQL, and the seek times were actually very bad.
So we decided to do a small trick, basically a traditional trick, I guess, using prefetching: instead of reading just the window, we read around the window. Again, we have to read everything eventually, so it's fine if we read something extra into the cache, because it would result in better [indiscernible]; with prefetching we basically just read in larger portions.
The trick here, though, is that you don't know how much to prefetch, because if we start prefetching too much we start delaying online results: even if I have identified a promising window, if in addition to reading the window data I also start reading a lot of data around it, we delay the online results. So we went with a progress-driven scheme where we watch what is going on right now: while we are still finding new results we prefetch a small amount, which I think increases over time, but we still keep it a small value so we don't hurt interactivity; but when we stop finding new results (remember that we have to read everything to confirm that), we start increasing it exponentially, which in some sense becomes similar to performing a scan of the remaining data just to confirm that we aren't missing anything.
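A sketch of the shape of that policy follows; the parameter values and class name are assumptions, and this is a simplification of what the system actually does.

```python
# Progress-driven prefetching: keep the prefetch window small while new
# results are still being found, and grow it exponentially once the search
# is only confirming that nothing was missed.

class AdaptivePrefetcher:
    def __init__(self, base_cells=4, max_cells=1024, growth=2):
        self.base = base_cells      # small prefetch while results keep arriving
        self.max = max_cells        # cap, approaching a scan of the remaining data
        self.growth = growth
        self.current = base_cells
        self.results_since_last_read = 0

    def record_result(self):
        self.results_since_last_read += 1

    def next_prefetch_size(self):
        if self.results_since_last_read > 0:
            self.current = self.base                       # stay interactive
        else:
            # no new results lately: widen the reads exponentially
            self.current = min(self.current * self.growth, self.max)
        self.results_since_last_read = 0
        return self.current
```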
>>: So the benefit-based priority ordering, is that computed up front, or is it incrementally maintained as you're making progress through the y-axis?
>> Alexander Kalinin: The priorities are kept per window, and windows are generated dynamically, basically.
>>: [inaudible]?
>> Alexander Kalinin: So basically, as we go further and further [indiscernible], for instance, and [indiscernible] some data, we actually improve the quality of our estimations, since windows can overlap and we generate more and more windows, so we have a current queue of promising candidates. Since we bring in some data, we also improve the quality of our estimations to try to actually ...
>>: Is that the dynamic nature of the queue that makes your prefetching problem hard?
>> Alexander Kalinin: Yes, because we don't want to enumerate everything at the beginning, basically. Plus, we don't know the order in which we are going to read our windows, because in some sense the initial ordering depends on the sampling, and then we keep changing our estimations and changing our utilities. But that's a very good point. We didn't do any experiments with that, but we could try to create the whole queue beforehand and maybe come up with an optimal prefetching order; that's a good point, we didn't think about it at the time.
And this is a small graph which shows why we don't want to use something like PostgreSQL with a sequential scan. PostgreSQL here is the green line, and you can see that in some sense it's fast: the sequential scan goes through everything, but you output everything after one hour, for instance, which is not a good way to go. The static approach is where we do some prefetching, but a really small amount. You can see that we can identify the results quickly, well, here actually not that quickly because of the prefetching, and then we start going down because we perform too many reads: too many candidates incur too many reads and the overhead starts piling up. At the same time, with the adaptive prefetching, the first result, I think in most cases, comes up in 10 or 20 seconds despite the total time being one hour, for instance. And here you can see that eventually we don't lose to PostgreSQL that much, because again, we tailor the prefetching to the current situation. Yeah?
>>: So one thing about the results: as you might know, there are a lot of optimizations that can be done with sequential scans. [inaudible] PostgreSQL doesn't have a lot of those optimizations. For example, you could build a [inaudible] representation and scan all the complex data, and assuming a standard [inaudible] or a few hard drives spinning, those [inaudible] gigabytes of data in one scan, you don't need to scan it again and again and again. Just one sequential scan can be done in a matter of less than a minute; I think we did the math, at 150 megabytes per second of disk bandwidth [inaudible] and [inaudible]. So coming back to the question: I understand your [inaudible] to give interactive results [inaudible]. If a lot of the queries are being terminated [inaudible] after seeing some odd [inaudible] results, then you gain a lot by giving the interactive results. But in the quest for getting the interactive results, a question to ask is how much are you leaving on the table? You could have done a very efficient, fast sequential scan of the data and gotten the entire result, that may not [inaudible] boundary, in a matter of minutes, probably.
>> Alexander Kalinin: Well, that's true, but in some sense we also depend on the database's optimizations, right? We perform the same queries as well. For example, if PostgreSQL doesn't perform complex optimizations, we also abide by that. So it would be interesting to repeat this experiment, because this experiment is kind of an old one; we didn't have access to SDSS or even the VM simulator or something like that, right? It would be really interesting. But again, if we moved to a more complex query optimizer, or if we moved towards more efficient disks, more efficient storage, we would also benefit, because we also perform a lot of SQL queries, and especially for prefetching we retrieve a lot of data, so we would benefit from that tremendously. So maybe the gap between doing the sequential scan and outputting the results really quickly would be smaller, but I think the gap will be super [indiscernible].
And this relates to the earlier question about our architecture: it was actually a distributed solver which sat on top of the DBMS and performed some search distribution across workers. There is not much to talk about here; I just wanted to point out that this was a very custom thing, a very custom solver, a very custom solution.
We learned some lessons from Semantic Windows. First of all, Semantic Windows supports window queries, which basically imply simple constraints, and it's very hard to extend: if you want to add more types of constraints, it's hard to do, because it's hard to extend this kind of solver; it is very specific. We have to read everything because, as I said before, sampling doesn't allow provable pruning based on content, so in some sense the completion time, if you really want to find everything, has a lower bound, which is the sequential scan. And we didn't perform any search balancing here. We distributed the search space around, but at some point what might happen is that some nodes, due to data skew for instance, finish their part of the search space really quickly and just sit idle, doing nothing.
And I believe another issue is that we tied the search and data partitions together, because we just distributed the data and each worker looked only at the corresponding data partition, which was not the way to go. Basically we want to distribute search and data on different levels, which would allow, I think, much more interesting balancing.
So I want to move to our main system, I guess, Searchlight, which is the actual implementation of this CP plus DBMS idea. We moved to array databases, mainly SciDB, because scientific data sets can be stored there efficiently. For example, we store SDSS there as a two-dimensional array in [indiscernible] and declination. If you think about something like time series data, again, this is also a bunch of arrays, and so on. And we took a traditional CP solver, the one from Google's Or-Tools, Operations Research Tools from Google, which is open source and freely available. That makes our queries CP-based: we don't use SQL, AQL, or whatever the database gives us; we formulate our data exploration queries in the form of constraint programs.
Constraints still primarily deal with aggregates, because since we deal with these kinds of regions, the user specifies aggregates. In principle, subsequence matching can also be seen as an aggregate, because it deals with the distance between two sequences. But we actually extend this: these aggregate functions, like average brightness, become building blocks. The system is extensible in that it allows us to define new types of constraints based on these building blocks. For example, I might easily express the difference between the window itself and its neighborhood and compare them somehow, or I can easily define the difference of aggregates between two different windows in my search space. So constraints become more elaborate, more interesting.
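For instance, a constraint like 'the window is noticeably brighter than its neighborhood' might be composed from an avg() building block roughly as follows; the helper names, the margin, and the 0.2 gap are made up for illustration and are not the Searchlight API.

```python
# Aggregates as building blocks: avg() over a region is the primitive, and a
# user-defined constraint combines several such calls.

def avg(array, x, y, lx, ly):
    vals = [array[i][j] for i in range(x, x + lx) for j in range(y, y + ly)]
    return sum(vals) / len(vals)

def window_vs_neighborhood(array, x, y, lx, ly, margin=2, gap=0.2):
    # Expand the window by `margin` on each side, clipped to the array bounds.
    # For simplicity the expanded region still includes the window itself.
    n, m = len(array), len(array[0])
    ex, ey = max(0, x - margin), max(0, y - margin)
    elx = min(n, x + lx + margin) - ex
    ely = min(m, y + ly + margin) - ey
    return avg(array, x, y, lx, ly) - avg(array, ex, ey, elx, ely) > gap
```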
And I want to start from the result of exploring the alternatives. We have two alternatives here, right? As I said at the beginning, exploration has two sides to it, so basically we have two alternatives if we want to perform such exploration queries. The first alternative is what we just call CP here, which is basically just taking a constraint programming solver and making it work with out-of-core data. This is a bad way to go, because constraint programming solvers don't work with out-of-core data at all: they assume the data fits into memory and that constraints are quite inexpensive to check. They don't deal with hardness of computation; they deal with large search spaces, not large data sets.
But we did this for the comparison. In this case CP is actually put really close, inside the SciDB engine. We don't perform any special optimizations, but it sits close to the data, so we don't incur any unnecessary serialization here for the CP approach. And SciDB is the other alternative, where you try to express these kinds of window queries using SciDB's language, AQL, the Array Query Language. It's not possible to do for all the queries we support, but for some queries it is possible. And when we tested with the large search space, and if I remember correctly the large search space was about 10 to the power of 8 or 9 windows, maybe more than that, for some data, for some search spaces, it's really hard to finish in a reasonable amount of time, because it results in very expensive computation.
So in this experiment we decided to limit the execution to one hour, and the use case is basically like that: the user gives this query and says, find me something in one hour, at least something. And you can see that with Searchlight, using this combination of technologies, we can actually identify the first result in about five seconds, and then there are delays between subsequent results, because after we find one result we need to find another. So what I want to show here is that Searchlight, the main goal of our system, is to output results quickly and to keep outputting results as interactively as possible.
At the same time, CP and SciDB didn't produce any results in one hour, and the reason is that SciDB employs the same traditional database approach. Here, for example, you can define your query as a kind of moving-window operator: you just move your window through your data, but it does so in a very direct way, from the top left corner and so on through your data. So if your results are not near the top left corner, for instance, you might not actually be able to get to the end of the data set in one hour, and at the same time you might not find anything. When you look at CP, CP is smart about building the search space, but it performs a lot of these constraint evaluations, not only at the leaves of the search tree: it also has to make pruning decisions, so it makes a lot of constraint computations at the internal nodes of the search tree, and that creates a lot of overlapping computations, a lot of overlapping data accesses to the database, which cannot be optimized well because CP solvers are not aware of such optimizations.
For the small search space we can actually finish the query much faster; at least the CP and SciDB approaches are able to do that. And again, they have the same problems. Even if you are able to finish the query, by using pruning it is possible to steer the search in the required [inaudible] and find the results quickly, and we can also prune a lot of unnecessary data by using the constraint programming solver. We can actually finish the query in five seconds, while the alternative approaches get stuck for quite a while, not only in finding the first results but also in query completion times.
So the main idea behind Searchlight is the solve-validate approach. Since CP solvers are very bad at accessing out-of-core data, we don't want them to work with out-of-core data. So instead of going to the data immediately to assess all these constraints, we precompute a synopsis array, as we call it, which contains some information allowing us to answer these aggregate functions with bounds. For example, if I'm looking at the average brightness of a region, using the data I might say the average brightness of the region is 10, for instance, right? But with the synopsis we might say that it is from 5 to 15, for instance, or something like that. So the synopsis allows us to answer aggregates with these intervals, but the interval is one hundred percent correct in the sense that the average is not going to be less than 5 or greater than 15; it is definitely going to be from 5 to 15, for instance. This allows us to make quick estimations at the internal nodes of the search tree and prune unnecessary parts of the search tree, and the data with them.
And we assume that the synopsis actually fits into main memory. It doesn't have to; the system will work without that, but it will really benefit if the synopsis fits [indiscernible], and the synopsis is considered quite small. The caveat is that since we produce these kinds of bounded estimations instead of real answers, we might have false positives. So in this case the CP solver is not producing solutions, as in the traditional case; it produces candidate solutions. Again, we guarantee that we don't have false negatives, but we might have false positives, and that's where the second component, [indiscernible], where the data comes in, which validates these candidate solutions based on the real data. It won't go to the synopsis, although it might as an optimization; in general it will go to the original data array to verify the constraints.
And this is actually a very general thing to do, because the validator is not something custom-written for every constraint. We don't say, for this constraint type do this, for that constraint type do that. It actually also employs a CP solver to validate the constraints, because validation is also a CP problem: you have decision variables, you have constraints, and a solution is just an assignment of the decision variables. So you can take a solver, assign the decision variables, and the solver will automatically check the constraints. What it means for us is that we can just clone the model, and by the model I mean the decision variables and the constraints from the main CP solver, and do the validation, again, with the solver itself.
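In spirit, the split looks something like the sketch below; the helper names are assumed, and in the real system both halves are CP solvers that share a cloned model of variables and constraints.

```python
# Solve/validate split: the solver side checks constraints against synopsis
# bounds and may emit false positives; the validator side re-evaluates the
# same constraint on the real data and keeps only true solutions.

def solver_side(candidate_regions, synopsis_bounds, threshold):
    candidates = []
    for region in candidate_regions:
        lo, hi = synopsis_bounds(region)       # bounded answer, e.g. avg in [5, 15]
        if hi >= threshold:                    # cannot be ruled out: a candidate
            candidates.append(region)
        # if hi < threshold, the region (and, in the real solver, a whole
        # subtree of regions) is pruned without ever touching the data
    return candidates

def validator_side(candidates, exact_avg, threshold):
    # Same constraint, evaluated on the real array; filters out false positives.
    return [r for r in candidates if exact_avg(r) >= threshold]
```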
>>: So does it mean that if your constraints involve some [inaudible] function like brightness, you need to know that up front? Because it looks like the synopsis array [inaudible] compute those aggregates based on this [inaudible]?
>> Alexander Kalinin: So what happens here is that, I will talk about the synopsis later, but we assume that for every type of function we have a different type of synopsis. For traditional aggregates, yes, we need to compute some aggregate information, like min, max and count, for parts of the search space.
>>: So you cannot [inaudible]?
>> Alexander Kalinin: Yes and no, in some sense. No if we want to add another function: for example, let's say I want to perform subsequence matching and I want to add a function like the distance between two sequences, right? In this case we have to pre-compute something to be able to assess these distances, so we would create some traditional index, as in subsequence matching. And yes, in the sense that these are just building blocks. I can take these aggregates and combine them into different constraints. For example, if I want to perform some kind of anomaly search, I'm looking at a window and at the neighborhood of this window, I can measure the value for the window and the value for the neighborhood using aggregate functions, and I can combine these aggregate functions in different ways. But yes, we have to know what kinds of functions you are going to use, because the synopsis is based on these kinds of functions.
>>: Question. So you support ad hoc arrays, and the [inaudible] use [inaudible] any type of array they want [inaudible]? So I was wondering, [inaudible] uses just one extreme of value after [inaudible] for specific kinds of queries, whereas here, what are the assumptions that you are making about the workload [inaudible]? Is it just the functions, or is it that you want to analyze those arrays and you have some idea how you want to optimize [inaudible]? Because there are a few things that you mentioned: one was that for performance you want the [inaudible] synopsis to be small and fit in memory; for a given type of function you want to create a different type of synopsis array. So there probably is something missing, which is in this space of the ability to support ad hoc queries versus making assumptions about the workload and preparing [inaudible]. Can [inaudible]?
>> Alexander Kalinin: Right. So regarding the previous question, I was against indexing, I guess, because even if you take an average, like average brightness, there are different ranges of values that you can explore, so one user is interested in one range of values and another in another, and so on. Eventually we would have to index every possible window, because we would need some kind of inverted index from brightness to candidate windows. Here we make a kind of weak assumption that we know only the functions. I know we are going to use just average brightness in our query, and this is it; we don't need to know anything else. So whatever range the user throws at us, or whatever the combination of functions is going to be, we are going to handle it. So we don't tailor the synopsis to values: the synopsis type is tailored to the type of functions we are going to use, but it is not tailored to the specific ranges of values we are looking for.
>>: And even in one of the slides you had [inaudible] DB data, [inaudible], I think that was the previous one. So [inaudible] can you give an idea of how big a synopsis you're creating and [inaudible]?
>> Alexander Kalinin: Yeah. So I will talk about the synopsis a little bit later. As for the size, I don't quite remember; we actually stored different synopses at different resolutions, but I think we can get pretty good times by using something like 100 megabytes per 120 gigabytes of data. It doesn't have to be much. The more you store, of course, the better estimations you get, but I'll talk about this; it's very interesting. So this is basically how we do it: the validator then checks the candidate solutions against the data and filters out the false positives.
So, as I said, the synopsis is a kind of general concept here. I don't actually like to call these approximate answers; the synopsis just creates bounds for aggregate calls, but these lower and upper bounds have 100 percent confidence, so we don't have false negatives for sure. And, as I said, they differ for the different types of functions we use. We use aggregate grids for arrays. If we used relational data, which we didn't explore, we could do something like [indiscernible]-resolution aggregate trees, where the [indiscernible] trees would be trees with additional information, or we could use something like DFT-based indexing for data sequences, and so on, so we are not [indiscernible] here.
And this is an example of an aggregate synopsis. Let's say this is the original array, [indiscernible] array; we might have missing values. What we do is divide it into a grid, and the size of the cells is a [indiscernible] parameter. Here we divide it into cells of size two by two, and for each cell we store some min, max, sum and count information. This is actually based on the [indiscernible]-resolution aggregate tree work, but on a regular grid we don't need this kind of hierarchy; we can get away with flat aggregation, in some sense. But this is going to be the synopsis. The resolution of this synopsis is said to be two by two, because the cells have size two by two.
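A minimal version of building such a synopsis might look like the following; the names are illustrative, and the real layout over SciDB chunks is more involved.

```python
# Aggregate synopsis: divide the array into cells of a fixed resolution and
# keep (min, max, sum, count) per cell, skipping missing values (None).

def build_synopsis(array, cell):
    n, m = len(array), len(array[0])
    syn = {}
    for ci in range(0, n, cell):
        for cj in range(0, m, cell):
            vals = [array[i][j]
                    for i in range(ci, min(ci + cell, n))
                    for j in range(cj, min(cj + cell, m))
                    if array[i][j] is not None]
            if vals:
                syn[(ci // cell, cj // cell)] = (min(vals), max(vals), sum(vals), len(vals))
            else:
                syn[(ci // cell, cj // cell)] = (0, 0, 0, 0)   # empty cell
    return syn
```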
Then what happens is that the solver comes for estimations, here for the average function. The problem, as you can see if you have such a wide region to estimate, is that for this cell we know the exact sum and count, so its contribution to the average is easy to compute; but when we start intersecting these other synopsis cells only partially, we don't know the underlying distribution within the cell, we just know the min, max, sum and count. So what we do is come up with an upper- and lower-bound estimation; that's why instead of real answers we get bounds. You can see that with higher- and higher-resolution synopses we should get better and better estimations, because we will cover these parts of the region more exactly. But that's the idea: for each such partially covered cell we come up with upper- and lower-bound distributions, candidate distributions if you will, which might not be the real distributions but are guaranteed upper and lower bounds. So we guarantee that if we say from 5 to 15, it's not going to go beyond that interval.
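Continuing the sketch, a bounded average over a window that only partially covers some synopsis cells could be computed as below. The bound used for partially covered cells is the simple min/max one, which is looser than the distribution-based bound described in the talk but still never excludes the true value; for simplicity it assumes no missing values inside the window.

```python
# Bounded average from the synopsis: exact sum/count for fully covered cells,
# [k * min, k * max] for the k elements falling into partially covered cells.

def avg_bounds(synopsis, cell, x, y, lx, ly):
    lo_sum = hi_sum = 0.0
    count = 0
    for ci in range(x // cell, (x + lx - 1) // cell + 1):
        for cj in range(y // cell, (y + ly - 1) // cell + 1):
            cmin, cmax, csum, ccnt = synopsis[(ci, cj)]
            # number of window elements that fall into this synopsis cell
            ox = min(x + lx, (ci + 1) * cell) - max(x, ci * cell)
            oy = min(y + ly, (cj + 1) * cell) - max(y, cj * cell)
            k = ox * oy
            if k == ccnt:                   # fully covered: exact contribution
                lo_sum += csum
                hi_sum += csum
            else:                           # partial coverage: bound the contribution
                lo_sum += k * cmin
                hi_sum += k * cmax
            count += k
    return lo_sum / count, hi_sum / count

# usage: lo, hi = avg_bounds(build_synopsis(array, 10), 10, 3, 7, 12, 5)
```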
We moved beyond that, because you can see what happens here: this [inaudible] poor coverage, we cover only a small portion of the synopsis cell. So we could go and use the traditional pyramid-based approach, right? We have a pyramid of synopses at different resolutions, and the trade-off is that when we move towards a really low-resolution synopsis like this one, we can estimate everything very cheaply, but the estimations are going to be really, really loose, which basically will not allow us any pruning. Whereas if we move to a higher-resolution synopsis, we incur a lot of computation overhead, because estimations are not cheap and we perform a lot of them. Remember, the CP solver doesn't only perform such estimations for candidate solutions; it also performs them frequently at the internal nodes of the search tree, because it has to prove that it can prune the corresponding part of the search tree.
So we ended up using a simple heuristic, a kind of dynamic approach; in Searchlight we can do everything dynamically, I guess. We look at the current state of what is happening during query evaluation. For example, if the solver asks us to estimate this region, we look at the coverage, and we can say this cell is covered fully by the region, so we can just take the sum and count from there; this cell is covered fine; but this one is covered really poorly, because we only touch a really small fraction of the synopsis cell, and for the synopsis cell we know only min, max, sum and count, so the estimation is going to be really poor. So for such poor coverage, only for this small portion, for this red cell, we go to a higher-resolution synopsis, and we improve our estimations until we have good coverage of all the synopsis cells.
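The coverage heuristic itself can be sketched separately from the estimation code; the 50 percent threshold below is an assumed parameter, not the value used in Searchlight.

```python
# Decide, per overlapped coarse cell, whether its bounds are good enough or
# whether that piece of the window should be re-estimated on the next,
# finer-resolution synopsis.

def refinement_plan(overlaps, resolutions, min_coverage=0.5):
    """overlaps: list of (covered_elems, cell_elems) for each coarse cell the
    window touches; resolutions: cell sizes from coarse to fine, e.g. [100, 10]."""
    plan = []
    for covered, total in overlaps:
        if covered / total >= min_coverage or len(resolutions) == 1:
            plan.append(resolutions[0])   # coarse cell is covered well enough
        else:
            plan.append(resolutions[1])   # poorly covered: refine this piece
    return plan
```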
We performed a lot of experiments on this feature, and it's actually worth it, right? You can look at different resolutions; here it is, I think, a 100,000 by 100,000 array, I think it's synthetic data, and you can come up with different resolutions here, and the right column actually combines the three in this dynamic approach. You can see that sometimes, if you use really coarse estimations, you might not be able to finish the query in a reasonable amount of time. At the same time, you can see that the sweet spot here is the 100 by 100 synopsis: it generates a lot of candidate solutions, but at least it isn't stuck in expensive computation. And if we combine the approaches we are able to do better than that; here it's not much better, but still we are able to cut at least one minute from the best synopsis. And what is actually good about this is that the users don't have to pick the synopsis. Otherwise they would have to pick the correct synopsis; here we just let the system decide, and it makes a very good choice.
At the same time, for a small search space, we have another sweet spot, because here the 10 by 10 synopsis performed better: the search space is small, so the number of candidates is not that large, and the 10 by 10 synopsis could get away with performing fewer computations, so it was the best choice. But again, if we gave this choice to the users, the users would have to decide, and we don't want that. So you can see the dynamic approach adapts itself to the situation, which I think is really neat.
So Searchlight is a distributed system. As I said before, we wanted to distinguish between two layers. The first layer is the search level, where the search space is handled by a bunch of CP solvers that work independently. We divide the search space between the different solvers and let them explore their disjoint parts; I'll talk about balancing on the next slide. On the other level we partition the data for validation, so we have a bunch of validators corresponding to different data partitions, and they work completely independently of each other. The CP solvers produce candidates and send them to the appropriate validators, so if you have a cluster you can put them on different machines and do whatever you wish; we can have a different number of them depending on what kind of resources you have, and so on. This actually allows us to partition the search space much more freely, because, for example, we can make balancing decisions at the CP solver level without considering the data.
So the challenges here are how to partition the search space and how to partition the data. And the next question, which was actually very interesting when we started doing this, is where to send candidate solutions. When a solver produces a candidate solution, we don't know what data is going to be needed for its validation. We don't look inside constraints; we don't look inside functions. We don't know what data we need to touch, because the CP solver touches only the synopsis; that's fine, but we don't know what data the validators are going to touch. In particular, we are interested in which data pages, or inside SciDB which data chunks, we will need, because these chunks determine to which validator we want to send the candidate: we want to send it close to the data, to the validator that has most of the data needed for the validation. So that is, I guess, the third challenge: how to do that.
So let me start with search partitioning, where we perform static partitioning at the initial level, which is a very simple approach. Since hotspots tend to be contiguous, we try to cover them with multiple solvers so that solvers won't sit idle. The parameter here is the size of the slice; it's really simple. So this is the initial phase. What might happen, though, is that a hotspot is covered by only two solvers while solvers three and four sit idle at some point, or maybe we got the slice size wrong, and so on.
>>: [inaudible]? So how do you know which set of attributes is [inaudible] hotspots?
>> Alexander Kalinin: So [indiscernible] the attributes, I guess. A hotspot here is a part of the search space that has a lot of candidate solutions, that is promising. We don't know where it is yet, so this is just a heuristic: we try to guess where the hotspots are going to be, and that's why we do this kind of [inaudible] distribution, to cover contiguous hotspots with multiple solvers. We don't know where the hotspot is; we may get it completely wrong. That's why we have this second stage. This is a kind of embarrassingly parallel solution in some sense: you just create a lot of little slices and redistribute them across solvers.
>>: Is it just [inaudible]? Would we need replication for correctness at some point?
>> Alexander Kalinin: [inaudible]?
>>: Sometimes your synopsis may [inaudible]. You can get [inaudible].
>> Alexander Kalinin: Right. Since the synopsis is small, we can replicate it across all the machines participating in the search; 100 megabytes is easily replicated. What happens in the implementation is that it is just pulled transparently: we don't replicate it beforehand, but the synopsis array is also handled by SciDB, and when a solver needs some of it, it pulls it from another machine. Eventually this means the synopsis is going to be replicated on all the solver machines, so there might be some data movement, but it's [inaudible] going to be really small.
Of course, we might get this [indiscernible] distribution wrong. So what we have is dynamic search balancing, which is a fallback solution that tries to remedy the situation. If a solver has just finished its part of the search space, it can make itself available to a busy solver, and the busy solver just moves one of its subtrees over. Moving subtrees is actually very inexpensive, because every solver knows the model, and the model comes from the query itself; so it's basically: take this variable, take an interval of its domain values, and treat that as your own search tree. Then the left solver backtracks from there and continues with another part of the search space, the helper explores the subtree it received, and so on. It's a very dynamic thing; it depends on which solvers are struggling right now and so on.
If you are familiar with constraint programming you can see this is very similar to [indiscernible] in constraint programming solvers; they have been doing similar stuff. And it's a very dynamic solution because we might have different helpers over time. If you have a lot of solvers and a large search tree still to explore, it's still going to be distributed dynamically across many solvers, and the balancing is going to continue throughout; we're trying to keep all the solvers busy during the computation.
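A toy sketch of that handoff, assuming a search slice is represented simply as per-variable domain intervals; the actual solver state and coordination messages are of course richer than this.

from dataclasses import dataclass

@dataclass
class SearchSlice:
    # Remaining domain interval per decision variable, e.g. region x-coordinate.
    domains: dict  # variable name -> (lo, hi), inclusive

def split_slice(busy: SearchSlice):
    """Give away roughly half of the domain of the widest unfinished variable."""
    var, (lo, hi) = max(busy.domains.items(), key=lambda kv: kv[1][1] - kv[1][0])
    if hi <= lo:
        return None                       # nothing left to share
    mid = (lo + hi) // 2
    busy.domains[var] = (lo, mid)         # busy solver keeps the lower half
    stolen = dict(busy.domains)
    stolen[var] = (mid + 1, hi)           # helper explores the upper half
    return SearchSlice(stolen)

busy = SearchSlice({'x': (0, 999), 'y': (0, 99)})
helper_work = split_slice(busy)
print(busy.domains, helper_work.domains)
# {'x': (0, 499), 'y': (0, 99)} {'x': (500, 999), 'y': (0, 99)}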
Here is a small example of what's going on for static and dynamic partitioning; static partitioning depends on the number of slices. You can see what might happen with the slices: we have eight solvers, but four of them finish almost immediately, you can barely see this really small bar here, so some solvers finish really fast and sit idle. If you get the number of slices right you might get better balancing, but again, you have to guess the slice size; if you create too many slices it brings some overhead with it, and it still might not work for all the queries. And with dynamic partitioning it's not ideal either, because we still have the granularity of these internal nodes and it depends on which node we end up handing off, but we try to keep all the solvers busy, which I think we do here.
Another thing, as I said, is data partitioning. We don't do anything really special here: we divide the data statically between validators, and we don't perform any data [inaudible]. If a candidate needs some data for its validation we just fetch it transparently, like a database would. We do support some data transfer, which I probably won't have time to talk about, for when we really need to redistribute data between validators. We try not to do that, because we are trying to bring queries to the data, in the sense of bringing candidate solutions to the appropriate validators. But we might have a situation where one validator gets all the candidates because its partition is really promising, it's the part of the data that contains all the results. In that case validators two and three might sit idle, in which case we will perform some redistribution, and then the data partitions might not stay as even as shown, because we might move some candidates from validator one to validator two and redistribute some data from partition one to partition two. So we do perform that as well, but we try to avoid it. I just want to point out that data partitioning is also dynamic, but only in critical situations, because search balancing does not require moving data, so it's really cheap, whereas data repartitioning might not be that cheap.
The third problem was determining where to send candidates. As I mentioned at the beginning, we try to be really general: we don't want to look at the functions or the constraints and try to parse out which chunks we are going to read. As I said at the beginning of the Searchlight portion of the talk, the validator also runs a CP solver; it validates candidates by using a CP solver. So instead of validating a candidate against the real data right away, when we produce a candidate solution we first perform a simulation of the validation: instead of going to the real data, we just give trivial bounds for the functions without touching the data at all. It's very cheap; it's just CP-based. But in this simulation we can log all the data requests the validation would make for this particular candidate solution. Now that we know which data accesses are going to be required for the validation of this candidate, we move the candidate solution over to the validator that contains most of that data. So we try to avoid moving data around.
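A rough sketch of that routing idea: "validate" the candidate against a stand-in that only records which chunks would be read, then send it to the validator owning most of those chunks. The chunk size, candidate shape, and ownership map are invented for illustration.

from collections import Counter

CHUNK = 1000   # assumed chunk length along one dimension

def chunks_touched(candidate):
    """Simulated validation: log chunk ids instead of reading real data."""
    lo, hi = candidate['region']                     # e.g. a 1-D interval
    return {c for c in range(lo // CHUNK, hi // CHUNK + 1)}

def pick_validator(candidate, chunk_owner):
    """chunk_owner: chunk id -> validator id (static data partitioning)."""
    votes = Counter(chunk_owner[c] for c in chunks_touched(candidate))
    return votes.most_common(1)[0][0]                # validator with most local data

chunk_owner = {c: (0 if c < 50 else 1) for c in range(100)}
candidate = {'region': (48_500, 52_300)}             # spans both partitions
print(sorted(chunks_touched(candidate)))             # [48, 49, 50, 51, 52]
print(pick_validator(candidate, chunk_owner))        # 1 (owns 3 of the 5 chunks)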
We perform a bunch of other optimizations which I probably don't have time to talk about. We actually use the synopsis for validations: in some cases, if our candidate region [indiscernible] is completely aligned with the synopsis, so there are no partial intersections with synopsis cells, we can use the synopsis itself to validate aggregate functions, which is a very cheap way to do it. Also, to avoid thrashing, we try to batch close candidates together. Since we have this complete division between the solver and validator layers, candidate solutions can come from different solvers in arbitrary order, so we try to avoid performing a lot of reads from different parts of the data file: we combine candidates that are close to each other so that they read approximately the same data, from the same locality.
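A small sketch of that batching idea, grouping pending candidates by the chunk where their reads start so one pass over a chunk can serve several of them; the chunk size and the grouping key are illustrative assumptions, not the system's actual policy.

from collections import defaultdict

CHUNK = 1000

def batch_by_chunk(candidates):
    batches = defaultdict(list)
    for cand in candidates:
        anchor = cand['region'][0] // CHUNK      # crude locality key
        batches[anchor].append(cand)
    return batches                               # each batch is validated together

pending = [{'region': (48_100, 48_900)},
           {'region': (48_200, 49_400)},
           {'region': (90_000, 90_500)}]
for anchor, group in batch_by_chunk(pending).items():
    print(anchor, [c['region'] for c in group])
# chunk 48: two nearby candidates batched; chunk 90: the far one on its own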
We also perform solver-validator balancing, which is basically a redistribution of CP resources between the two layers, because again we try to be dynamic here as well; it really depends on what's going on. At the beginning we don't have any candidates to check, so we don't want to direct resources to the validators; we direct all CP resources to the solvers. At some point we might have a lot of candidate solutions, in which case we might pause the search and divert more CP resources towards the validators, basically by starting more threads; that's what we do. And we try to utilize idle time as much as possible: for instance, when we redistribute the search space at the solver level there might be some idle time on the solvers, and even then we try to perform more validations and so on. And again, when we have validated all the candidates, or the number of pending candidates is really small, we divert more resources back to the search. And, as I said, we perform some candidate relocation, which is the redistribution of candidate solutions between validators, but we try to avoid that because it might result in large data movement.
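A toy sketch of such a balancing policy: shift threads toward the validators when the candidate queue grows and back toward the search when it drains. The thresholds and thread counts are made-up knobs, not Searchlight's actual values.

def rebalance(queue_len, total_threads=8, high=1000, low=50):
    if queue_len > high:        # many pending candidates: pause search, validate
        validator_threads = total_threads - 1
    elif queue_len < low:       # queue nearly empty: push the search forward
        validator_threads = 1
    else:                       # in between: split the resources evenly
        validator_threads = total_threads // 2
    return total_threads - validator_threads, validator_threads

for q in (0, 500, 5000):
    print(q, rebalance(q))      # (solver_threads, validator_threads)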
This is actually a result for a real data set, the [indiscernible] SDSS. It's 80 gigabytes because that's the real size. It has a lot of attributes, and SciDB supports a kind of vertical decomposition, so we access only the attributes we need; here we access all five spectral attributes, which is why it's about 80 gigabytes of real data. You can see that we vary different parameters affecting the selectivity of the query: region size and magnitude, where magnitude is the spectral characteristic of the regions. We see this as a good starting point, because we actually find results really fast and the total completion time is also acceptable. Compared to SciDB, for instance, SciDB cannot perform such queries at all: either they are not expressible at all in AQL, the regular query language, or you can express them by using complex scripts, like in [indiscernible], where you ask several queries and combine the results yourself, which is really prohibitive. I think we ran some of those queries for many hours and still hadn't gotten any results.
>>: Is this [inaudible] benchmark? [inaudible] queries that you made up?
>> Alexander Kalinin: These are our queries. We're basically looking for regions satisfying some ranges of spectral values. This is a good starting point for interactive exploration. So this is all for the main work, I guess. We have some ongoing work, and if I have time I'll go into the directions right now. The first direction is proving the generality of Searchlight, which is exploring new data sets and constraints. For instance, we wanted to show that, as I said at the beginning, something like subsequence matching is also a similar type of problem, and we can solve all these problems using a single system instead of a bunch of point solutions spread around. This direction is basically completed: it was actually really easy to incorporate subsequence matching inside this framework and it works really well, but results are still pending. So [indiscernible] is readily expressible as a constraint program in the system, and our system can handle such queries really well. Of course it relies on the existing research on indexing of time sequences, on computing all these DFT-based representations.
The second thing is query relaxation, which is basically the case where users do not quite know what they're looking for, or they get the constraints wrong. This is from the [indiscernible] data set, which contains time series data for different ICU patients, and the queries we are looking at here involve subsequence matching. And they can go beyond subsequence matching: subsequence matching is only one type of constraint, and we can add other types of constraints to the same query. Not only can you look for a subsequence of a time series similar to a specified one, you can also, in the same query, specify additional constraints, for instance that the neighborhood must have particular properties, and so on.
We introduce new functions there, like distance-based functions, and we introduce synopses based on the existing work on DFT indexing and so on. As for query relaxation, which is ongoing work, we use it in the traditional sense: what if the user doesn't get enough results, or doesn't get the constraints right? What can we do about this? The problem is a little bit different from relational systems, I believe, because in relational systems you can often estimate the cardinality of your result, or at least use an index to understand how you want to change your constraints. Here that's harder to do, because indexing is hard and, since we have these general constraints, it's hard to estimate the cardinality of the final result. The main idea, which I recently implemented but have no results for yet, is that we watch dynamically what happens: if the solver fails at some point during the search, it fails for a reason. We know why it failed, which constraints failed and how they failed. So we use this information to relax the original constraints, replay the fails, and see what's going on.
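A much-simplified sketch of that behavior: while exact matches are scarce, also emit candidates that violate a constraint by a small margin so the user sees something early. The real mechanism derives the relaxation from the solver's failure information rather than a fixed slack, so the slack value and the streaming interface here are invented.

def explore(candidates, threshold=10.0, slack=6.0):
    """Original constraint: avg < threshold."""
    found_exact = False
    for cand, avg in candidates:
        if avg < threshold:
            found_exact = True
            yield ('exact', cand, avg)
        elif not found_exact and avg < threshold + slack:
            yield ('relaxed', cand, avg)      # near-miss, shown only early on

stream = [('r1', 15.0), ('r2', 12.0), ('r3', 13.0), ('r4', 9.5)]
for kind, cand, avg in explore(stream):
    print(kind, cand, avg)
# relaxed r1 15.0 / relaxed r2 12.0 / relaxed r3 13.0 / exact r4 9.5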
Again, the idea here is interactivity: we want to give users results fast. So if a user asks for average less than 10, we might be able to output results with averages of 15, 12, 13 at first. If there are results below 10 we will output them eventually, but at least at the beginning, if exact results are hard to find, we will output results that are close but don't quite satisfy the constraints. This is not an approximate answer in the sense of approximate values; rather, we might start with close rather than exact results, and quality will improve later. So it's similar to online aggregation, I guess, when you think about it. And this is it, I guess. That's all I have for today.
>>: So how do you compare your work with [inaudible] that has happened to combine
[inaudible] databases or like pushing [inaudible]? So do you have any thoughts about that?
>> Alexander Kalinin: I mean, our queries are a little bit different, but it's the same idea. We basically have some-
>>: But you're combining the constraint solver with the database, because constraint solvers cannot handle a lot of data and the database cannot handle constraints, right?
>> Alexander Kalinin: Correct.
>>: [inaudible] plus Hadoop or DBMS, those also try to address a similar space, because [inaudible] everything are very rich [inaudible] operators where they're [inaudible] data. So have you looked into that line of work to see [inaudible] to this approach, or have you considered pushing your constraints into the [inaudible] database as well?
>> Alexander Kalinin: In this framework we haven't looked closely at those types of integration versus our integration yet. But I want to point out that I don't think we can actually express such queries with those approaches, so I guess we are going in a similar direction of combining these things. And about putting the solver there: the solver in Searchlight actually works inside the database, because SciDB allows users to define user-defined operators which are part of the query plan. We don't perform any specific optimization, but it still sits inside: we don't serialize data, we don't move data around, so it sits really close to the data. It's actually lightweight in some sense, because it uses the SciDB infrastructure to do everything other than the search itself, data management, networking, and so on. So we are also moving in the same direction of getting close to the database and trying to combine them.
And one piece of ongoing work, which I probably won't be able to finish, is to think about how we could create multiple search operators inside the query plan. Right now, if you have a search query we just create a single search operator; maybe we could divide the search query into multiple operators that work together, or push some of the constraints inside the database to create more elaborate query processing trees, but we haven't moved to that yet.
>>: What about the limitations of the queries? What types of queries does your system [inaudible] handle? As far as I understood, we are kind of looking for some area of objects which [inaudible] the [inaudible] function. Is it limited to that, or can we, for instance, find the brightest [inaudible], or can we add a pair of ranges with some [inaudible] function?
>> Alexander Kalinin: You can definitely find the star, for instance, because you can describe it as a couple of decision variables. I do want to point out that for point queries like finding a bright star it is actually possible to build an index, because you can index every possible object in the database; the number of objects in the database is much smaller than the number of all possible regions over those objects. Our system can handle that, it's just that better solutions might be available for such point queries. Again, the star has coordinates, so you can describe it via decision variables, and then you have a constraint for brightness, a constraint for its attributes, so it can easily be handled.
As for pairs of regions, you can do this. It might not be as efficient as we would like, but you can. If you want to find, for example, two regions that are similar to each other, like the queries I talked about, say two regions A and B where the difference between their brightness is small, you can describe it as a constraint program: you describe region A as one set of decision variables, region B as another set of decision variables, and it's easy to describe the constraint, the average of A minus the average of B is less than something. So you can easily combine them.
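To illustrate the formulation (not Searchlight's actual search), here is a naive brute-force version of that two-region query over a toy 1-D array: each region is a (start, length) pair of decision variables, region A carries a brightness-range constraint, and the two regions are linked by a bound on the difference of their averages. A real CP solver would prune this search rather than enumerate it.

from itertools import product

data = [3, 9, 8, 2, 7, 7, 1, 8, 9, 2, 6, 7]     # toy "brightness" values
LENS = range(2, 5)                               # allowed region lengths

def avg(lo, ln):
    return sum(data[lo:lo + ln]) / ln

matches = []
for a_lo, a_ln, b_lo, b_ln in product(range(len(data)), LENS, range(len(data)), LENS):
    if a_lo + a_ln > len(data) or b_lo + b_ln > len(data):
        continue                                 # stay inside the array
    if b_lo < a_lo + a_ln:                       # keep A strictly before B
        continue
    if not (6 <= avg(a_lo, a_ln) <= 9):          # brightness constraint on A
        continue
    if abs(avg(a_lo, a_ln) - avg(b_lo, b_ln)) > 0.5:   # the "linking" constraint
        continue
    matches.append(((a_lo, a_ln), (b_lo, b_ln)))

print(matches[:3])                               # first few (A, B) pairs found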
So to answer your first question about limitations: we don't have many limitations, we are just less efficient for these types of queries where you have multiple objects flying around. The only real limitation is that we need to know what kinds of functions you're going to use. If you want to use average brightness, for instance, we can combine such functions in pretty elaborate ways. So if you want to find two regions with some ranges of values, it can definitely be done. If it can be expressed as a bunch of decision variables, constraints, and these kinds of average functions, or other types of functions, we can handle that.
>>: And how does your system scale? Have you thought about that? As far as I understood, the synopsis should be more than [inaudible]. And how about the size of your database? Should it be part of every node? Should it be stored [inaudible] locally, or [inaudible]?
>> Alexander Kalinin: For the synopsis, it works best if you have it in memory. The idea behind the synopsis is that it is much smaller than the data, so we will still benefit either way; if it fits in memory, that's really cool, because we get the most benefit. It doesn't have to fit, but in that case you might have some movement of data, because we need some parts of the synopsis at one point and other parts at another point, as usual in databases, right? But for now we assume it fits in memory and is basically replicated.
As for the original data size, it doesn't matter, because when we validate candidates, again, it's all about validating candidates, right? I see you're surprised, I guess. What I mean is that if you have a lot of data, we assume we can throw a bunch of nodes at it; you can distribute it across different nodes inside a SciDB cluster, and that's basically it. You have a [indiscernible] of candidate solutions which you need to validate, and you just stream them across all these different data partitions. So the problem is no different from the problem of evaluating traditional aggregate queries across large data, in some sense. We don't do anything special about it; if the system can handle aggregate queries efficiently across large data, we definitely benefit from it.