>> Kaushik Chakrabarti: All right. Let's get started here. Good morning everyone. It is my
pleasure to introduce today's speaker, Alexander Kalinin. Alexander is a PhD student at the
Computer Science Department at Brown University; in his thesis work he is focusing on developing data exploration techniques for large scientific data sets. Specifically, he has developed a couple of data exploration frameworks, one called Searchlight and another called Semantic
Windows. I’m sure he is going to talk more about those today. So without any further delay I’ll
hand it to Alexander so he can start his talk.
>> Alexander Kalinin: Right. So I'll be talking about integrated search and exploration over large multidimensional data, which is, as Kaushik said, basically my main dissertation work. And this is joint work with my advisors at Brown, Professor Ugur Cetintemel and Professor Stan Zdonik.
Basically, what we deal with is searching for the interesting within big data, and we assume exploratory analysis tends to be very ad hoc and repetitive, right? Because the question is not well defined and users have a hard time identifying what is interesting, they tend to ask some queries, and then after we find something we go and find something else, because users change their preferences quite frequently, they change their constraints quite frequently.
The interesting is also not only hard to find, because we have a lot of different objects of interest, a lot to choose from, but also hard to compute, because objects of interest tend to [indiscernible] of data under the hood, and I will talk about this more.
At the same time, as I just said, we want fast online results, because we don't want the users to wait for hours for queries to finish, right? We want to identify [inaudible] some results really quickly. And to give a more concrete example of what we are looking for, let's say this is the Sloan Digital Sky Survey, SDSS, data set. It contains information about different stellar objects, objects in the sky, and users might ask, as an example, three types of queries. The first type we call first-order queries: we might want to find celestial regions with some brightness constraints.
The idea here is that we can specify ranges for the shape of the regions, since the user might not be sure about the correct shape, and we specify constraints such as brightness. We might add more constraints as well. And the problem here is a very large search space, because the regions can be anywhere; the user doesn't specify where to look for them. So we have to look here and here, because the regions can overlap, they can have different shapes, they can be anywhere in the data set. This creates a really large search space of possible choices.
Or we might move to something we call higher-order queries, where we look at, let's say, pairs of regions of interest, regions that are tied or linked by some constraints, like maybe they have similar brightness or similar spectral characteristics or something like that. But again, the user specifies what they mean by this [indiscernible].
Or we might go to optimization queries, where basically we are looking for something that maximizes some objective function, like maybe we are looking for the maximally bright region. This is one example, and we believe this is a very general framing. For example, we can go to a more common problem like subsequence matching, which can also, in some sense, be seen as a constraint-based problem, right? Again, this is an example from the [indiscernible] data set, which among other things contains recordings for ICU patients, so we might do something like query by drawing, where we are looking for similar subsequences. It's a well-researched area, of course; what I would emphasize is that it's basically the same kind of query in nature, right? We're not told where to look, so we just have some constraints. Here it might be the distance from the query sequence to the result subsequence, and maybe some other constraints as well; we might add constraints on the amplitude of the signal, or maybe constraints on the neighborhood of the found results, and so on.
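As a toy illustration of reading subsequence matching as a constraint problem, here is a minimal sketch; the Euclidean distance and the extra amplitude constraint are assumptions for the example, not the specific constraints of the data set mentioned above.

```python
# One decision variable (the start position of the match) plus user-defined
# constraints on the matched window: distance to the query sequence and the
# amplitude of the signal inside the window.

def is_match(series, query, start, max_dist=2.0, max_amp=5.0):
    window = series[start:start + len(query)]
    if len(window) < len(query):
        return False
    dist = sum((a - b) ** 2 for a, b in zip(window, query)) ** 0.5   # distance constraint
    amplitude = max(window) - min(window)                            # extra constraint
    return dist <= max_dist and amplitude <= max_amp

# The search space is all start positions; a solver would prune it rather than
# enumerate it as this brute-force check does:
# matches = [s for s in range(len(series)) if is_match(series, query, s)]
```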
So we believe that there are actually two sides to such data exploration problems. The first side is search complexity: the search space is really large, so enumeration is not feasible; we cannot just enumerate all possible regions and check them one by one. Constraints might also be more elaborate. The one before, for example, was average brightness, so we might have aggregates, but we might also have something else, similarities, things more complex than that. Again, we might not only decide to study the region itself
but also maybe the neighborhood around it and maybe compute the brightness of the
neighborhood, the brightness of the region itself, compare them somehow, and we assume
that it might be quite complex logic where the user actually defines it as a part of the query. So
it's not necessarily a predefined range of constraints.
On the other hand, we have what we call data complexity because we still have a large data set
and we have to perform a lot of out of core computation and this might be expensive. What I
mean here is that to check the brightness of a region, for instance, we need to go and retrieve the corresponding objects. And the region might include a lot of stars, planets, whatever, so we have to go and compute all these functions, all these aggregates that the user references. So we might incur a lot of computation overhead.
And what I believe is that these two sides of data exploration are covered really well by two different techniques. Search complexity is covered really well by constraint programming, which is a very general technique that deals with exploring large search spaces efficiently, and it is very customizable; I will talk about this a little bit later. At the same time, data complexity, this kind of out-of-core computation, is handled really well by traditional DBMSs.
So I guess the next question is: can they actually be used? Since we have large data sets, can we use traditional DBMSs for this kind of data exploration? The problem with traditional DBMSs is that they don't have nice support for exploratory constructs. In some sense, when we look at regions, it more resembles power sets. If you think about, for example, [indiscernible] arrays, we are not looking at every single point in the array; we're looking at a subset of points. It resembles a power set, but databases don't have support for power-set constructs, and they have limited support for user-defined logic. For example, if I'm looking for the brightest region, if I'm solving an optimization problem, they basically don't allow you to define an objective function. They also don't allow you to easily define complex constraints, and [indiscernible] are treated as black boxes. And they don't allow you, for example, to perform customized search, like maybe steering the search in a required direction, which is common, right? Defining search heuristics is really common in search-based problems.
Another problem is that they support interactivity poorly, which is kind of a no-no for interactive exploration. Still, it's possible to express some of the simple queries in SQL. We can take a simple query and basically do [indiscernible] in this kind of batch processing: you can still enumerate all possible regions by dividing data into cells, for instance, then join the cells, combine them together by using something like recursive [indiscernible], for instance, and then perform filtering. [indiscernible] common way to do such processing in database systems.
Such queries are really hard to maintain and really hard to optimize, and this is a very simple query. If we had more constraints and tried to express more queries, the query would grow really fast. At the same time, such queries don't allow any interactivity, because almost surely we'll have blocking operations. Here we have a blocking group-by, which corresponds to the first point here: dividing data into cells requires a group-by.
On the other hand, expressibility-wise, such exploration queries are very easy to express in constraint programming. Constraint programming specifies these queries by using a bunch of decision variables. Here, for instance, objects of interest can be easily expressed by four variables defining the leftmost corner of the region and the possible ranges of the lengths, and then the user defines a bunch of constraints between these variables. Here we have only one constraint, which is average brightness, so I just put it as avg_br, for instance. But we can of course add more constraints; we can define what it means to have a bright region. If we want to express, for example, the neighborhood of a region, we can add more decision variables and, again, tie them together using constraints, and so on.
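To make that formulation concrete, here is a minimal sketch of such a region query written against Google's Or-Tools CP solver, which the Searchlight work discussed later builds on. The toy brightness grid, the threshold of 0.7, and the habit of checking the content constraint outside the solver are all illustrative assumptions; in the real system, content constraints are custom, data-aware constraints.

```python
# A sketch only: four decision variables describe a rectangular region
# (corner x, y and lengths lx, ly). Shape constraints live in the CP model;
# the content constraint (average brightness) is checked against a toy
# in-memory grid after the solver binds the variables.
import random
from ortools.constraint_solver import pywrapcp

random.seed(42)
N = 50
BR = [[random.random() for _ in range(N)] for _ in range(N)]   # toy brightness grid

def avg_brightness(x, y, lx, ly):
    cells = [BR[i][j] for i in range(x, x + lx) for j in range(y, y + ly)]
    return sum(cells) / len(cells)

solver = pywrapcp.Solver("region_query")
x = solver.IntVar(0, N - 1, "x")          # leftmost corner of the region
y = solver.IntVar(0, N - 1, "y")
lx = solver.IntVar(2, 8, "lx")            # user-given ranges for the lengths
ly = solver.IntVar(2, 8, "ly")
solver.Add(x + lx <= N)                   # shape constraints: stay inside the array
solver.Add(y + ly <= N)

# Variable-value heuristic: pick the first unbound variable, try its minimum value.
db = solver.Phase([x, y, lx, ly], solver.CHOOSE_FIRST_UNBOUND, solver.ASSIGN_MIN_VALUE)

solver.NewSearch(db)
found = 0
while solver.NextSolution() and found < 5:
    region = (x.Value(), y.Value(), lx.Value(), ly.Value())
    if avg_brightness(*region) >= 0.7:    # content constraint: avg_br >= threshold
        print("bright region:", region)
        found += 1
solver.EndSearch()
```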
When you define such a problem and feed it to a CP solver, the solver is actually very good at dynamically building the search space and exploring it in very efficient ways. CP solvers have a large number of methods for exploring large search spaces, a large number of heuristics already defined, they support a variety of constraints out of the box, and they are also highly extensible, which is really important for ad hoc exploration if the user wants to define their own extensions. For example, users can easily define new functions or new constraints. They can easily define new search heuristics, and that's important because in constraint programming it is often the case that for a specific problem we want to define a different search heuristic that will allow us to find results much quicker than, say, a general heuristic. This is supported in all CP solvers in a very natural way.
To give a quick refresher on how constraint programming works, this is the traditional backtracking CP solver, which builds a search tree starting from the initial values of the variables. Here, for example, we might have two variables which are not yet bound, because they still have a range of choices, like [indiscernible] and declination in SDSS, for instance. Then the traditional CP solver goes with the search heuristic, which is defined by the user or might be one of the standard heuristics. Here this is a very simple, what is called variable-value heuristic, where first we choose a variable which is not bound yet and then pick some value or values from its domain. So, for example, the first step might be [indiscernible] ascension, and this choice is highly customizable, right? We might decide based on, for example, sampling, or on just random choice, or whatever. It depends on the task at hand and it's really easy to define.
Then, after picking the variable, the solver, here, divides the domain into three parts and picks the middle part, again based on some heuristic, maybe on some estimations. And it continues to do so until we reach a leaf of the tree, where all the variables are bound, and that's where we have a solution. Or, if we are lucky enough, and that's what this is all about, we might get something like this here, where we explore the right branch of the tree. So [indiscernible] two branches, the left branch and the right branch, which are disjoint but cover the whole search space. If we are lucky, at this point the solver might prove that there is no way the constraints can be satisfied for these ranges, so [indiscernible] can prune the entire subtree from the search space, and this is really important for constraint programming. That's what CP solvers excel at; sometimes it's called a search fail. And we believe that for efficient exploration we need to combine these two technologies to make them work together as a single system. And that's basically the motto of our work: CP plus DBMS for data-intensive exploration.
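As a tiny self-contained sketch of the backtracking scheme just described, the following keeps interval domains, picks an unbound variable, splits its domain, and prunes a whole subtree when the interval bounds prove the constraint can never hold; the constraint and the domains here are made up for illustration, and this is not the Or-Tools or Searchlight search code.

```python
# Toy backtracking CP search: domains are [lo, hi] intervals, a variable-value
# heuristic splits a domain, and subtrees are pruned when the bounds prove the
# constraint cannot be satisfied anywhere below ("search fail").

def search(domains, feasible, on_solution):
    # feasible(domains) -> False if the constraint provably fails on these
    # intervals, True otherwise (which may still be a false positive).
    if not feasible(domains):
        return                                  # prune the entire subtree
    unbound = [v for v, (lo, hi) in domains.items() if lo < hi]
    if not unbound:                             # all variables bound: a leaf
        on_solution({v: lo for v, (lo, _) in domains.items()})
        return
    v = unbound[0]                              # variable heuristic: first unbound
    lo, hi = domains[v]
    mid = (lo + hi) // 2                        # value heuristic: split the domain
    for part in ((lo, mid), (mid + 1, hi)):     # left branch, then right branch
        child = dict(domains)
        child[v] = part
        search(child, feasible, on_solution)

# Constraint: x + 2*y >= 40. With interval domains the left-hand side can be
# bounded from above; if even the maximum cannot reach 40, the subtree fails.
def feasible(domains):
    (_, xhi), (_, yhi) = domains["x"], domains["y"]
    return xhi + 2 * yhi >= 40

search({"x": (0, 15), "y": (0, 15)}, feasible, print)
```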
But first I want to talk about our first framework, which we call Semantic Windows. Semantic Windows was our first step towards constraint-based exploration: it already allowed us to define queries based on constraints, but it was a very custom, [indiscernible] solution for a specific problem. It used a custom solver, not a general solver, a solver written by us, which used utility-based search, which I'll talk about, and a kind of sampling-based exploration that allows us to steer the search in promising directions. So it wasn't general; as I mentioned, it supported only first-order queries, and by first-order queries I mean queries with just simple constraints, like aggregate-based constraints. It was not general, it was very specific, and it was hard to extend, but I will talk about lessons learned later.
So again, going back to this example, this is basically an example of a query: we go and search for all the regions satisfying the constraints, and Semantic Windows supports two types of constraints. The first type we actually call conditions in that work, these shape-based conditions or shape-based constraints, which just specify, for example, the shape of the region; and it is important to distinguish them from the content-based conditions, the content-based constraints, because shape-based constraints don't reference data, they just define the shape of the region, they define what we are looking at without looking at the data. Content-based constraints, on the other hand, go to the data; they are more semantic in nature. And we call all multidimensional regions satisfying such constraints Semantic Windows.
And again, we didn't want to go with the SQL approach, which I outlined before with these huge queries, although we of course performed an experimental evaluation comparing ourselves with that; we decided to make the process more search-like, basically as a solver would do it. So we dynamically enumerate all possible windows, subject to pruning of course. We couldn't use extensive pruning, because we strived for the exact result and sampling doesn't allow you to perform 100 percent safe pruning, but we tried to at least take some information, like from the shape-based constraints, for instance, to perform at least some pruning. And the gist here was that we wanted to focus on online results. We don't want to just enumerate windows and check them one by one; we want to go to the most promising areas of the search space. So we basically sorted the windows, the enumerated windows, yeah?
>>: Is this sampling [inaudible] you don't want all answers?
>> Alexander Kalinin: Say that again?
>>: What about completeness? You don't want all answers?
>> Alexander Kalinin: What I meant is actually the other way around. We want all answers; we want the exact result. And the sampling wouldn't allow us to provably prune regions from consideration, because you would be pruning based on some estimation which is not 100 percent correct. So in some sense we guarantee [indiscernible] for provable pruning of regions. That's why we didn't perform extensive pruning in this work.
So we enumerated all windows and we sorted them in order of utility, which I'll talk about on the next slide, but it basically defines how promising the window is to be output to the user. This way we can quickly move to the promising parts of the search space and identify and output results to the user. And this is kind of a solver, you can say; it also builds some kind of a search tree, and it is driven by this utility. Utility is basically a combination of benefit and cost. Benefit measures how close the window is to satisfying the conditions. For example, if we have a brightness constraint, we might estimate the brightness of [indiscernible] window by using sampling and make some guess at how close it is to the user-defined constraints.
And at the same time, since we perform out-of-core computations, we also measure the cost of the window, because again, we don't want to treat the search problem and the data problem separately; we want to work them closer together. So we have to measure how expensive it would be to read the window from disk, to read the corresponding data from disk. We measure it in cells, but basically it is measured in the data we have to read. And we made some adjustments for skewed data, because you have some skewness in data, of course. We also made some provisions for caching, for instance, because we have a lot of overlapping windows; one window brings some data with it, we cache it, and we have to make sure that we count that in the cost function so as not to overestimate,
>>: [inaudible] constraint using [inaudible]?
>> Alexander Kalinin: So here we don't. I will talk about this later when we actually use a constraint [indiscernible]. Here we wrote it ourselves; it's a custom solver in C++, a kind of additional thing on top of PostgreSQL as the DBMS. So it's a completely custom thing. I think we didn't use anything off-the-shelf except for PostgreSQL here.
And so when we measure benefit and cost, what we try to do is go for the beneficial windows which are hopefully more or less cheap. You can say that our search was basically best-first search, when you think about it. We have a kind of priority queue of regions ordered by utility, so we sort them in the order of their utilities. For example, here we will look at region three, and if it satisfies the constraints we output it to the user; and we do need to check whether it actually satisfies the constraints, because utilities are sampling-based and we haven't read the data yet. Then we can generate new windows from that, by, for example, expanding regions more and more, put them in the queue, and [indiscernible] until all regions have been explored. I want to emphasize that we have to keep going, right, because since we want the exact result we cannot just stop at some point and say we are done before we have explored everything; sampling would not allow us to say for sure, so a utility of 0.98, for example, doesn't mean that a region can be safely pruned. We have to go and check. So we have to go through the whole data set, through the whole search space.
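A rough sketch of this utility-driven, best-first loop, in one dimension, with made-up data, thresholds, and a crude sample standing in for the estimation; this is not the actual Semantic Windows code.

```python
import heapq
import random

random.seed(0)
DATA = [random.random() for _ in range(10_000)]   # stand-in for the stored array
SAMPLE = DATA[::100]                              # crude 1% sample used for estimation
THRESHOLD = 0.55                                  # "bright enough" average (assumed)
STEP, MAX_LEN = 50, 200                           # window granularity and maximum length

def estimated_benefit(start, length):
    # How close the sampled average is to satisfying the brightness condition.
    vals = [v for i, v in enumerate(SAMPLE) if start <= i * 100 < start + length]
    avg = sum(vals) / len(vals) if vals else 0.0
    return min(avg / THRESHOLD, 1.0)

def verify(start, length):
    # The expensive, out-of-core part: read the real data and check exactly.
    window = DATA[start:start + length]
    return sum(window) / len(window) >= THRESHOLD

queue = []                                        # max-heap on utility (heapq is a min-heap)
for start in range(0, len(DATA) - STEP + 1, STEP):
    cost = STEP                                   # cells to read; caching would refine this
    heapq.heappush(queue, (-estimated_benefit(start, STEP) / cost, start, STEP))

while queue:                                      # exact answer: explore the whole space
    _, start, length = heapq.heappop(queue)
    if verify(start, length):
        print("result window:", start, length)    # output as soon as it is verified
    longer = length + STEP                        # expand into a larger candidate window
    if longer <= MAX_LEN and start + longer <= len(DATA):
        heapq.heappush(queue, (-estimated_benefit(start, longer) / longer, start, longer))
```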
One more system-level thing, about this kind of CP-DBMS optimization, is that we actually encountered a problem here: we have a lot of small reads dispersed around the data file, so we go and read something here, here, here, here, and this creates a problem with seeks and a problem with thrashing. The seeks are understandable, right, because we sort windows in the order of their utilities, and we didn't take seek costs into consideration here, because it would be too prohibitive, I guess. So when we go to promising regions, to bright regions, what might happen is that we jump between interesting parts of the search space, so we cannot force sequential reads here. At the same time, window locality doesn't imply disk locality, because here we are talking about a relational system, yeah?
>>: So my question [inaudible]? Like, what's the basic data layout, what's the basic [inaudible] data, and how does it [inaudible]? Because you're looking at multidimensional data and there are a bunch of choices for how to organize it on disk, and none of this discussion is very specific to how you [inaudible].
>> Alexander Kalinin: Right. So that's actually what I was going to say about window-locality versus disk-locality. If we look, for example, at something like SDSS, this is basically a two-dimensional array with two coordinates. When we performed the evaluation we tried different schemes. The underlying system is a relational system, so you have some choices for how to put the data there. We tried different things, like maybe ordering it by one coordinate, or maybe doing something like [indiscernible] ordering, for instance. You try to force some locality, because you're still looking at contiguous regions, right? You want to force some locality, but since tuples are laid out linearly you cannot enforce [indiscernible] locality, basically, right? So in all experiments we tried different layouts for the relational system.
>>: So you have a [inaudible]? What's the architecture of your system? It looks like you have a middle layer on top and underneath you have Postgres. And the queries may have certain constraints in them, but they also have just regular predicates that the [inaudible] system can very well evaluate, right? So when you're given a query, what do you do? Do you send some part of the query down to the [inaudible] and then do the rest of it on top, or how do you [inaudible]?
>> Alexander Kalinin: So in this system it's not a SQL query; it's a kind of custom query where you specify, I want these shape-based constraints for the shape of the region, and I want these content-based constraints, like average brightness, for instance. So it's not SQL-based. We don't push the whole query into the database, but we do issue some SQL. What happens is that this custom solver sits on top of a database system, not inside it, just on top. Of course, to compute something, for example to verify a window, we need to compute the average brightness. So to verify constraints we need access to data, and we do this access via SQL queries. That's how it is done. If you specify some additional constraints, some of them, like average brightness, are used for steering the search, while others are basically just pushed down, without further consideration, into the corresponding SQL queries.
>>: [inaudible] writing the database query processing functionality, why use a database? Couldn't you have used a file and just used [inaudible], if all you're doing is pulling the data out and doing your processing [inaudible]?
>> Alexander Kalinin: Well, we believe pulling the data out is actually very expensive, because we envision that the data is still stored in the DBMS, right?
>>: [inaudible]? Are you pulling data differently, and coming up with a priority order for pulling the windows? Given a window, are you pulling the data out of the database and processing [inaudible]?
>> Alexander Kalinin: No. We basically generate a corresponding query. For instance, to verify the constraints we generate the SQL query, push it into the database, and just get the result. So we try not to move the data around.
>>: Some of these data sets are mostly [inaudible]. So currently the architecture seems to suggest you're doing [inaudible] and query processing together in an online fashion. But think of an architecture where you build something like views, and these views could be way more complicated because they don't need to be updated. So if you have aggregate constraints, and you think some part of the data is interesting, you could compute aggregates on that part of the data exhaustively and run queries using those views, if you constrain that the other data [inaudible]?
>> Alexander Kalinin: Right. So the problem we actually saw is that you would have to enumerate everything. If you create such an index, let's say we assume that the users might ask about different functions; here I'm asking for average brightness, but think about SDSS: it has, I think, 500 different attributes for its objects, so users might ask queries based on any of these attributes. We assume that we want to be general. If I know that I'm going to ask a lot of queries based on brightness, for instance (and even in our case brightness is actually a function of multiple attributes, so it becomes a little more tricky than that), let's say we just touch one attribute; we might create an inverted index, but this would require us to enumerate all possible regions, since, as you saw, windows can overlap and can be [indiscernible] arbitrarily, because the user can change the shape of the region via the shape-based constraints. And it's not only about bright regions. We cannot just go to the bright part of the data and enumerate everything there, because some other user might actually need less bright values, or maybe it's all about a range of spectral characteristics for these regions. We also played with velocity: maybe you want to go and see a region with high-velocity objects, for instance, so one user is interested in high-velocity regions and another might be interested in slower regions, I guess.
So in some sense we assume that we cannot identify the interesting part up front; that is kind of a different problem. If we often found the results in a particular part of the data, a particular part of the search space, we might be able to do that. We wanted to be general in the sense that the user's constraints, the results for the user's queries, might touch the whole data set, and that's why we didn't do any indexing for this.
>>: And [inaudible] did you mention that you have to touch the entire data set in order to answer [inaudible] that you have [inaudible]?
>> Alexander Kalinin: Yes.
>>: And on the middle [inaudible] trying to decide [inaudible] for access to the data. What happens if you just do a sequential scan of the data and [inaudible] the predicates as part of doing the sequential scan? Because anyway you have to scan all the data once, it seems.
>> Alexander Kalinin: Right, good question. A sequential scan is actually pretty good. If you have a small search space and a small number of possible windows, you can go and enumerate everything, which was basically the SQL query I showed at the beginning [indiscernible]. The problem with such a query is that it blocks, basically. You express it as a SQL query, you put it inside the database, and then you have to wait until the results pop up. The sequential scan goes in a specific order; if the results are at the end, you have to wait an hour for the query to finish.
>>: But you are exposing some [inaudible] results to the user, and the user can come and, if it is not interesting stuff, maybe stop the query [inaudible]?
>> Alexander Kalinin: She can, yeah. So this was the idea, the interactive results, right? That's why we try to steer the search in the required direction, so we can quickly start outputting results to the user. And then the user might interrupt the query if she wants to. So the idea was to do it as fast as possible.
>>: Do you still need to do a complete scan before you can output even one result, right?
>> Alexander Kalinin: No. As I showed with the [indiscernible], we have this kind of best-first search: when we take the currently best region we check it immediately, and if it satisfies the constraints we immediately output it to the user.
>>: [inaudible] there could be certain constraints where you may need to look [inaudible]?
>> Alexander Kalinin: So if your constraint touches the whole data set, yeah, you're screwed, because you can't do anything about it. We assume that the constraints are tailored to the window, so hopefully they won't touch a lot of data. If they do touch a lot of data, we cannot do much about it, because if you have to compute it, you have to compute it. So here, since the window is more or less small, I guess, compared to the whole data set, the whole sky for instance, you only need to read the data corresponding to the window itself in order to verify it, and that's why we can output results quickly.
>>: So it seems like you use the shape-based conditions first to do some filtering, and then you do the other conditions in your middle layer?
>> Alexander Kalinin: Right. You can say that we use shape-based conditions to generate all the candidate windows, in some sense. We generate them from the shape-based conditions because the user tells us, for instance, that on the X axis the window must be from 3 to 5 degrees in right [indiscernible]; so we know what we have to generate. We use shape-based conditions to generate candidate windows, but then for the content-based constraints, that's when we assess them by using sampling, and that's when we go to the data to verify them.
This is the main idea here. Again, interactivity: we want to identify and output results very quickly. Possibly the user might want to stop the query at some point and not go for the final result, but if she doesn't, eventually we have to get everything. And the problem, as I was saying, is that we not only have small reads dispersed around the data file, we also have this window-locality versus disk-locality problem: windows are semantic constructs, logical constructs, and it doesn't mean that the corresponding data pages are going to be close to each other in the database files, so we have a thrashing problem. We measured this extensively, ran a lot of probes in PostgreSQL, and the seek times were actually very bad.
So we decided to do a small trick, basically a traditional trick, I guess, using prefetching: instead of reading just the window, we read around the window. Again, we have to read everything eventually, so it's fine if we read something extra into the cache, because it would result in better [indiscernible]; with prefetching we basically just read in larger portions.
The trick here, though, is that you don't know how much to prefetch, because if we start prefetching too much we start delaying online results: even if I have identified a promising window, if in addition to reading the window data I also start reading a lot of data around it, we delay the online results. So we went with a progress-driven scheme where we watch what is going on right now: while we are still finding new results we prefetch a small amount, which I think increases over time, but we still keep it a small value so we don't hurt interactivity; but when we stop finding new results (remember that we have to read everything to confirm that), we start increasing it exponentially, which in some sense becomes similar to performing a scan of the remaining data just to confirm that we aren't missing anything.
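A sketch of the shape of that policy follows; the parameter values and class name are assumptions, and this is a simplification of what the system actually does.

```python
# Progress-driven prefetching: keep the prefetch window small while new
# results are still being found, and grow it exponentially once the search
# is only confirming that nothing was missed.

class AdaptivePrefetcher:
    def __init__(self, base_cells=4, max_cells=1024, growth=2):
        self.base = base_cells      # small prefetch while results keep arriving
        self.max = max_cells        # cap, approaching a scan of the remaining data
        self.growth = growth
        self.current = base_cells
        self.results_since_last_read = 0

    def record_result(self):
        self.results_since_last_read += 1

    def next_prefetch_size(self):
        if self.results_since_last_read > 0:
            self.current = self.base                       # stay interactive
        else:
            # no new results lately: widen the reads exponentially
            self.current = min(self.current * self.growth, self.max)
        self.results_since_last_read = 0
        return self.current
```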
>>: So the benefit-based priority ordering, is that computed up front, or is it incrementally maintained as you're making progress through the y-axis?
>> Alexander Kalinin: The priorities are kept per window, and windows are generated dynamically, basically.
>>: [inaudible]?
>> Alexander Kalinin: So basically, as we go further and further [indiscernible], for instance, and [indiscernible] some data, we actually improve the quality of our estimations, since windows can overlap and we generate more and more windows, so we have a current queue of promising candidates. Since we bring in some data, we also improve the quality of our estimations to try to actually ...
>>: Is that the dynamic nature of the queue that makes your prefetching problem hard?
>> Alexander Kalinin: Yes, because we don't want to enumerate everything at the beginning, basically. Plus, we don't know the order in which we are going to read our windows, because in some sense the initial ordering depends on the sampling, and then we keep changing our estimations and changing our utilities. But that's a very good point. We didn't do any experiments with that, but we could try to create the whole queue beforehand and maybe come up with an optimal prefetching order; that's a good point, we didn't think about it at the time.
And this is a small graph which shows why we don't want to use something like PostgreSQL with a sequential scan. PostgreSQL here is the green line, and you can see that in some sense it's fast: the sequential scan goes through everything, but you output everything after one hour, for instance, which is not a good way to go. The static approach is where we do some prefetching, but a really small amount. You can see that we can identify the results quickly, well, here actually not that quickly because of the prefetching, and then we start going down because we perform too many reads: too many candidates incur too many reads and the overhead starts piling up. At the same time, with the adaptive prefetching, the first result, I think in most cases, comes up in 10 or 20 seconds despite the total time being one hour, for instance. And here you can see that eventually we don't lose to PostgreSQL that much, because again, we tailor the prefetching to the current situation. Yeah?
>>: So one thing about the results: as you might know, there are a lot of optimizations that can be done with sequential scans. [inaudible] PostgreSQL doesn't have a lot of those optimizations. For example, you could build a [inaudible] representation and scan all the complex data, and assuming a standard [inaudible] or a few hard drives spinning, those [inaudible] gigabytes of data in one scan, you don't need to scan it again and again and again. Just one sequential scan can be done in a matter of less than a minute; I think we did the math, at 150 megabytes per second of disk bandwidth [inaudible] and [inaudible]. So coming back to the question: I understand your [inaudible] to give interactive results [inaudible]. If a lot of the queries are being terminated [inaudible] after seeing some odd [inaudible] results, then you gain a lot by giving the interactive results. But in the quest for getting the interactive results, a question to ask is how much are you leaving on the table? You could have done a very efficient, fast sequential scan of the data and gotten the entire result, that may not [inaudible] boundary, in a matter of minutes, probably.
>> Alexander Kalinin: Well, that's true, but in some sense we also depend on the database's optimizations, right? We perform the same queries as well. For example, if PostgreSQL doesn't perform complex optimizations, we also abide by that. So it would be interesting to repeat this experiment, because this experiment is kind of an old one; we didn't have access to SDSS or even the VM simulator or something like that, right? It would be really interesting. But again, if we moved to a more complex query optimizer, or if we moved towards more efficient disks, more efficient storage, we would also benefit, because we also perform a lot of SQL queries, and especially for prefetching we retrieve a lot of data, so we would benefit from that tremendously. So maybe the gap between doing the sequential scan and outputting the results really quickly would be smaller, but I think the gap will be super [indiscernible].
And this relates to the earlier question about our architecture: it was actually a distributed solver which sat on top of the DBMS and performed some search distribution across workers. There is not much to talk about here; I just wanted to point out that this was a very custom thing, a very custom solver, a very custom solution.
We learned some lessons from Semantic Windows. First of all, Semantic Windows supports window queries, which basically imply simple constraints, and it's very hard to extend: if you want to add more types of constraints, it's hard to do, because it's hard to extend this kind of solver; it is very specific. We have to read everything because, as I said before, sampling doesn't allow provable pruning based on content, so in some sense the completion time, if you really want to find everything, has a lower bound, which is the sequential scan. And we didn't perform any search balancing here. We distributed the search space around, but at some point what might happen is that some nodes, due to data skew for instance, finish their part of the search space really quickly and just sit idle, doing nothing.
And I believe another issue is that we tied the search and data partitions together, because we just distributed the data and each worker looked only at the corresponding data partition, which was not the way to go. Basically we want to distribute search and data on different levels, which would allow, I think, much more interesting balancing.
So I want to move to our main system, I guess, Searchlight, which is the actual implementation of this CP plus DBMS idea. We moved to array databases, mainly SciDB, because scientific data sets can be stored there efficiently. For example, we store SDSS there as a two-dimensional array in [indiscernible] and declination. If you think about something like time series data, again, this is also a bunch of arrays, and so on. And we took a traditional CP solver, the one from Google's Or-Tools, Operations Research Tools from Google, which is open source and freely available. That makes our queries CP-based: we don't use SQL, AQL, or whatever the database gives us; we formulate our data exploration queries in the form of constraint programs.
Constraints still primarily deal with aggregates, because since we deal with these kinds of regions, the user specifies aggregates. In principle, subsequence matching can also be seen as an aggregate, because it deals with the distance between two sequences. But we actually extend this: these aggregate functions, like average brightness, become building blocks. The system is extensible in that it allows us to define new types of constraints based on these building blocks. For example, I might easily express the difference between the window itself and its neighborhood and compare them somehow, or I can easily define the difference of aggregates between two different windows in my search space. So constraints become more elaborate, more interesting.
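For instance, a constraint like 'the window is noticeably brighter than its neighborhood' might be composed from an avg() building block roughly as follows; the helper names, the margin, and the 0.2 gap are made up for illustration and are not the Searchlight API.

```python
# Aggregates as building blocks: avg() over a region is the primitive, and a
# user-defined constraint combines several such calls.

def avg(array, x, y, lx, ly):
    vals = [array[i][j] for i in range(x, x + lx) for j in range(y, y + ly)]
    return sum(vals) / len(vals)

def window_vs_neighborhood(array, x, y, lx, ly, margin=2, gap=0.2):
    # Expand the window by `margin` on each side, clipped to the array bounds.
    # For simplicity the expanded region still includes the window itself.
    n, m = len(array), len(array[0])
    ex, ey = max(0, x - margin), max(0, y - margin)
    elx = min(n, x + lx + margin) - ex
    ely = min(m, y + ly + margin) - ey
    return avg(array, x, y, lx, ly) - avg(array, ex, ey, elx, ely) > gap
```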
And I want to start from the result of exploring the alternatives. We have two alternatives here, right? As I said at the beginning, exploration has two sides to it, so basically we have two alternatives if we want to perform such exploration queries. The first alternative is what we just call CP here, which is basically just taking a constraint programming solver and making it work with out-of-core data. This is a bad way to go, because constraint programming solvers don't work with out-of-core data at all: they assume the data fits into memory and that constraints are quite inexpensive to check. They don't deal with hardness of computation; they deal with large search spaces, not large data sets.
But we did this for the comparison. In this case CP is actually put really close, inside the SciDB engine. We don't perform any special optimizations, but it sits close to the data, so we don't incur any unnecessary serialization here for the CP approach. And SciDB is the other alternative, where you try to express these kinds of window queries using SciDB's language, AQL, the Array Query Language. It's not possible to do for all the queries we support, but for some queries it is possible. And when we tested with the large search space, and if I remember correctly the large search space was about 10 to the power of 8 or 9 windows, maybe more than that, for some data, for some search spaces, it's really hard to finish in a reasonable amount of time, because it results in very expensive computation.
So in this experiment we decided to limit the execution to one hour, and the use case is basically like that: the user gives this query and says, find me something in one hour, at least something. And you can see that with Searchlight, using this combination of technologies, we can actually identify the first result in about five seconds, and then there are delays between subsequent results, because after we find one result we need to find another. So what I want to show here is that Searchlight, the main goal of our system, is to output results quickly and to keep outputting results as interactively as possible.
At the same time, CP and SciDB didn't produce any results in one hour, and the reason is that SciDB employs the same traditional database approach. Here, for example, you can define your query as a kind of moving-window operator: you just move your window through your data, but it does so in a very direct way, from the top left corner and so on through your data. So if your results are not near the top left corner, for instance, you might not actually be able to get to the end of the data set in one hour, and at the same time you might not find anything. When you look at CP, CP is smart about building the search space, but it performs a lot of these constraint evaluations, not only at the leaves of the search tree: it also has to make pruning decisions, so it makes a lot of constraint computations at the internal nodes of the search tree, and that creates a lot of overlapping computations, a lot of overlapping data accesses to the database, which cannot be optimized well because CP solvers are not aware of such optimizations.
For the small search space we can actually finish the query much faster; at least the CP and SciDB approaches are able to do that. And again, they have the same problems. Even if you are able to finish the query, by using pruning it is possible to steer the search in the required [inaudible] and find the results quickly, and we can also prune a lot of unnecessary data by using the constraint programming solver. We can actually finish the query in five seconds, while the alternative approaches get stuck for quite a while, not only in finding the first results but also in query completion times.
So the main idea behind Searchlight is the solve-validate approach. Since CP solvers are very bad at accessing out-of-core data, we don't want them to work with out-of-core data. So instead of going to the data immediately to assess all these constraints, we precompute a synopsis array, as we call it, which contains some information allowing us to answer these aggregate functions with bounds. For example, if I'm looking at the average brightness of a region, using the data I might say the average brightness of the region is 10, for instance, right? But with the synopsis we might say that it is from 5 to 15, for instance, or something like that. So the synopsis allows us to answer aggregates with these intervals, but the interval is one hundred percent correct in the sense that the average is not going to be less than 5 or greater than 15; it is definitely going to be from 5 to 15, for instance. This allows us to make quick estimations at the internal nodes of the search tree and prune unnecessary parts of the search tree, and the data with them.
And we assume that the synopsis actually fits into main memory. It doesn't have to; the system will work without that, but it will really benefit if the synopsis fits [indiscernible], and the synopsis is considered quite small. The caveat is that since we produce these kinds of bounded estimations instead of real answers, we might have false positives. So in this case the CP solver is not producing solutions, as in the traditional case; it produces candidate solutions. Again, we guarantee that we don't have false negatives, but we might have false positives, and that's where the second component, [indiscernible], where the data comes in, which validates these candidate solutions based on the real data. It won't go to the synopsis, although it might as an optimization; in general it will go to the original data array to verify the constraints.
And this is actually a very general thing to do, because the validator is not something custom-written for every constraint. We don't say, for this constraint type do this, for that constraint type do that. It actually also employs a CP solver to validate the constraints, because validation is also a CP problem: you have decision variables, you have constraints, and a solution is just an assignment of the decision variables. So you can take a solver, assign the decision variables, and the solver will automatically check the constraints. What it means for us is that we can just clone the model, and by the model I mean the decision variables and the constraints from the main CP solver, and do the validation, again, with the solver itself.
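In spirit, the split looks something like the sketch below; the helper names are assumed, and in the real system both halves are CP solvers that share a cloned model of variables and constraints.

```python
# Solve/validate split: the solver side checks constraints against synopsis
# bounds and may emit false positives; the validator side re-evaluates the
# same constraint on the real data and keeps only true solutions.

def solver_side(candidate_regions, synopsis_bounds, threshold):
    candidates = []
    for region in candidate_regions:
        lo, hi = synopsis_bounds(region)       # bounded answer, e.g. avg in [5, 15]
        if hi >= threshold:                    # cannot be ruled out: a candidate
            candidates.append(region)
        # if hi < threshold, the region (and, in the real solver, a whole
        # subtree of regions) is pruned without ever touching the data
    return candidates

def validator_side(candidates, exact_avg, threshold):
    # Same constraint, evaluated on the real array; filters out false positives.
    return [r for r in candidates if exact_avg(r) >= threshold]
```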
>>: So does it mean that if your constraints involve some [inaudible] function like brightness, you need to know that up front? Because it looks like the synopsis array [inaudible] compute those aggregates based on this [inaudible]?
>> Alexander Kalinin: So what happens here is that, I will talk about the synopsis later, but we assume that for every type of function we have a different type of synopsis. For traditional aggregates, yes, we need to compute some aggregate information, like min, max and count, for parts of the search space.
>>: So you cannot [inaudible]?
>> Alexander Kalinin: Yes and no, in some sense. No if we want to add another function: for example, let's say I want to perform subsequence matching and I want to add a function like the distance between two sequences, right? In this case we have to pre-compute something to be able to assess these distances, so we would create some traditional index, as in subsequence matching. And yes, in the sense that these are just building blocks. I can take these aggregates and combine them into different constraints. For example, if I want to perform some kind of anomaly search, I'm looking at a window and at the neighborhood of this window, I can measure the value for the window and the value for the neighborhood using aggregate functions, and I can combine these aggregate functions in different ways. But yes, we have to know what kinds of functions you are going to use, because the synopsis is based on these kinds of functions.
>>: Question. So you support ad hoc arrays, and the [inaudible] use [inaudible] any type of array they want [inaudible]? So I was wondering, [inaudible] uses just one extreme of value after [inaudible] for specific kinds of queries, whereas here, what are the assumptions that you are making about the workload [inaudible]? Is it just the functions, or is it that you want to analyze those arrays and you have some idea how you want to optimize [inaudible]? Because there are a few things that you mentioned: one was that for performance you want the [inaudible] synopsis to be small and fit in memory; for a given type of function you want to create a different type of synopsis array. So there probably is something missing, which is in this space of the ability to support ad hoc queries versus making assumptions about the workload and preparing [inaudible]. Can [inaudible]?
>> Alexander Kalinin: Right. So regarding the previous question, I was against indexing, I guess, because even if you take an average, like average brightness, there are different ranges of values that you can explore, so one user is interested in one range of values and another in another, and so on. Eventually we would have to index every possible window, because we would need some kind of inverted index from brightness to candidate windows. Here we make a kind of weak assumption that we know only the functions. I know we are going to use just average brightness in our query, and this is it; we don't need to know anything else. So whatever range the user throws at us, or whatever the combination of functions is going to be, we are going to handle it. So we don't tailor the synopsis to values: the synopsis type is tailored to the type of functions we are going to use, but it is not tailored to the specific ranges of values we are looking for.
>>: And even in one of the slides you had [inaudible] DB data, [inaudible], I think that was the previous one. So [inaudible] can you give an idea of how big a synopsis you're creating and [inaudible]?
>> Alexander Kalinin: Yeah. So I will talk about the synopsis a little bit later. As for the size, I don't quite remember; we actually stored different synopses at different resolutions, but I think we can get pretty good times by using something like 100 megabytes per 120 gigabytes of data. It doesn't have to be much. The more you store, of course, the better estimations you get, but I'll talk about this; it's very interesting. So this is basically how we do it: the validator then checks the candidate solutions against the data and filters out the false positives.
So, as I said, the synopsis is a kind of general concept here. I don't actually like to call these approximate answers; the synopsis just creates bounds for aggregate calls, but these lower and upper bounds have 100 percent confidence, so we don't have false negatives for sure. And, as I said, they differ for the different types of functions we use. We use aggregate grids for arrays. If we used relational data, which we didn't explore, we could do something like [indiscernible]-resolution aggregate trees, where the [indiscernible] trees would be trees with additional information, or we could use something like DFT-based indexing for data sequences, and so on, so we are not [indiscernible] here.
And this is an example of an aggregate synopsis. Let's say this is the original array, [indiscernible] array; we might have missing values. What we do is divide it into a grid, and the size of the cells is a [indiscernible] parameter. Here we divide it into cells of size two by two, and for each cell we store some min, max, sum and count information. This is actually based on the [indiscernible]-resolution aggregate tree work, but on a regular grid we don't need this kind of hierarchy; we can get away with flat aggregation, in some sense. But this is going to be the synopsis. The resolution of this synopsis is said to be two by two, because the cells have size two by two.
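A minimal version of building such a synopsis might look like the following; the names are illustrative, and the real layout over SciDB chunks is more involved.

```python
# Aggregate synopsis: divide the array into cells of a fixed resolution and
# keep (min, max, sum, count) per cell, skipping missing values (None).

def build_synopsis(array, cell):
    n, m = len(array), len(array[0])
    syn = {}
    for ci in range(0, n, cell):
        for cj in range(0, m, cell):
            vals = [array[i][j]
                    for i in range(ci, min(ci + cell, n))
                    for j in range(cj, min(cj + cell, m))
                    if array[i][j] is not None]
            if vals:
                syn[(ci // cell, cj // cell)] = (min(vals), max(vals), sum(vals), len(vals))
            else:
                syn[(ci // cell, cj // cell)] = (0, 0, 0, 0)   # empty cell
    return syn
```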
Then what happens is that the solver comes for estimations, here for the average function. The problem, as you can see if you have such a wide region to estimate, is that for this cell we know the exact sum and count, so its contribution to the average is easy to compute; but when we start intersecting these other synopsis cells only partially, we don't know the underlying distribution within the cell, we just know the min, max, sum and count. So what we do is come up with an upper- and lower-bound estimation; that's why instead of real answers we get bounds. You can see that with higher- and higher-resolution synopses we should get better and better estimations, because we will cover these parts of the region more exactly. But that's the idea: for each such partially covered cell we come up with upper- and lower-bound distributions, candidate distributions if you will, which might not be the real distributions but are guaranteed upper and lower bounds. So we guarantee that if we say from 5 to 15, it's not going to go beyond that interval.
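Continuing the sketch, a bounded average over a window that only partially covers some synopsis cells could be computed as below. The bound used for partially covered cells is the simple min/max one, which is looser than the distribution-based bound described in the talk but still never excludes the true value; for simplicity it assumes no missing values inside the window.

```python
# Bounded average from the synopsis: exact sum/count for fully covered cells,
# [k * min, k * max] for the k elements falling into partially covered cells.

def avg_bounds(synopsis, cell, x, y, lx, ly):
    lo_sum = hi_sum = 0.0
    count = 0
    for ci in range(x // cell, (x + lx - 1) // cell + 1):
        for cj in range(y // cell, (y + ly - 1) // cell + 1):
            cmin, cmax, csum, ccnt = synopsis[(ci, cj)]
            # number of window elements that fall into this synopsis cell
            ox = min(x + lx, (ci + 1) * cell) - max(x, ci * cell)
            oy = min(y + ly, (cj + 1) * cell) - max(y, cj * cell)
            k = ox * oy
            if k == ccnt:                   # fully covered: exact contribution
                lo_sum += csum
                hi_sum += csum
            else:                           # partial coverage: bound the contribution
                lo_sum += k * cmin
                hi_sum += k * cmax
            count += k
    return lo_sum / count, hi_sum / count

# usage: lo, hi = avg_bounds(build_synopsis(array, 10), 10, 3, 7, 12, 5)
```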
We moved beyond that, because you can see what happens here: this [inaudible] poor coverage, we cover only a small portion of the synopsis cell. So we could go and use the traditional pyramid-based approach, right? We have a pyramid of synopses at different resolutions, and the trade-off is that when we move towards a really low-resolution synopsis like this one, we can estimate everything very cheaply, but the estimations are going to be really, really loose, which basically will not allow us any pruning. Whereas if we move to a higher-resolution synopsis, we incur a lot of computation overhead, because estimations are not cheap and we perform a lot of them. Remember, the CP solver doesn't only perform such estimations for candidate solutions; it also performs them frequently at the internal nodes of the search tree, because it has to prove that it can prune the corresponding part of the search tree.
So we ended up using a simple heuristic, a kind of dynamic approach; in Searchlight we can do everything dynamically, I guess. We look at the current state of what is happening during query evaluation. For example, if the solver asks us to estimate this region, we look at the coverage, and we can say this cell is covered fully by the region, so we can just take the sum and count from there; this cell is covered fine; but this one is covered really poorly, because we only touch a really small fraction of the synopsis cell, and for the synopsis cell we know only min, max, sum and count, so the estimation is going to be really poor. So for such poor coverage, only for this small portion, for this red cell, we go to a higher-resolution synopsis, and we improve our estimations until we have good coverage of all the synopsis cells.
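The coverage heuristic itself can be sketched separately from the estimation code; the 50 percent threshold below is an assumed parameter, not the value used in Searchlight.

```python
# Decide, per overlapped coarse cell, whether its bounds are good enough or
# whether that piece of the window should be re-estimated on the next,
# finer-resolution synopsis.

def refinement_plan(overlaps, resolutions, min_coverage=0.5):
    """overlaps: list of (covered_elems, cell_elems) for each coarse cell the
    window touches; resolutions: cell sizes from coarse to fine, e.g. [100, 10]."""
    plan = []
    for covered, total in overlaps:
        if covered / total >= min_coverage or len(resolutions) == 1:
            plan.append(resolutions[0])   # coarse cell is covered well enough
        else:
            plan.append(resolutions[1])   # poorly covered: refine this piece
    return plan
```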
We performed a lot of experiments on this feature, and it's actually worth it, right? You can look at different resolutions; here it is, I think, a 100,000 by 100,000 array, I think it's synthetic data, and you can come up with different resolutions here, and the right column actually combines the three in this dynamic approach. You can see that sometimes, if you use really coarse estimations, you might not be able to finish the query in a reasonable amount of time. At the same time, you can see that the sweet spot here is the 100 by 100 synopsis: it generates a lot of candidate solutions, but at least it isn't stuck in expensive computation. And if we combine the approaches we are able to do better than that; here it's not much better, but still we are able to cut at least one minute from the best synopsis. And what is actually good about this is that the users don't have to pick the synopsis. Otherwise they would have to pick the correct synopsis; here we just let the system decide, and it makes a very good choice.
At the same time, for a small search space, we have another sweet spot, because here the 10 by 10 synopsis performed better: the search space is small, so the number of candidates is not that large, and the 10 by 10 synopsis could get away with performing fewer computations, so it was the best choice. But again, if we gave this choice to the users, the users would have to decide, and we don't want that. So you can see the dynamic approach adapts itself to the situation, which I think is really neat.
So Searchlight is a distributed system. As I said before, we wanted to distinguish between two layers. The first layer is the search level, where the search space is handled by a bunch of CP solvers that work independently. We divide the search space between the different solvers and let them explore their disjoint parts; I'll talk about balancing on the next slide. On the other level we partition the data for validation, so we have a bunch of validators corresponding to different data partitions, and they work completely independently of each other. The CP solvers produce candidates and send them to the appropriate validators, so if you have a cluster you can put them on different machines and do whatever you wish; we can have a different number of them depending on what kind of resources you have, and so on. This actually allows us to partition the search space much more freely, because, for example, we can make balancing decisions at the CP solver level without considering the data.
So the challenges here are how to partition the search space and how to partition the data. And the next question, which was actually very interesting when we started doing this, is where to send candidate solutions. When a solver produces a candidate solution, we don't know what data is going to be needed for its validation. We don't look inside constraints; we don't look inside functions. We don't know what data we need to touch, because the CP solver touches only the synopsis; that's fine, but we don't know what data the validators are going to touch. In particular, we are interested in which data pages, or inside SciDB which data chunks, we will need, because these chunks determine to which validator we want to send the candidate: we want to send it close to the data, to the validator that has most of the data needed for the validation. So that is, I guess, the third challenge: how to do that.
So let me start with search partitioning, where we perform static partitioning at the initial level, which is a very simple approach. Since hotspots tend to be contiguous, we try to cover them with multiple solvers so that solvers won't sit idle. The parameter here is the size of the slice; it's really simple. So this is the initial phase. What might happen, though, is that a hotspot is covered by only two solvers while solvers three and four sit idle at some point, or maybe we got the slice size wrong, and so on.
>>: [inaudible]? So how do you know which set of attributes is [inaudible] hotspots?
>> Alexander Kalinin: So [indiscernible] the attributes, I guess. A hotspot here is a part of the search space that has a lot of candidate solutions, that is promising. We don't know where it is yet, so this is just a heuristic: we try to guess where the hotspots are going to be, and that's why we do this kind of [inaudible] distribution, to cover contiguous hotspots with multiple solvers. We don't know where the hotspot is; we may get it completely wrong. That's why we have this second stage. This is a kind of embarrassingly parallel solution in some sense: you just create a lot of little slices and redistribute them across solvers.
>>: Is it just [inaudible]? Would we need replication for correctness at some point?
>> Alexander Kalinin: [inaudible]?
>>: Sometimes your synopsis may [inaudible]. You can get [inaudible].
>> Alexander Kalinin: Right. Since the synopsis is small, we can replicate it across all the machines participating in the search; 100 megabytes is easily replicated. What happens in the implementation is that it is just pulled transparently: we don't replicate it beforehand, but the synopsis array is also handled by SciDB, and when a solver needs some of it, it pulls it from another machine. Eventually this means the synopsis is going to be replicated on all the solver machines, so there might be some data movement, but it's [inaudible] going to be really small.
Of course, we might get this [indiscernible] distribution wrong. So what we have is dynamic search balancing, which is a fallback solution that tries to remedy the situation. If a solver has just finished its part of the search space, it can make itself available to a busy solver, and the busy solver just moves one of its subtrees over. Moving subtrees is actually very inexpensive, because every solver knows the model, and the model comes from the query itself; so it's basically: take this variable, take an interval of its domain values, and treat that as your own search tree. Then the left solver backtracks from there and continues with another part of the search space, the helper explores the subtree it received, and so on. It's a very dynamic thing; it depends on which solvers are struggling right now and so on.
If you are familiar with constraint programming you can see this is very similar to [indiscernible] in constraint programming solvers; they have been doing similar stuff. And it's a very dynamic solution because we might have different helpers over time. If you have a lot of solvers and a large search tree still to explore, it's still going to be distributed dynamically across many solvers, and the balancing is going to continue throughout; we're trying to keep all the solvers busy during the computation.
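A toy sketch of that handoff, assuming a search slice is represented simply as per-variable domain intervals; the actual solver state and coordination messages are of course richer than this.

from dataclasses import dataclass

@dataclass
class SearchSlice:
    # Remaining domain interval per decision variable, e.g. region x-coordinate.
    domains: dict  # variable name -> (lo, hi), inclusive

def split_slice(busy: SearchSlice):
    """Give away roughly half of the domain of the widest unfinished variable."""
    var, (lo, hi) = max(busy.domains.items(), key=lambda kv: kv[1][1] - kv[1][0])
    if hi <= lo:
        return None                       # nothing left to share
    mid = (lo + hi) // 2
    busy.domains[var] = (lo, mid)         # busy solver keeps the lower half
    stolen = dict(busy.domains)
    stolen[var] = (mid + 1, hi)           # helper explores the upper half
    return SearchSlice(stolen)

busy = SearchSlice({'x': (0, 999), 'y': (0, 99)})
helper_work = split_slice(busy)
print(busy.domains, helper_work.domains)
# {'x': (0, 499), 'y': (0, 99)} {'x': (500, 999), 'y': (0, 99)}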
Here is a small example of what's going on for static and dynamic partitioning; static partitioning depends on the number of slices. You can see what might happen with the slices: we have eight solvers, but four of them finish almost immediately, you can barely see this really small bar here, so some solvers finish really fast and sit idle. If you get the number of slices right you might get better balancing, but again, you have to guess the slice size; if you create too many slices it brings some overhead with it, and it still might not work for all the queries. And with dynamic partitioning it's not ideal either, because we still have the granularity of these internal nodes and it depends on which node we end up handing off, but we try to keep all the solvers busy, which I think we do here.
Another thing, as I said, is data partitioning. We don't do anything really special here: we divide the data statically between validators, and we don't perform any data [inaudible]. If a candidate needs some data for its validation we just fetch it transparently, like a database would. We do support some data transfer, which I probably won't have time to talk about, for when we really need to redistribute data between validators. We try not to do that, because we are trying to bring queries to the data, in the sense of bringing candidate solutions to the appropriate validators. But we might have a situation where one validator gets all the candidates because its partition is really promising, it's the part of the data that contains all the results. In that case validators two and three might sit idle, in which case we will perform some redistribution, and then the data partitions might not stay as even as shown, because we might move some candidates from validator one to validator two and redistribute some data from partition one to partition two. So we do perform that as well, but we try to avoid it. I just want to point out that data partitioning is also dynamic, but only in critical situations, because search balancing does not require moving data, so it's really cheap, whereas data repartitioning might not be that cheap.
The third problem was determining where to send candidates. As I mentioned at the beginning, we try to be really general: we don't want to look at the functions or the constraints and try to parse out which chunks we are going to read. As I said at the beginning of the Searchlight portion of the talk, the validator also runs a CP solver; it validates candidates by using a CP solver. So instead of validating a candidate against the real data right away, when we produce a candidate solution we first perform a simulation of the validation: instead of going to the real data, we just give trivial bounds for the functions without touching the data at all. It's very cheap; it's just CP-based. But in this simulation we can log all the data requests the validation would make for this particular candidate solution. Now that we know which data accesses are going to be required for the validation of this candidate, we move the candidate solution over to the validator that contains most of that data. So we try to avoid moving data around.
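A rough sketch of that routing idea: "validate" the candidate against a stand-in that only records which chunks would be read, then send it to the validator owning most of those chunks. The chunk size, candidate shape, and ownership map are invented for illustration.

from collections import Counter

CHUNK = 1000   # assumed chunk length along one dimension

def chunks_touched(candidate):
    """Simulated validation: log chunk ids instead of reading real data."""
    lo, hi = candidate['region']                     # e.g. a 1-D interval
    return {c for c in range(lo // CHUNK, hi // CHUNK + 1)}

def pick_validator(candidate, chunk_owner):
    """chunk_owner: chunk id -> validator id (static data partitioning)."""
    votes = Counter(chunk_owner[c] for c in chunks_touched(candidate))
    return votes.most_common(1)[0][0]                # validator with most local data

chunk_owner = {c: (0 if c < 50 else 1) for c in range(100)}
candidate = {'region': (48_500, 52_300)}             # spans both partitions
print(sorted(chunks_touched(candidate)))             # [48, 49, 50, 51, 52]
print(pick_validator(candidate, chunk_owner))        # 1 (owns 3 of the 5 chunks)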
We perform a bunch of other optimizations which I probably don't have time to talk about. We actually use the synopsis for validations: in some cases, if our candidate region [indiscernible] is completely aligned with the synopsis, so there are no partial intersections with synopsis cells, we can use the synopsis itself to validate aggregate functions, which is a very cheap way to do it. Also, to avoid thrashing, we try to batch close candidates together. Since we have this complete division between the solver and validator layers, candidate solutions can come from different solvers in arbitrary order, so we try to avoid performing a lot of reads from different parts of the data file: we combine candidates that are close to each other so that they read approximately the same data, from the same locality.
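A small sketch of that batching idea, grouping pending candidates by the chunk where their reads start so one pass over a chunk can serve several of them; the chunk size and the grouping key are illustrative assumptions, not the system's actual policy.

from collections import defaultdict

CHUNK = 1000

def batch_by_chunk(candidates):
    batches = defaultdict(list)
    for cand in candidates:
        anchor = cand['region'][0] // CHUNK      # crude locality key
        batches[anchor].append(cand)
    return batches                               # each batch is validated together

pending = [{'region': (48_100, 48_900)},
           {'region': (48_200, 49_400)},
           {'region': (90_000, 90_500)}]
for anchor, group in batch_by_chunk(pending).items():
    print(anchor, [c['region'] for c in group])
# chunk 48: two nearby candidates batched; chunk 90: the far one on its own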
We also perform solver-validator balancing, which is basically a redistribution of CP resources between the two layers, because again we try to be dynamic here as well; it really depends on what's going on. At the beginning we don't have any candidates to check, so we don't want to direct resources to the validators; we direct all CP resources to the solvers. At some point we might have a lot of candidate solutions, in which case we might pause the search and divert more CP resources towards the validators, basically by starting more threads; that's what we do. And we try to utilize idle time as much as possible: for instance, when we redistribute the search space at the solver level there might be some idle time on the solvers, and even then we try to perform more validations and so on. And again, when we have validated all the candidates, or the number of pending candidates is really small, we divert more resources back to the search. And, as I said, we perform some candidate relocation, which is the redistribution of candidate solutions between validators, but we try to avoid that because it might result in large data movement.
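A toy sketch of such a balancing policy: shift threads toward the validators when the candidate queue grows and back toward the search when it drains. The thresholds and thread counts are made-up knobs, not Searchlight's actual values.

def rebalance(queue_len, total_threads=8, high=1000, low=50):
    if queue_len > high:        # many pending candidates: pause search, validate
        validator_threads = total_threads - 1
    elif queue_len < low:       # queue nearly empty: push the search forward
        validator_threads = 1
    else:                       # in between: split the resources evenly
        validator_threads = total_threads // 2
    return total_threads - validator_threads, validator_threads

for q in (0, 500, 5000):
    print(q, rebalance(q))      # (solver_threads, validator_threads)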
This is actually a result for a real data set, the [indiscernible] SDSS. It's 80 gigabytes because that's the real size. It has a lot of attributes, and SciDB supports a kind of vertical decomposition, so we access only the attributes we need; here we access all five spectral attributes, which is why it's about 80 gigabytes of real data. You can see that we vary different parameters affecting the selectivity of the query: region size and magnitude, where magnitude is the spectral characteristic of the regions. We see this as a good starting point, because we actually find results really fast and the total completion time is also acceptable. Compared to SciDB, for instance, SciDB cannot perform such queries at all: either they are not expressible at all in AQL, the regular query language, or you can express them by using complex scripts, like in [indiscernible], where you ask several queries and combine the results yourself, which is really prohibitive. I think we ran some of those queries for many hours and still hadn't gotten any results.
>>: Is this [inaudible] benchmark? [inaudible] queries that you made up?
>> Alexander Kalinin: These are our queries. We're basically looking for regions satisfying some ranges of spectral values. This is a good starting point for interactive exploration. So this is all for the main work, I guess. We have some ongoing work, and if I have time I'll go into the directions right now. The first direction is proving the generality of Searchlight, which is exploring new data sets and constraints. For instance, we wanted to show that, as I said at the beginning, something like subsequence matching is also a similar type of problem, and we can solve all these problems using a single system instead of a bunch of point solutions spread around. This direction is basically completed: it was actually really easy to incorporate subsequence matching inside this framework and it works really well, but results are still pending. So [indiscernible] is readily expressible as a constraint program in the system, and our system can handle such queries really well. Of course it relies on the existing research on indexing of time sequences, on computing all these DFT-based representations.
The second thing is query relaxation, which is basically the case where users do not quite know what they're looking for, or they get the constraints wrong. This is from the [indiscernible] data set, which contains time series data for different ICU patients, and the queries we are looking at here involve subsequence matching. And they can go beyond subsequence matching: subsequence matching is only one type of constraint, and we can add other types of constraints to the same query. Not only can you look for a subsequence of a time series similar to a specified one, you can also, in the same query, specify additional constraints, for instance that the neighborhood must have particular properties, and so on.
We introduce new functions there, like distance-based functions, and we introduce synopses based on the existing work on DFT indexing and so on. As for query relaxation, which is ongoing work, we use it in the traditional sense: what if the user doesn't get enough results, or doesn't get the constraints right? What can we do about this? The problem is a little bit different from relational systems, I believe, because in relational systems you can often estimate the cardinality of your result, or at least use an index to understand how you want to change your constraints. Here that's harder to do, because indexing is hard and, since we have these general constraints, it's hard to estimate the cardinality of the final result. The main idea, which I recently implemented but have no results for yet, is that we watch dynamically what happens: if the solver fails at some point during the search, it fails for a reason. We know why it failed, which constraints failed and how they failed. So we use this information to relax the original constraints, replay the fails, and see what's going on.
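A much-simplified sketch of that behavior: while exact matches are scarce, also emit candidates that violate a constraint by a small margin so the user sees something early. The real mechanism derives the relaxation from the solver's failure information rather than a fixed slack, so the slack value and the streaming interface here are invented.

def explore(candidates, threshold=10.0, slack=6.0):
    """Original constraint: avg < threshold."""
    found_exact = False
    for cand, avg in candidates:
        if avg < threshold:
            found_exact = True
            yield ('exact', cand, avg)
        elif not found_exact and avg < threshold + slack:
            yield ('relaxed', cand, avg)      # near-miss, shown only early on

stream = [('r1', 15.0), ('r2', 12.0), ('r3', 13.0), ('r4', 9.5)]
for kind, cand, avg in explore(stream):
    print(kind, cand, avg)
# relaxed r1 15.0 / relaxed r2 12.0 / relaxed r3 13.0 / exact r4 9.5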
Again, the idea here is interactivity: we want to give users results fast. So if a user asks for average less than 10, we might be able to output results with averages of 15, 12, 13 at first. If there are results below 10 we will output them eventually, but at least at the beginning, if exact results are hard to find, we will output results that are close but don't quite satisfy the constraints. This is not an approximate answer in the sense of approximate values; rather, we might start with close rather than exact results, and quality will improve later. So it's similar to online aggregation, I guess, when you think about it. And this is it, I guess. That's all I have for today.
>>: So how do you compare your work with [inaudible] that has happened to combine
[inaudible] databases or like pushing [inaudible]? So do you have any thoughts about that?
>> Alexander Kalinin: I mean, our queries are a little bit different, but it's the same idea. We basically have some-
>>: But you're combining the constraint solver with the database, because constraint solvers cannot handle a lot of data and the database cannot handle constraints, right?
>> Alexander Kalinin: Correct.
>>: [inaudible] plus Hadoop or DBMS, those also try to address a similar space, because [inaudible] everything are very rich [inaudible] operators where they're [inaudible] data. So have you looked into that line of work to see [inaudible] to this approach, or have you considered pushing your constraints into the [inaudible] database as well?
>> Alexander Kalinin: In this framework we haven't looked closely at those types of integration versus our integration yet. But I want to point out that I don't think we can actually express such queries with those approaches, so I guess we are going in a similar direction of combining these things. And about putting the solver there: the solver in Searchlight actually works inside the database, because SciDB allows users to define user-defined operators which are part of the query plan. We don't perform any specific optimization, but it still sits inside: we don't serialize data, we don't move data around, so it sits really close to the data. It's actually lightweight in some sense, because it uses the SciDB infrastructure to do everything other than the search itself, data management, networking, and so on. So we are also moving in the same direction of getting close to the database and trying to combine them.
And one piece of ongoing work, which I probably won't be able to finish, is to think about how we could create multiple search operators inside the query plan. Right now, if you have a search query we just create a single search operator; maybe we could divide the search query into multiple operators that work together, or push some of the constraints inside the database to create more elaborate query processing trees, but we haven't moved to that yet.
>>: What about the limitations of the queries? What types of queries does your system [inaudible] handle? As far as I understood, we are kind of looking for some area of objects which [inaudible] the [inaudible] function. Is it limited to that, or can we, for instance, find the brightest [inaudible], or can we add a pair of ranges with some [inaudible] function?
>> Alexander Kalinin: You can definitely find the star, for instance, because you can describe it as a couple of decision variables. I do want to point out that for point queries like finding a bright star it is actually possible to build an index, because you can index every possible object in the database; the number of objects in the database is much smaller than the number of all possible regions over those objects. Our system can handle that, it's just that better solutions might be available for such point queries. Again, the star has coordinates, so you can describe it via decision variables, and then you have a constraint for brightness, a constraint for its attributes, so it can easily be handled.
As for pairs of regions, you can do this. It might not be as efficient as we would like, but you can. If you want to find, for example, two regions that are similar to each other, like the queries I talked about, say two regions A and B where the difference between their brightness is small, you can describe it as a constraint program: you describe region A as one set of decision variables, region B as another set of decision variables, and it's easy to describe the constraint, the average of A minus the average of B is less than something. So you can easily combine them.
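To illustrate the formulation (not Searchlight's actual search), here is a naive brute-force version of that two-region query over a toy 1-D array: each region is a (start, length) pair of decision variables, region A carries a brightness-range constraint, and the two regions are linked by a bound on the difference of their averages. A real CP solver would prune this search rather than enumerate it.

from itertools import product

data = [3, 9, 8, 2, 7, 7, 1, 8, 9, 2, 6, 7]     # toy "brightness" values
LENS = range(2, 5)                               # allowed region lengths

def avg(lo, ln):
    return sum(data[lo:lo + ln]) / ln

matches = []
for a_lo, a_ln, b_lo, b_ln in product(range(len(data)), LENS, range(len(data)), LENS):
    if a_lo + a_ln > len(data) or b_lo + b_ln > len(data):
        continue                                 # stay inside the array
    if b_lo < a_lo + a_ln:                       # keep A strictly before B
        continue
    if not (6 <= avg(a_lo, a_ln) <= 9):          # brightness constraint on A
        continue
    if abs(avg(a_lo, a_ln) - avg(b_lo, b_ln)) > 0.5:   # the "linking" constraint
        continue
    matches.append(((a_lo, a_ln), (b_lo, b_ln)))

print(matches[:3])                               # first few (A, B) pairs found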
So to answer your first question about limitations: we don't have many limitations, we are just less efficient for these types of queries where you have multiple objects flying around. The only real limitation is that we need to know what kinds of functions you're going to use. If you want to use average brightness, for instance, we can combine such functions in pretty elaborate ways. So if you want to find two regions with some ranges of values, it can definitely be done. If it can be expressed as a bunch of decision variables, constraints, and these kinds of average functions, or other types of functions, we can handle that.
>>: And how does your system scale? Have you thought about that? As far as I understood, the synopsis should be more than [inaudible]. And how about the size of your database? Should it be part of every node? Should it be stored [inaudible] locally, or [inaudible]?
>> Alexander Kalinin: For the synopsis, it works best if you have it in memory. The idea behind the synopsis is that it is much smaller than the data, so we will still benefit either way; if it fits in memory, that's really cool, because we get the most benefit. It doesn't have to fit, but in that case you might have some movement of data, because we need some parts of the synopsis at one point and other parts at another point, as usual in databases, right? But for now we assume it fits in memory and is basically replicated.
As for the original data size, it doesn't matter, because when we validate candidates, again, it's all about validating candidates, right? I see you're surprised, I guess. What I mean is that if you have a lot of data, we assume we can throw a bunch of nodes at it; you can distribute it across different nodes inside a SciDB cluster, and that's basically it. You have a [indiscernible] of candidate solutions which you need to validate, and you just stream them across all these different data partitions. So the problem is no different from the problem of evaluating traditional aggregate queries across large data, in some sense. We don't do anything special about it; if the system can handle aggregate queries efficiently across large data, we definitely benefit from it.