>> Sudipta Sengupta: Thank you all for coming. I'm happy to introduce Emad Soroush from
University of Washington. He is visiting us today and tomorrow. His research interests are in
the area of scientific data management, specifically array processing and query optimization, and
today he will talk about his work on data storage and processing in the SciDB Parallel Array
Engine. Over to Emad.
>> Emad Soroush: Thank you. Hello, everyone. So I want to talk about data storage and processing in the SciDB Parallel Array Engine. The demand for large-scale analytics is increasing in the enterprise domain, and I'm sure I don't need to convince you in that regard. Today, I want to talk about big data in science. There are a lot of scientific applications from various disciplines that are also dealing with large-scale data analytics problems. We have an example from the Large Synoptic Survey Telescope, which is going to be producing tens of terabytes of data per night, and also the National Oceanic and Atmospheric Administration, NOAA, which is managing over 30 petabytes of new data per year. And it's not just the image processing domain that is generating lots of images. We also have examples from simulation data sets -- the universe simulations that model the cosmic structure of the universe from 100K years after the Big Bang until today are also generating a lot of simulated data. By the way, you can interrupt me any time during the talk. So we have the big data problem in science, as well. One of our observations is that many of those scientific data sets are naturally modeled as multidimensional arrays. Here, we have an example from the LSST telescope, which generates pixel images of the sky. Each of those pixel images is a 2D array with dimensions from the sky coordinates, and they stack all of those images together to generate a big 3D array, where the third dimension is the time dimension. The other example is the Global Forecast Simulation data sets, the GFS data sets, which are also represented as 2D arrays, and each cell in this 2D array has measurements such as temperature, humidity, snow cover, etc. The other example is the astronomy universe simulation data sets, where each snapshot represents the universe as a set of particles in a 3D space. So these are the examples that we have from the scientific community. And it's not just that the scientific data sets are represented by arrays; they also do array-specific operations on those data sets. Let me show you with a demo how scientists work with
these array data sets. Okay, let's switch off the presentation. So I will demo a framework that we built at the University of Washington with our collaborators. I wanted to show you a demo that runs on 1 terabyte of data over 10 nodes in our cluster, but unfortunately, there's a proxy in our network and it doesn't show you the [indiscernible]. So the demo that I want to show you right now is the one that is running on my computer. It has about 200 gigabytes of data, but it has the same functionality as you would have seen in the distributed demo. So ASCOT is the system that we built for our astronomy collaborators. It is the data exploration and data analytics system for astronomy data. This ASCOT system is connected to our array database engine that has big array data stored in it, so I run ASCOT. Now, imagine I'm a
scientific user. What I want to do is fetch some interesting regions in the sky. So here, these are all the tables that we have in this database. I pick one of those tables, and then I pick a bounding box. This bounding box is represented as sky coordinates; it's going to be automatically translated to the pixel coordinates in our back-end engine. Then, this is the query that is going to be generated by playing with this form that we provide to the user. So this is the SQL-like query that we generated, and we issue it to our array engine. When we issue the query, it fetches the region in the sky that is associated with these coordinates. So this is the
result that we get. Then this result is shown to the user in a specific file format called the FITS file format, and they have the ability to explore this result. Now that I've fetched the result, for
example, I'm interested in -- I'm interested to fetch this part of the result. I can grab this part. I
grab it. I can grab this part and issue the query again. Okay, let me show you. So one of the
things I can do with this region is to generate a query for that region, associated to that region
and then submit this query to generate the time series aggregation result for that particular
region. So each of the points in that -- they call it a light curve, or the time-series data -- represents the aggregation result for each of the timestamps of this image. And remember that this is a 3D image at the back end; we co-added the whole stack of images together and represented it as a 2D image. Now, what I can do is pick some of the points that are associated with each of those timestamps in my back-end engine, grab this query here, and run it again. Now, the next result shows me the region that I selected in the original image, and it returns only the part of the region that is associated with these timestamps. So this is a kind of slicing and dicing query, but it's not just that. They also want to run some more complex queries. For example, they are interested in running an iterative computation called the sigma-clipping algorithm, which is essentially a data-cleaning algorithm. Now, I pick the number of iterations.
I just pick number of iterations equal to two, just for the demo purpose. Now, if I run the query,
you notice that it takes longer.
>>: A question. Without the [indiscernible] here, you should have multiple images, right? Can
you switch between images?
>> Emad Soroush: Which one?
>>: I think when you get this time series request, you are getting something like eight images.
>> Emad Soroush: Yes, it's eight images. If you see here, we are doing the summation over all of those images, grouped by the row and column. So the result that you see here is the co-added image. It's the summation of the results. Oops. Let me pick this.
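A minimal NumPy sketch of what the co-add just described computes conceptually, assuming the image stack is a 3D array with time as the first dimension (the array name and shape are made up for illustration):

import numpy as np

# stack of 8 images, indexed as (time, row, column)
stack = np.random.random((8, 512, 512))

# co-added image: per-pixel summation over all timestamps,
# i.e. a sum grouped by row and column
coadd = stack.sum(axis=0)          # shape (512, 512)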
>>: This iterative query, can you explain a little bit what are the operations?
>> Emad Soroush: Yes, throughout the slides, throughout the talk, I will talk about this iterative
query. I just want to show you how it works. So here I restart the iterative query, because the
parameters were wrong, but it's running for two iterations, and the result that you will see is the
clean image.
>>: Some kind of noise reduction on this?
>> Emad Soroush: Yes, you see the noise that exists here. Not all of them are actual objects in the sky. Some of them come from the limitations that we have in our telescopes. So when the query finishes -- it's not finishing. Come on. Actually, you notice that it's taking much longer. But if you now compare the two results, those noises don't exist anymore. You have some noise here that doesn't exist in the filtered image, in the cleaned image. So these are the typical functions, the operators, that are run by our astronomer collaborators. So let's go back to the -- do you have a question about the demo, or do you want to see other functionalities? Sir?
>>: How much data are we talking about? What class of machine are you running this on?
>> Emad Soroush: So the demo that I'm running right now is on my laptop. It's running on 100 gigabytes of data, but the actual demo that I had is running on 1 terabyte of data over 10 machines, so here I showed the local version, because the distributed version doesn't show the result for some reason.
>>: So there is a firewall or something. He can't connect to his stuff. He wanted to show that.
>>: So for the 1 TB data over the cluster, what's the usual latency you experience?
>> Emad Soroush: For 1 terabyte of data over the cluster, what's the usual latency that we observe? It's even faster than what you see here. For the query that we ran, actually, I get the result here. So the result you see here is six seconds for this one. It's in the same order. But it's
not just the visual interface that we have. We also have provided for the astronomers the
programmatic interface so that they can connect to our back-end engine with the IPython shell
interface. So they can do the same kind of stuff. They can load the ASCOT here in the IPython
interface and I can run it, because I think we have time. So let's run it. I can do the same process
here. And then do the same thing, so here we have embedded visual interface, because it's much
easier to connect to the IPython when it's embedded, when we have embedded objects, but
nothing can prevent us from having two separate processes. Then we can select it -- so we have
multiple ways to select the region, but I can -- in our back-end engine, we only support the
rectangular regions. So we select this region, then we generate the query. Now, notice that the
user wants to switch between the visual interface and the programming interface. They want to
work on the NumPy array data they have, and then they switch to the visual interface, do some
[indiscernible] and then back again to the programming interface. So here, we provide this
functionality by pushing the data into our IPython notebook. So I generate this summary
data, and then I can push it to the IPython notebook, and it automatically brings me the data here.
Let me push it again, see what's the -- oh, I need to add the Python library first and then try to
run it. So you see that the data here shows up in the IPython interface, as well. Then I can -- we also provide the functionality for the user to connect to our back-end engine and sync their NumPy array with the original array that we have at the back end. These commands that I provided here -- can you see the commands, or do I need to make it bigger? For example, here, I connect to the engine that we have, called SciDB. I connect to that, and then in this example we create a NumPy array, a random array. We fill it with random values. Then we generate the same array in SciDB, in our back-end engine. Now, you see two arrays are created, and the two arrays have the same values. So here, we connect to SciDB, our back-end engine, we create the NumPy arrays in our IPython interface, and then we push the data into SciDB. We sync the data with our back-end engine, and then I just print out two sets of values, one from the back-end engine, one from the NumPy array, and you see that they are synced. So we provide this functionality for the user, as well.
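A hedged sketch of the round trip just demonstrated. The call names follow the SciDB-Py wrapper's documented interface (connect, from_array, toarray), but treat them as assumptions, since the exact API depends on the SciDB-Py version; the endpoint URL is made up.

import numpy as np
from scidbpy import connect               # assumption: SciDB-Py wrapper is installed

sdb = connect('http://localhost:8080')    # hypothetical SciDB HTTP (shim) endpoint

local = np.random.random((5, 5))          # NumPy array created in the IPython front end
remote = sdb.from_array(local)            # push it into the SciDB back end

synced = remote.toarray()                 # pull the back-end copy back as NumPy
print(np.allclose(local, synced))         # the two arrays hold the same values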
>>: Can you clarify a little bit? Are you showing pushing the data to the array?
>> Emad Soroush: So what I wanted to show was the user has the flexibility to work with the
Python interface, with the IPython interface, create some arrays, fill it out with some values and
then, in order to do further computation, they can create the similar array that is synced with their
NumPy array in the back end and then do the computation there at the SciDB. So they have the
flexibility to run the computation either in the front end, with their IPython interface, or they
push the computation at the back end. And also, they can switch back and forth between this
programmatic interface and the visual interface; as I showed you, the summary data can go
back and forth between the ASCOT gadgets and the IPython interface. So let's go to the
presentation. So in the demo, I showed you some array-specific operations, some operations that
scientists do on the array data. So one of the problems that we observe is that none of the existing engines support many of the features that are required by the scientists. Specifically, we need array-oriented data models, and the scientists need to have append-only storage with support for provenance, lineage and time travel. They need to have support for a suite of data processing operators and first-class support for UDFs, user-defined functions. So none of the engines today has all of these features that are required by scientific applications. So the question that we ask is, what would be a good engine for big data processing in science? SciDB, the back-end engine that I just demoed to you a few minutes ago, is a potential answer. SciDB is a new type of open-source database system that has inherent support for multidimensional arrays. This is one of the typical arrays that we store in SciDB. In this example, we have a 2D array with the i and j dimensions, and with two double attributes, v1 and v2. So in SciDB, if some of the cells in the array are empty, then we call it a sparse array. If all the cells are non-empty, then we call it a dense array. SciDB has support for both types of arrays and array processing. And it has been shown that it can outperform relational databases in the context of scientific applications.
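A hedged example of how such an array might be declared. The statement follows SciDB's documented AQL CREATE ARRAY syntax (attributes in angle brackets, dimensions with a chunk length and overlap), but the array name, extents and chunk sizes here are made up; in the demo setup it could be submitted through the Python wrapper or the iquery client.

# AQL schema for a 2D array with dimensions i, j and two double attributes v1, v2.
# Dimension spec: name = start:end, chunk_length, chunk_overlap
create_example = (
    "CREATE ARRAY example "
    "<v1: double, v2: double> "
    "[i=0:999,100,0, j=0:999,100,0]"
)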
The SciDB architecture is pretty much similar to what you see in relational databases. We have a front end that has support for a declarative language that we call AQL, the array query language, and also support for a functional language that we call AFL. The query that is submitted by the user goes to the runtime supervisor, and this runtime supervisor translates the raw query into a logical query plan. Then we deliver it to the optimizer. The optimizer generates the physical query plan. Then it's distributed over all the nodes to be locally executed, and the result comes back to the coordinator and is sent back to the client.
So my thesis is about data processing in parallel array engines. I'm specifically building new tools to facilitate massive-scale array data analytics. The most recent project that I did during my PhD was called AscotDB. That's the system that I just demoed to you. The reason that we built AscotDB was that we wanted to test and verify our new approaches with a real framework that has real users and runs over real workloads. So we built AscotDB. That's the data analysis and exploration platform for our astronomer collaborators over the astronomy data.
>>: Can you clarify, for the SciDB and SciDB Python, are they already there, or you built it
also?
>> Emad Soroush: Yes, I will clarify it.
>>: Okay.
>> Emad Soroush: So the AscotDB -- I will clarify it in a minute. The AscotDB has three layers. At the front end, there is a visual front end, and also there is the IPython front end. At the middleware, we have SciDB-Py, which is the Python wrapper around our back-end engine, and at the back end, there's the SciDB back-end engine. Most of my contribution during the PhD was at the back-end engine, at SciDB. So we built the whole end-to-end solution for our astronomer collaborators, but my PhD dissertation is about this SciDB back-end engine and what kind of contributions we made at the back end. I was on the team that designed and implemented and demoed SciDB for the first time at VLDB 2009. So SciDB didn't exist before; we designed and implemented it. But SciDB-Py, this Python wrapper, is written by one of our collaborators around SciDB.
>>: So the SciDB engine itself, what language are you using to write the SciDB engine? I see
the wrapper, which is a SciDB python.
>> Emad Soroush: So there are multiple interfaces to talk with SciDB. There is a SQL-like declarative language that we can use to issue queries directly to SciDB. There is a functional language, and also there are programming languages -- there are programming interfaces to connect to SciDB. One of them is the Python interface, to write programs in Python and then issue them to SciDB.
>>: Original SciDB is written in C?
>> Emad Soroush: Yes, C++ -- C and C++. So there are three major components, three major projects that I did during the PhD. The first one was ArrayStore, where we studied efficient storage management mechanisms to store arrays on disk. The second one is TimeArr, where I studied efficient support for updates and data versioning on arrays, and the last one is ArrayLoop, where we studied native array support for efficient iterative computations. So during the talk, I will briefly talk about each of those projects that I did in SciDB. So let's talk about ArrayStore, what we did in ArrayStore. The question that we asked was, what storage manager is best suited for systems such as SciDB? The results of the ArrayStore project inspired the SciDB team, and the new storage manager that they implemented in SciDB is inspired by this project. This project was done in 2010. The assumptions that we had were that we have multidimensional arrays; we want to have support for dense and sparse arrays; we have a wide range of operations -- scan, filter, slicing, join operations and user-defined functions -- and we want to have parallel array processing, so we want to run the queries distributed. So we designed the ArrayStore system, and the result is published in SIGMOD
2011. So one of the cool ideas -- I want to talk about one of the cool ideas in the ArrayStore project, which is called overlap processing. So imagine that this is the array that we have in our engine, and we want to run some clustering algorithm in order to detect those non-empty cells that are identified by the black color here. So first we distribute it over multiple nodes. Now, if we run our clustering algorithm, then we find complete results in some of the chunks, in some of the partitions -- also, I should note that when we divide the array into multiple sub-arrays, each of those sub-arrays is called a chunk. So we find complete results in some of the chunks, but in other chunks, we find partial results, because some of the clusters go across the chunks. So what we typically do is go into a post-processing phase in order to merge those results together. But we can do better. Instead of dividing the array into multiple disjoint sub-arrays, we can create a set of chunks that are interleaved with each other. So some of the cells that are at the boundaries now belong to multiple chunks. Now, if we run our clustering algorithm, we can find complete results
in each chunk locally. But it's not that easy. There are a lot of challenges with having those overlap layers. First of all, overlap is very expensive in terms of CPU and IO, and it's hard to find the right size of the overlap for a general workload. We may find the right size of overlap for a specific workload, but it doesn't work for all of them. And also, notice that we may find duplicate results, so we should have a deduplication algorithm to remove them. So in the ArrayStore paper, we addressed those challenges. Let me tell you briefly about one of the techniques that we used in the ArrayStore paper. First, we store the overlap data separately from the core data. Because we store them separately, we can have a different schema for the overlap data. In this example, we have centroids for each of the clusters, and we want to find all of the points that belong to each centroid, so one of the typical algorithms that our scientific collaborators run is called the volume density application. They define a sphere around the centroid and then grow this sphere until a threshold is met. When the sphere around the centroid goes beyond the chunk, we need to load the overlap data. With our technique, we came up with an onion-skin schema for the overlap data: we store the overlap data at a finer granularity, and as we are growing those spheres, we only fetch the overlap data on demand, at runtime.
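A minimal sketch of overlap (interleaved) chunking on a 2D array, assuming in-memory NumPy arrays: each chunk is extended by a fixed overlap margin on every side, so a clustering step that only needs nearby cells can finish locally inside the chunk. Chunk and overlap sizes are made up.

import numpy as np

def chunks_with_overlap(arr, chunk, overlap):
    # Yield ((i0, j0), sub_array) pairs: the core chunk starts at (i0, j0) and the
    # sub-array extends `overlap` cells beyond it on every side, clamped at the border.
    for i0 in range(0, arr.shape[0], chunk):
        for j0 in range(0, arr.shape[1], chunk):
            i_lo, i_hi = max(i0 - overlap, 0), min(i0 + chunk + overlap, arr.shape[0])
            j_lo, j_hi = max(j0 - overlap, 0), min(j0 + chunk + overlap, arr.shape[1])
            yield (i0, j0), arr[i_lo:i_hi, j_lo:j_hi]

cells = np.random.random((1000, 1000)) > 0.95     # sparse non-empty cells
for origin, sub in chunks_with_overlap(cells, chunk=250, overlap=10):
    pass   # run the clustering algorithm locally on `sub`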
>>: Can you describe a little bit about the size of the array chunks, the core data?
>> Emad Soroush: So that was one of the other aspects that we studied in the ArrayStore paper. The chunk sizes are on the order of megabytes.
>>: The core data is about a megabyte?
>> Emad Soroush: It's on the order of megabytes. But it's a tradeoff. You can come up with a 100-megabyte chunk, but when you are running your algorithm, then you are fetching a lot of unnecessary data. Or you can come up with very small chunks, but then you have overhead in terms of the [indiscernible] time. So that was the tradeoff that we studied in the ArrayStore paper, as well. I would say that the typical size for the chunks is on the order of a few megabytes.
>>: And that applies for both dense array and sparse array?
>> Emad Soroush: That's the choice of the user. I mean, not the choice of the user, but we have
support for both types of arrays. The system decides if it wants to store the data as a sparse array
or as a dense array.
>>: So if the system decides to use the sparse array format, are you still going to have a regular tile size? Or are you going to vary the tile size?
>> Emad Soroush: SciDB has a regular tile size right now, but that's what we studied also: should we have a storage manager that supports only regular tiling or regular chunking, or can we come up with irregular chunking, as well? SciDB decided to go with regular chunking, but we have multiple levels of chunking here. The outer level is called the chunk and the inner level is called the tile, so the outer level is the unit of IO, and the inner level is the unit of processing. The array operations in SciDB are not at the cell level; they are at the chunk level, so we get chunks as input and we produce chunks. But what we propose in our paper is to have two levels of chunking, such that we do the IO operations at the level of the outer chunks and we do the processing at the level of the inner chunks, the tiles, which are smaller. In this way, we can alleviate the overhead that we observe if we have sparse arrays and we are using regular chunking. Did that answer your question?
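A small sketch of the two-level idea just described, assuming a chunk is a NumPy array already read from disk: IO happens once per chunk, while processing walks over the smaller tiles inside it. All sizes are made up.

import numpy as np

def tiles(chunk, tile_h, tile_w):
    # Iterate over the tiles of an in-memory chunk (the unit of processing).
    for i0 in range(0, chunk.shape[0], tile_h):
        for j0 in range(0, chunk.shape[1], tile_w):
            yield chunk[i0:i0 + tile_h, j0:j0 + tile_w]

chunk = np.random.random((1000, 1000))                      # one chunk, the unit of IO
partial_sums = [t.sum() for t in tiles(chunk, 100, 100)]    # process tile by tile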
>>: I think part. I have another question.
>> Emad Soroush: Go ahead.
>>: So looking at scientific data as signal processing, a common technique when you deal with this kind of data is building a pyramid. What that means is you have different resolutions of the data. Have you thought about that, and are the astronomy algorithms friendly to such data sets?
>> Emad Soroush: So the new proposal for the SciDB storage manager is the [indiscernible]
tree style for the storage, so it's similar to the pyramid. This is a similar proposal to the pyramid
strategy in terms of the fact that we have multiple levels of the data and then we can have -- the
advantage is that we have better strategy to handle the skew when we have multiple levels. But
the current SciDB is using this scheme. We studied these multiple levels of granularity of the data sets for the iterative computation. I will talk about it later, that we have
multiple levels of the data, and then we run the iterative computation on the coarser granularity
first, and then the result of this -- the result of running the computation at the low resolution is
pushed back to the higher resolution, and we leverage those previous results in order to run the
query faster. So we leverage it in the query-processing layer, not in the storage manager layer.
This is one of the results that we published in the ArrayStore paper, and we showed that we can gain 2X performance if we use our overlap processing techniques compared to the naive techniques. Let me go to the -- so in summary, in the ArrayStore paper, we studied how to chunk the data in one node and how to partition and distribute those chunks across many nodes. We also studied the kind of computation where, in order to compute one single cell, we need to have access to the adjacent cells -- the overlap processing. We also studied how to store the overlap data, what's the right storage layout for the overlap data. Go ahead.
>>: Do you support any kind of indexing on these, or is it just [indiscernible] that you have to
do?
>> Emad Soroush: I didn't talk about how we chunk the data and how we distribute the chunks, but we are supporting regular chunking. Regular chunking means that all the chunks are equal in terms of the coordinate space that they cover. Because we are using regular chunking, we have inherent support to locate each chunk, to locate which chunk each cell is in. So for a given coordinate, we can mathematically identify where the cell is located. And this is one of the advantages of using array databases compared to relational ones, because we have inherent support for some kind of indexing here. All the dimensions here are working as an index for us.
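A minimal sketch of the arithmetic just mentioned, assuming regular chunking with a fixed chunk shape: the chunk that contains a cell is computed directly from the coordinates, with no separate index structure.

def chunk_of(cell, chunk_shape):
    # Map a cell coordinate to the coordinate of its containing chunk.
    return tuple(c // s for c, s in zip(cell, chunk_shape))

# e.g. cell (1234, 87) with 1000 x 100 chunks lives in chunk (1, 0)
print(chunk_of((1234, 87), (1000, 100)))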
>>: Yes, but that's just one sort of structure, one property on which you are doing the
organization of your data. Now, if you want to have an index that indexes some other interesting
properties of your data, do you support that?
>> Emad Soroush: So the array processing is promising when we are running the queries that
are hitting the dimensions of the data. If in our query predicates, we are hitting the attributes, the
cell attributes in the data, then we are in the same world as the relational databases. We can
build the indexes on top of that, but there's no advantage of using the arrays compared to the
non-array systems. So the indexing is a complementary approach. We can also add the indexing
for the array system for the attributes inside the cell, but it's the same as the relational databases.
>>: So tell me if this paraphrase is correct. So if I understand what you're saying, what you're
saying is, if you find yourself in a situation where you're digging into cells and asking a question
like, find me all cells where they contain data that looks like this, that's probably not the type of
application where the array representation is helpful?
>> Emad Soroush: Either this is what you said -- that this is the wrong application -- or we are coming up with the wrong schema. We need to partition the array such that our attributes are turned into dimensions. So we either need to change the schema such that the attributes are now dimensions of the array -- and probably that's not going to work, because the way that we come up with a schema is the natural representation of that data set -- or this is the wrong application and we need to run it on other engines.
>>: That's not the point. The point of having these kind of secondary indexes is to support other
types of applications, as well, that require different types of indexes.
>> Emad Soroush: Of course, we can have secondary indexes, but I'm just saying that we don't
have any advantage when we have secondary indexes on the array systems. We can run it on the
other data sets, as well.
>>: True, but your data is sitting inside SciDB.
>> Emad Soroush: Yes, that's right. So I can imagine a scenario that we have a very broad set
of workloads that some of them are helpful to run it on the SciDB and some of them are not. So
if we have a secondary index, then we can run efficiently those kind of queries that are not
supported very well with the native SciDB engine.
>>: In some sense, I think that you're saying that there could be multiple storage representations
of your data, depending on the type of activity that you're doing.
>> Emad Soroush: We can have a replication of the data set, that one of them has a different
storage schema than the other one. We can have this kind of scenario, as well.
>>: Because you can imagine adding a support for these kinds of array data types, as almost like
a way of representing -- you can almost regard these array representations as indices over data, if
you wanted to.
>> Emad Soroush: Yes, that's actually a good point, as well. So each of those array
representations works as indices. But, yes, that's right. That's one way to do that, but the other
way is to bring the notion of the indexing into the SciDB array engine, such that the user can
index the data without changing the representation. That's another way to solve that problem.
>>: But are there any comparisons of storing the indices versus how you store it in the engine?
>> Emad Soroush: So it's not done by me, but by our collaborators. So it's shown that if we
simulate the array databases by building indices on top of the relational databases, then we lose
performance. Because here we are building everything from the scratch by this assumption that
we are running array-specific operators on these data sets. So the comparison, the experiments
are run on the scientific applications.
>>: It's not a weird thought to think that the array representations are good for query workloads
that involve a lot of array operations.
>> Emad Soroush: Yes, and because we have many array operations in the scientific domain,
we felt that there is a need for an engine that has good support for the array operations.
>>: Well, if you had found that you couldn't do processing more efficient than in a relational
database, you wouldn't be here talking.
>> Emad Soroush: Yes.
>>: I have an implementation question. So how does the SciDB back end cope with multiple
requests? Are you handling requests in a synchronous fashion or asynchronous?
>> Emad Soroush: Yes, they handle concurrent requests, as well, as far as I know.
>>: What do you mean by as far as you know?
>> Emad Soroush: Because that's not the part that I implemented, but we can run multiple
requests on the same data sets, but there is not good support for the concurrency control in the
SciDB right now.
>>: Concurrency control means compared to what?
>> Emad Soroush: If there are multiple requests that are acting on the same data set, the same region in the data set, we have locking and concurrency control, but it supports only up to 10 queries concurrently. It doesn't scale.
>>: So you are saying it doesn't scale in the sense of supporting concurrency --
>> Emad Soroush: The system is not designed for the OLTP workload right now. It's more
targeting the OLAP workload.
>>: For the OLAP workload, how is concurrent execution being handled, meaning, for example,
let's say I issue a computation, which needs to be executed on very many different regions, how
does that get parallelized?
>> Emad Soroush: Oh, it's going to parallelize by partitioning the data over many nodes.
>>: Do you rely on Python to parallelize it?
>> Emad Soroush: Sorry?
>>: Are you relying on Python to parallelize that?
>> Emad Soroush: No, it's built from scratch. We built this distributed framework where we partition the data with the C++ code that is written in SciDB, so this Python is just the wrapper around SciDB in order to talk with SciDB. SciDB has the back-end engine itself, and there's the query interface where you can create an array and then tell the SciDB
engine how to partition this array. Do you want to do a range partitioning, do you want to do
random partitioning and so on and so forth, so that when you create an array, you define what's
the chunk size of this array. And this chunk is going to be distributed over as many nodes as you
have.
>>: But when you execute, let's say each node is itself assigned, let's say, 100 regions to be
executed. Are those regions executed in a concurrent fashion?
>> Emad Soroush: Yes. They are run in the concurrent fashion, so the runtime supervisor gets
the query, and then create a query plan. This query plan is sent to all of the nodes, and when the
local executor looks at the query plan, it knows that it needs to run this query over this array, and
if it has, and if the chunks that it has in its local node overlap with the region that we are
interested, then it would run the query on those chunks and then send back the result to the
coordinator, and the coordinator would aggregate the result and send it back to the client.
>>: So you would launch as many threads as needed to run those queries concurrently?
>> Emad Soroush: Yes. We can have a multicore system in one single physical node, but also
we can have distributed system over many physical nodes, as well, so we have support for both
of them. Is that good? So the second system that I want to talk about is the efficient support for
updates and data versioning. So what we observed is that the scientific data sets that we have are not just one single snapshot. Actually, there are multiple snapshots of the same array that are appended to the original version. So the question that we asked in this project was how
we can store all the versions of the same data, how to store different versions of the same data
efficiently in terms of the storage overhead, and also how we can fetch older versions efficiently,
as well. So the requirement that we had from our scientific collaborators was that they want to
have access to all the versions, all the old versions of the data, so they don't want to throw away
anything. So the TimeArr is the storage manager that addresses those questions. So let me
describe the challenge that we have in the TimeArr project. Imagine this 2D space of solutions. In one dimension, we have the storage size. In the other dimension, we have the version fetch and insertion time. So if we just naively store all of the versions materialized, then we are at one extreme of this space. The good thing here is that we have random access to each version, but the bad thing is that we explode in terms of the storage size. On the other hand, we can come up with a very advanced compression algorithm like MPEG encoding and try to encode all of those versions compactly. Then the good thing here is that we are very compact in terms of the storage size, but it's very expensive in terms of the version insertion and version fetch time. Most of these advanced encoding algorithms are offline algorithms. So the TimeArr project resides somewhere in the middle of this space. So let me tell
you what kind of approach we choose in this project. So there are three main techniques. Go
ahead.
>>: Could you tell us maybe just very quickly what would be the killer app for what you're
targeting here? Because when you said time series, the first thing I thought of was, oh, things
you might do an FFT to, a fast Fourier transform. But then when you say versioning, I start
thinking about databases and transactions and versions of things and all of that. Could you
clarify a little bit?
>> Emad Soroush: Sure. So the kind of queries that we expect the user to ask over the TimeArr
is the queries that the user wants to fetch a particular version in the set of versions that we have,
and they are typically -- they are asking for the most of some version and also for the very old
versions, and they want to quickly identify which of those various are similar to each other and if
there is any pattern between those versions. So they want to do some kind of exploration and
then analysis on the versions that they're interested to work on. So, first, they want to get some
exploratory view over the whole data, because they don't know which version they want to work
on, and then after they identify a set of versions that they are more interested in than the others,
they want to do actual array analysis on top of that.
>>: I see, where each version is an array.
>> Emad Soroush: Yes. Each version is an array here. So you can think of it like an extra
dimension here is defined as a time dimension or the version dimensions. So if you have a 2D
image and we stack all of the image together, then the third dimension is the version dimension.
>>: Then probably something like signal processing wouldn't fit naturally into --
>> Emad Soroush: Fast Fourier transform is not -- we haven't thought about it, so I would say it's not naturally a good fit for this. The three main techniques that we studied were delta encoding, bitmasking and also tiling. So the contribution that we had here in this project was the clever use of the combination of those techniques, which can achieve high compression and fast version retrieval. So let me describe to you how the delta encoding
works. In the left side, I'll show you the technique, and the right side -- in the right side, I'm
showing you what's the result of using this technique with 60 versions of the GFS, Global
Forecast Simulation data sets. That's the macro benchmark. So here we have three versions, and
all of them are materialized. Then, if we run it over the GFS data sets, over 60 versions of the
GFS data sets, we have 65.6 megabytes of overhead. Now, we can do better: we materialize the most recent version, and for the older versions, we only store the cell values that are different from the next immediate version. So here, we can reduce the storage size to 14 megabytes, but we can even do better and --
>>: So are these deltas always relative to the latest version or to the next version?
>> Emad Soroush: No, to the next immediate version. But we can do even better. So instead of
storing the cell values that are different, we can store the difference between the cell values that
are different. This technique is called backward delta encoding, where we store the most recent version materialized, and the rest of the versions are stored as delta versions. This is the technique that is currently used in SciDB -- go ahead?
>>: Can you say, when a new version arrives and a cell doesn't change, does it need to be copied over?
>> Emad Soroush: When the new version arrives, we create a delta between the current latest
version and the new version. So the older delta versions are not changed anymore.
>>: The older version is still there, right?
>> Emad Soroush: Yes, it's still there. So the backward delta encoding is the technique that is
used in the SciDB, and this technique is well known in the literature. So the question that we
asked is how to use the backward delta idea in the context of the arrays, what we can do better.
So here you see the drawback. What's the drawback? If we want to retrieve the older versions,
we need to apply, add of those delta versions into the most reason version in order to fetch the
target version. And it's very expensive. We need to do a lot of computation to retrieve a
particular version. So how we can relieve this overhead? So the next technique that we add to
the backward delta encoding is called bitmasking. So here, this is the same example as the
previous slide. The physical way that we have stored the delta arrays is the pair of the bitmask
array plus the list of the delta values. The bitmask array has the same schema as the delta array.
The only difference is that the cell values are binary, and each cell value identifies if we have
observed any change in the delta array. So by these techniques, we are kind of separating the
structure of the array from the content of the array. So we combine this technique, these bitmask
arrays, with the tiling in order to achieve performance. I'll show you how we can do that. So the
tiling -- when I say tiling, it means that we are dividing the delta arrays into finer-grained units that are called tiles, and these tiles have separate bitmasks plus delta values. So now that we have multiple levels -- tile and chunk and the delta arrays -- we can have multiple levels of bitmasking. The first level, the bitmask V1 in this example, is the tile bitmask, which shows us if we are observing any change in a particular tile. The second level is called the cell bitmask. It shows us if we are observing any change for the corresponding cell in a given tile. So by having a hierarchy of bitmask arrays, we can quickly bypass those regions, those tiles in the array, that are not observing any change. This gives us an advantage to quickly identify which regions of the array are observing changes.
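A minimal sketch of the bitmask representation just described, assuming dense NumPy delta arrays: the cell bitmask records where something changed, a packed list records what changed, and a tile-level bitmask on top lets whole unchanged tiles be skipped (here the array is assumed to divide evenly into t-by-t tiles).

import numpy as np

def to_bitmask(delta):
    mask = delta != 0                 # cell-level bitmask, same shape as the delta array
    values = delta[mask]              # packed list of the nonzero delta values
    return mask, values

def from_bitmask(mask, values, shape):
    delta = np.zeros(shape, dtype=values.dtype)
    delta[mask] = values
    return delta

def tile_bitmask(mask, t):
    # Tile-level bitmask: a tile whose bit is False saw no change and can be bypassed.
    h, w = mask.shape
    return mask.reshape(h // t, t, w // t, t).any(axis=(1, 3))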
>>: In the first part of the talk, you mentioned the chunk and tile. Is this tile the same thing,
here?
>> Emad Soroush: This is the same notion, but this tiling technique is only applied on the delta
versions, so this is the schema that I am describing here for the delta versions. Notice that we
have -- for the most recent version, we store it as a materialized array, and the materialized array
has whatever schema that SciDB is supporting. But for the delta version, we can come up with
our specific schema, such that it can boost our performance.
>>: What's the typical size of the tile? Is it variable or --
>> Emad Soroush: So I will say that for a given chunk, we usually divide it into 100 to 1,000 tiles.
>>: And is that data dependent or is it just --
>> Emad Soroush: It's data dependent. So that's a question that we also tried to answer, how the user should come up with the tiles. The tuning of this tile size is a challenging question, a different question that remains for future study.
>>: How many versions do people typically -- how many versions do you imagine people will
have, like thousands or tens?
>> Emad Soroush: So in the application that we have studied, it was from 10 versions for one
application, for the simultaneous simulation, 10 to 20 versions, and for the Global Forecast
Simulation, it goes about thousands of versions.
>>: So the bitmask is completely hosted in memory, or is this --
>> Emad Soroush: This is completely --
>>: It's completely stored in memory?
>> Emad Soroush: No. We materialize the bitmask. So the way that we store these delta arrays
is that we store those delta chunks with the materialized chunks together, so we have a unit that
we call a segment, and in each segment, we have a materialized chunk with all the delta chunks
that are created for that materialized one. And then when the segment is completely filled, we
create a new segment, and then we replicate this materialized chunk into the new segments, the
most recent version of the materialized chunk into new segments, and then we add the delta
chunks again. Did I?
>>: Yes. But I guess, for the kind of data size you're talking about, like 10 terabytes, it seems
like it would be [indiscernible] to keep all that information, the bitmask-level information, in
memory?
>> Emad Soroush: In memory?
>>: Yes.
>> Emad Soroush: These arrays are materialized, and then we need to track it later, so we need
to keep track of all the information that the user needs to ask later. We keep the data in the
memory as much as possible, and when the query is finished, then we materialize it into our delta
versioning schema. I didn't get what you mean by keeping those bitmasks in the memory. Of
course, when we are fetching the versions, we keep those bitmasks in the memory. We fetch
those segments, we keep it in the memory, and then we try to fetch the particular version.
>>: I guess what I was trying to get was to keep track of all this delta information, like how
much total data you would need, not the delta itself.
>> Emad Soroush: Let me tell you. There is possibility, there is interface, that we can only
fetch the bitmask arrays and use those bitmask arrays in order to explore the data. For example,
if you are trying to answer the counts query, we don't need to answer it over the actual data. We
can only retrieve the bitmask arrays and then count the number of the ones in the bitmasks. So
we can leverage those bitmask arrays in order to answer some of the queries, and we have the
capability to only fetch the bitmask arrays for each version. We use those kind of capabilities
more in the data exploration, and when the user makes sure which versions it wants to retrieve,
then we can send a query, particularly for that data that belongs to that version. That makes
sense?
>>: Do you get like thousands of versions, then do you have a concept of I-Frames so that you basically can --
>> Emad Soroush: So the I-Frame is used in the MPEG encoding, so we also studied -- this is
very much similar to the MPEG encoding up to now, up to this point that I described.
>>: MPEG-4 as a --
>> Emad Soroush: But the reason that we didn't use the MPEG encoding was that, in the MPEG
encoding, we also have this tiling that they divide each bigger frame to multiple smaller frames,
and for each region, those smaller frames, they try to find the similar regions around that that is
pretty much -- that has a lot of similarity with this particular region. So they do exhaustive searches in order to find those deltas. But here, we are just using sequential deltas.
>>: I'm just saying to speed up, let's say, retrieving 500 versions into the past, if you had I-Frames, then you could speed that up.
>> Emad Soroush: We have it also, because we have multiple segments of the same data, so in
each segment, we have one materialized version and the rest is delta, so we're also doing this
similar thing to the I-Frames, as well. I didn't mention it here, but when the segment is filled,
then we create a new materialized version, and then we add the deltas to that. So this is some
basic approach evaluation result. I think I need to go faster. This shows that if you are using the
virtual tiling, we can adapt ourselves to the portion of the chunks that is queried. So in the Xaxis, it shows how much of the chunk is queried. In the Y-axis, it shows the retrieval for the
oldest version. If you are not using the virtual ties, then we need to fetch the -- we observe the
same amount of time to retrieve any portion of the chunks, but if you are using the virtual ties,
obviously, we can do much better. And as we are retrieving a smaller and smaller region, we can
adapt ourselves to that regionally. Are you -- are all of you with me?
>>: I missed what you mean by virtual tiles, so I'm sort of lost.
>> Emad Soroush: So the tiling is the same technique that I described here, so in the next slide, I
just show you that the tiling actually works. So in the next evaluation result, the X-axis shows
the version retrieval time for each specific version. Here, we are running it on synthetic data sets where the updates follow a uniform distribution. And the takeaway from this slide is that we are still experiencing a lot of overhead in order to fetch the oldest versions. So what can we do to improve this overhead? What can we do to get better results? In the next few slides, I will describe what other optimizations we can apply in order to fetch the older versions more quickly. So one of the techniques that we described in this project is called skip links. Instead of only generating delta arrays between sequential versions, we try to find delta arrays between nonconsecutive versions that can boost our performance in terms of the version fetch time, and then we insert those skip links into our sequence of versions. So the challenge that we address with this optimization is how to explore the space of all the possible skip links.
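A hypothetical sketch of the content-based idea behind skip links, assuming versions are dense NumPy arrays: a skip link between two nonconsecutive versions is attractive when few cells differ between them. The exhaustive pairwise search shown here is exactly what the paper's heuristics avoid; it only illustrates the notion.

import numpy as np

def delta_size(a, b):
    # Number of cells that differ between two versions.
    return int(np.count_nonzero(a != b))

def candidate_skip_links(versions, threshold):
    # Return (i, j) pairs of nonconsecutive versions whose delta is small enough
    # that storing a direct delta (a skip link) would pay off.
    links = []
    for i in range(len(versions)):
        for j in range(i + 2, len(versions)):      # skip the immediate neighbor
            if delta_size(versions[i], versions[j]) <= threshold:
                links.append((i, j))
    return links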
>>: Can you clarify one thing? When you talk about a skip link and so on, are they uniform in
the sense they always start at the same point and then go back, or are they just opportunistic in
that you find regions that skip links are useful? Are they nicely aligned? Let's put it that way.
>> Emad Soroush: No, they are not. They are created based on the content, not based on the
structure of the versions. So I know that there is a notion of skip lists in the versioning system
that is used, for example, in the Git systems, but this is different from the skip lists. We call it
skip links, because it's created based on the content of the arrays. So the question that we ask is
how to explore the space of all the possible skip links. It's a very exhaustive search, so how we
can do it? How to identify good skip links? Are there skip links that are longer or better or not?
And then when we should trigger skip link search algorithm, at the version fetch time or at the
version creation time? How often we should do it? And also, at the version fetch time, if we are
faced with a skip link, should we use it? I mean, is there any deterministic algorithm that we can
use and make sure that all the versions are reachable? So those are the questions that we
answered in this paper, so please refer to the paper for more details. And here, in the evaluation
result, I'm showing you that even if we trigger the skip link search algorithm with the interval
equal to 20 -- every 20 versions that are added -- we can still get competitive results, compared to the case where we trigger the skip link search algorithm at every added version. Go ahead.
>>: So in this particular case, it seems like probably what changes from one version to the next
is more or less the same amount. Have you seen in applications that you've looked at a lot of
differences like applications where maybe a certain application changes a lot, but only that one
area versus, well, there's sort of a uniform distribution.
>> Emad Soroush: So here it doesn't show the change between the content. It shows how much time we take to retrieve those arrays.
>>: But you're using some model to generate the updates, right?
>> Emad Soroush: We had these assumptions that the change between the consecutive versions
are not a lot, so that's the assumption that we had in order to create the system.
>>: Are they correlated, or are they uniform?
>> Emad Soroush: They're not uniform, but we had the assumption that the change between two
consecutive versions is not a lot. We had this experiment that shows that, if, in our scientific
applications, if we create the delta versions between two nonconsecutive versions, then we
experience a lot of storage overhead, so we are not gaining anything by creating those deltas.
But we also had this observation that in the scientific application, there are often a pattern
between the versions. For example, in the Global Forecast Simulation data set, we observed that
the versions that are at particular time but at different date, then they probably have similar
content. If you notice that in the Global Forecast Simulation data set, we are measuring the
temperature, so the temperature at the particular time, but at different dates, they have similar
content. So we observe those kinds of opportunities to create skip links based on content of the
array, not based on structure of the array. Did I answer your question?
>>: I think partially. We'll talk more about it offline. I'm curious about this, actually.
>>: Because of the [indiscernible] actual user pattern and behaviors, so the skip link is designed
to speed up the create time, right? I'm a user, I'm an astronomer, and it seems that I'm interested
in certain areas. Do you then use that information to make better skip links?
>> Emad Soroush: So we didn't observe the behavior of the user in order to create the skip links
-- we did, sort of. We have two ways to create the skip links, at the version creation time or at
the version fetch time. So if we create the skip links at the version fetch time, then we are sort of
observing the behavior of the user, because notice that we are creating these skip links at the
granularity of the tiles or for small regions. So if the user is interested more on some region,
then we are creating more skip links for that. We call it lazy evaluation of the skip links. So let
me just briefly talk about the ArrayLoop project, the most recent project that I did. So what we noted is that many of the array analysis tasks involve iteration -- data cleaning, model fitting, many of the machine-learning algorithms involve iterative computation. Arrays are not an exception, so we want to have efficient support for iterative array computation. So I'll give you two examples here. The first one is the sigma-clipping algorithm that I showed you in the demo. It's a data-cleaning algorithm. The way that it works is that we have a stack of images, and the astronomers, before they do the actual analysis, run this data-cleaning algorithm. For each X and Y location, for all the pixels at the same X and Y location, they compute the mean and standard deviation. And then they filter out all the pixels that are outside K standard deviations -- that are away more than K standard deviations from the mean. So they filter out all the outliers, and they do it iteratively, until it converges.
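A minimal NumPy sketch of the sigma-clipping loop just described, with the image stack shaped (time, y, x): in each pass, pixels farther than k standard deviations from the per-(x, y) mean along time are masked out and the statistics are recomputed. The fixed iteration count mirrors the demo; a real run would iterate until nothing more is rejected.

import numpy as np

def sigma_clip(stack, k=3.0, iterations=2):
    data = np.ma.masked_invalid(stack.astype(float))
    for _ in range(iterations):
        mean = data.mean(axis=0)        # per-pixel mean over the time dimension
        std = data.std(axis=0)          # per-pixel standard deviation over time
        data = np.ma.masked_where(np.abs(data - mean) > k * std, data)
    return data                         # outliers masked; co-add with data.sum(axis=0)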
>>: I'm sorry. When you say compute mean standard deviation over all pixel values, is that over
one image or is it over --
>> Emad Soroush: Sure. So we have 2D images, and they are stacked together, so now we store all of those images in a 3D array with X, Y and time dimensions. Then we pick all the pixels at the same X and Y coordinates, along the time dimension, and we do this computation over time for all of the pixels with the same X and Y coordinates.
>>: So what you're looking for is you're looking for, for instance, an area where -- a pixel where
the area around it has a low average, but where the pixel in the center maybe has a slightly higher
average, because maybe it's just a very faint object or something like that. Is that right?
>> Emad Soroush: We are not looking at the pixels around some other pixel. We are actually
trying to filter out all the outliers along the time dimensions, so those are not the actual objects.
Those are the noises that are created when we are capturing the image.
>>: I see, this isn't a --
>> Emad Soroush: Yes. So how can we efficiently run an iterative algorithm in this case? Or the other example that we have is called --
>>: I have a quick question, looking back on the algorithm side. This algorithm that you're showing here, the iterative data cleaning, is computing mean and standard deviations. Do you have cases where you need to compute, for example, a median filter?
>> Emad Soroush: Yes, median doesn't -- I mean, these are the algebraic functions, so we can
optimize it somehow. We can do the incremental iterative processing. But if we have a median,
then it's a holistic function, so it's very hard to leverage the current optimization techniques for
doing the fast iterative computations. So for the median, you can't do that much. But I didn't
have a use case that used the median. I didn't have a use case.
>>: Scientists using a mean filter instead of a median filter in this case? Because in image
processing, actually, median filter is considered more efficient.
>> Emad Soroush: If there is, we can't do that much in this case. We have to run the computation naively. There are ways in the literature. There are -- I have seen in
the literature about how to compute the median in a distributed fashion. So I assume that we can
do it faster than just naively try to compute the median. For example, we can bucketize the data
sets and ask each of the nodes what's the median that you are experiencing and then compute the
median at the coordinator, but we haven't gone to this direction yet. Yes, we used the mean filter
instead of the median filter. So in the second example, when we have the clean image, then we want to extract the actual sources from the image. One of the simplest ways to do this source extraction is to create a 2D kernel array, and this 2D kernel is going to iterate through all the cells in the 2D image. The way that we try to find the objects is that initially we assign a unique label to each non-empty pixel, so we assume that each non-empty pixel is a unique cluster with a unique label. Then, as we iterate this window kernel, we merge the labels: we merge all the labels for the pixels around the center pixel and update the center pixel in the window array. And we do it iteratively until all the labels converge. So this is the next application that we have, and now we want to see the opportunities for array-specific optimizations here.
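A minimal NumPy sketch of the iterative labeling just described: every non-empty pixel starts with a unique label, and each pass replaces a pixel's label with the minimum label in its 3x3 neighborhood, until no label changes. This is the simplified formulation from the talk, not an optimized connected-components algorithm.

import numpy as np

def label_sources(nonempty):
    h, w = nonempty.shape
    big = h * w                                        # sentinel larger than any label
    labels = np.where(nonempty, np.arange(big).reshape(h, w), -1)
    while True:
        padded = np.pad(np.where(labels < 0, big, labels), 1, constant_values=big)
        neigh = np.stack([padded[1 + di:1 + di + h, 1 + dj:1 + dj + w]
                          for di in (-1, 0, 1) for dj in (-1, 0, 1)])
        updated = np.where(nonempty, neigh.min(axis=0), -1)
        if np.array_equal(updated, labels):
            return labels                              # converged: labels identify clusters
        labels = updated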
>>: This last operation you described is more or less a convolution on the single image.
>> Emad Soroush: Yes, it's a convolution operator.
>>: But you convolute across.
>> Emad Soroush: We implement the convolution operator for the big image, as well, by the
overlapping processing techniques. So if we can similarly replicate the boundary cells across
many chunks, then we can run those kind of convolution operators on our big image, as well.
>>: But this is also still a linear operator?
>> Emad Soroush: Yes, this is the simplest way to do the convolution. We can do it better, we
can do the convolution faster, but this is the simplest way that we implemented in the SciDB for
now.
>>: If it's a large convolution operator?
>> Emad Soroush: You mean this window is that large?
>>: If the window is large, you can do a Fourier transform on the image.
>> Emad Soroush: Yes, we kind of have this assumption that this window is not that large, and
because of that, we can run it on a very big image in a distributed fashion. If it's large, we have
to go with different techniques to compute the convolution.
>>: How large is the window?
>> Emad Soroush: The largest one that we get from our scientific collaborators is 10 by 10. It's
not that large -- 10-by-10 array.
>>: That, I think, is on the verge where a Fourier transform probably is efficient. The
convolution becomes multiplication.
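To make the trade-off in this exchange concrete, a hedged comparison in plain SciPy (not the SciDB operator): direct convolution with a k-by-k window costs roughly O(N k^2), while the FFT route costs roughly O(N log N) independent of the kernel size, so it starts to win as the kernel grows.

```python
import numpy as np
from scipy.signal import convolve2d, fftconvolve

rng = np.random.default_rng(0)
image = rng.normal(size=(512, 512))
kernel = np.ones((10, 10)) / 100.0          # the 10-by-10 window mentioned above

direct = convolve2d(image, kernel, mode='same', boundary='fill', fillvalue=0.0)
via_fft = fftconvolve(image, kernel, mode='same')

# Both compute the same linear convolution, up to floating-point error.
print(np.allclose(direct, via_fft, atol=1e-6))
```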
>> Emad Soroush: So here, the purpose of this application is to show how we can do iterative
computation faster. We didn't try to come up with the best algorithm for this actual source
detection; this is a simplified version of the source-detection algorithm. If they want to run it in
specialized code, they would probably implement a much better algorithm for this convolution
operator. That's what we discussed with our astronomer collaborators. They told us, that's good
for now -- go with this implementation, see how much you can boost the performance, and then
later we can run the better algorithm. So the basic approach is to write a shell script that drives
the iteration, expresses the body of the loop as a series of SciDB query statements, and then
submits the statements from the script. But we can do better.
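A minimal sketch of that basic driver approach, with the client call and the query text left as placeholders (the real AFL statements and connection mechanism are not shown in the talk, so both are faked here): an outer loop submits one iteration's worth of queries, reads back a convergence measure, and stops when it reaches the threshold.

```python
# Hypothetical driver loop. `run_query` stands in for whatever client interface
# (a shell call to the SciDB command-line client, a Python binding, ...) is used
# to submit query statements; here it is faked so the sketch runs on its own.
deltas = iter([12.0, 3.0, 0.0])            # pretend convergence measure per iteration

def run_query(statement: str) -> float:
    print("submitting:", statement)
    return next(deltas) if statement.startswith("aggregate") else 0.0

EPSILON = 0.0
MAX_ITERS = 100

for i in range(MAX_ITERS):
    # Body of the loop, expressed as a series of (illustrative) query statements.
    run_query("store(<one sigma-clipping iteration>, state_next)")
    delta = run_query("aggregate(<abs difference of state_next and state>, sum)")
    run_query("store(state_next, state)")
    if delta <= EPSILON:
        break                              # the script, not the engine, decides when to stop
```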
I assume that I have 10 minutes, so let me go faster. The approach that we proposed was to
define a clear abstraction to express iterative array computations, and if the user uses our
abstraction, then we help the user leverage the optimizations that we implemented in SciDB.
The optimizations that we have implemented in SciDB are incremental iterative processing,
overlap iterative processing and the multi-resolution optimization. I will talk only about the first
one in this talk. So this is the abstraction that we defined in SciDB: it is a fixed-point operator
with six parameters. I will describe these parameters with an example. So this is the source-detection
algorithm -- the abstract way of running the source-detection algorithm that I just
showed you in the previous example. We call an array iterative if its cell values are updated
during the course of an iterative computation. So here, we have an iterative array, and these are
the different states of the iterative array. Now, we say an iterative array has converged whenever
some aggregate termination function T, applied to two consecutive states of the array, returns a
value less than or equal to a constant epsilon. So the user gives us this
termination function. For example, here, the termination function is the sum of the delta values
between two consecutive versions, and the epsilon here is defined as zero. So we define the
iterative array, the convergence, and now we have an assignment function. The assignment
function is a mapping from one cell in the iterative array to a group of cells. In the source-detection
algorithm, this assignment function was defined as a window, a 2D window. So I'll
show you how this assignment function works, but for this assignment function, there are two
other functions that are associated with it. The first one is the aggregate function that does the
aggregation computation on all the cells in one group -- the grouping is given by the assignment
function -- and then we have the update function that takes the aggregated result and applies it to
the output cell in the next state of the iterative array. So, briefly, the update function defines the
logic of merging the aggregated result with the next state of the array, and the aggregation
function does the aggregation computation on all of the cells that are grouped by the assignment
function. This is the way that we describe the sigma-clipping algorithm and the source-detection
algorithm in our framework. So, for example, in the source detection
algorithm, the assignment function is defined as a window. The aggregation function is the
minimum function and the update function is the identity function. It's no operation. We just
update the result of aggregation to the next state of the array. In the sigma-clipping algorithm,
the assignment function is defined for a column along the time dimension. The aggregation
functions are the average and the standard deviation, and the update function is basically a
Boolean filter that filters out all of the outliers.
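The exact operator signature is in the paper; the following is only a plain-Python mirror of the pieces named here -- an initial array, an assignment (grouping) function, an aggregate function, an update function, a termination function over two consecutive states and a threshold epsilon -- instantiated for the source-detection example. Names and the sentinel for empty cells are assumptions.

```python
import numpy as np

def fix_point(initial, assign, aggregate, update, terminate, epsilon, max_iters=100):
    """A plain-Python mirror of the fixed-point abstraction described above.

    assign(state)     yields (output_cell, group_index) pairs: which cells feed
                      which output cell (a column along time, a window, ...)
    aggregate(values) computes the aggregate of one group's values
    update(old, agg)  merges the aggregate into the output cell's next value
    terminate(a, b)   measures the change between two consecutive states
    epsilon           convergence threshold on that measure
    """
    state = initial.astype(float)
    for _ in range(max_iters):
        new_state = state.copy()
        for out_cell, group in assign(state):
            new_state[out_cell] = update(state[out_cell], aggregate(state[group]))
        if terminate(state, new_state) <= epsilon:
            return new_state
        state = new_state
    return state

# Instantiation mirroring the source-detection example: the group of each
# non-empty cell is the 3x3 window around it, the aggregate is the minimum
# label in the window, and the update is the identity.
EMPTY = 1e18                                   # sentinel label for empty cells

def window_assign(state):
    n, m = state.shape
    for i in range(n):
        for j in range(m):
            if state[i, j] < EMPTY:
                yield (i, j), (slice(max(i - 1, 0), i + 2), slice(max(j - 1, 0), j + 2))

labels = np.full((4, 4), EMPTY)
labels[0, 0], labels[0, 1], labels[1, 0], labels[1, 1] = 1, 2, 3, 4   # one 2x2 blob
labels[3, 3] = 9                                                      # one isolated pixel

result = fix_point(labels, window_assign,
                   aggregate=np.min,
                   update=lambda old, agg: agg,
                   terminate=lambda a, b: np.abs(a - b).sum(),
                   epsilon=0.0)
print(result[result < EMPTY])    # the blob collapses to label 1.0; the isolated pixel keeps 9.0
```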
Okay, let me show you how the incremental iterative processing works in our framework. So
this is the way that we run the sigma-clipping algorithm. For illustration purposes, I show you
the 2D version of the sigma-clipping algorithm, where we aggregate all the results along the time
dimension. So if we run it naively, then we find those outliers that are highlighted here, and then
in the next iteration, we again need to compute the mean and standard deviation, so we are doing
redundant computation. If we use the incremental iterative processing instead, then first we need
to translate the mean and standard deviation into some building-block functions. Because the
mean and standard deviation are algebraic, you can decompose them into other aggregate
functions: here, the mean is based on the sum and the count, and the standard deviation is based
on the sum, the sum of the squares and the count, so it can be built from those functions. So
instead of computing the mean and standard deviation directly, we first compute these
intermediate aggregate functions, and in the next state, we only push those values that changed in
the previous computation and update those delta aggregates, those intermediate aggregates. And
then we merge the delta aggregates with the final aggregates.
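A hedged, NumPy-only illustration of that rewriting (the real system does this by rewriting SciDB queries; the sigma-clipping setting, the function names and the index layout here are assumptions): keep count, sum and sum of squares as the intermediate aggregates, subtract only the delta of cells that were clipped, and recover mean and standard deviation from the merged values.

```python
import numpy as np

# Intermediate ("building block") aggregates per pixel, over the time dimension.
def initial_aggregates(stack):
    return {
        "count": np.full(stack.shape[1:], stack.shape[0], dtype=float),
        "sum":   stack.sum(axis=0),
        "sumsq": (stack ** 2).sum(axis=0),
    }

def apply_delta(agg, removed_values, removed_pixels):
    """Merge a delta (cells clipped in this iteration) into the running aggregates.

    removed_values : 1-D array of the clipped cell values
    removed_pixels : tuple of index arrays giving each clipped cell's (y, x) pixel
    """
    np.subtract.at(agg["count"], removed_pixels, 1.0)
    np.subtract.at(agg["sum"], removed_pixels, removed_values)
    np.subtract.at(agg["sumsq"], removed_pixels, removed_values ** 2)

def mean_and_std(agg):
    mean = agg["sum"] / agg["count"]
    var = agg["sumsq"] / agg["count"] - mean ** 2     # algebraic recombination
    return mean, np.sqrt(np.maximum(var, 0.0))

# One incremental step: clip a spike without rescanning the whole array.
stack = np.random.normal(size=(50, 8, 8))
stack[10, 3, 3] = 100.0
agg = initial_aggregates(stack)
apply_delta(agg, removed_values=np.array([stack[10, 3, 3]]),
            removed_pixels=(np.array([3]), np.array([3])))
mean, std = mean_and_std(agg)
print(round(float(mean[3, 3]), 3), round(float(std[3, 3]), 3))
```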
So what are the challenges here? The challenges are that we need to understand the logic of the
merge -- how to merge the delta aggregate with the final aggregate -- and also how to translate
our aggregation function. In this project, we tackled those challenges and came up with
approaches for doing this query rewriting: given a fixed-point operator, how to rewrite it into a
set of SciDB queries that run incrementally. So if I want to give you an
idea of how it works: because we force the user to write the algorithm based on our framework,
we have the aggregation function, and we can automatically translate it into its building-block
functions. And also, because we have the cell update function, it basically defines the logic of
the merge operation for us, and the pi assignment function pairs up the cells from the
intermediate aggregation array and the iterative array at the next state. The pi assignment
function tells us how to pair up all of the cells, and we merge them based on the logic that is
defined by the cell update function.
>>: To clarify, when the user writes standard deviation, the system will rewrite it into sum of
squares, sum and count, and these are the functions being pushed. Is it the raw Python being
pushed -- do we push the Python function to each node to be compiled and executed there? Am
I right?
>> Emad Soroush: So the user gives us this fixed-point operator with the mean and standard
deviation. Then, currently, the ArrayLoop is prototyped in Python. You are right. Then it's
translated into SciDB queries -- a bunch of SciDB queries that are rewritten based on its count,
sum and sum [indiscernible]. And then the SciDB queries are issued over SciDB.
>>: So it's the SciDB query being pushed to the back end, and the SciDB query is rewritten in
terms of average, sum, sum square --
>> Emad Soroush: Count, sum, sum squared and so on.
>>: What other operators are there?
>> Emad Soroush: What are they?
>>: What other SciDB queries are there beyond those operators? Can you give a few other
queries?
>> Emad Soroush: We have the -- this translation is hard-coded statically right now, so we have
a function called mean, and we know that this mean function is translated based on the sum and
the count. There are not many translations right now, but if we find some application where we
need more functions, then we can add them to our translator. So there is an interface here where
we can add those rules to this pool of translation rules. Currently, I just added the rules for this
application, but it is extensible; we can add more and more rules.
>>: Do you have a [indiscernible] numerical stability of its own? The standard deviation is a
classic one that can quickly start just generating noise.
>> Emad Soroush: I see.
>>: With scientific computations, the way you actually do the computation can have a huge
effect on the stability of the result. Do you ever encounter things like that? Numerical errors
and so forth. The standard deviation is a classical one, because you are subtracting two big
numbers from each other [indiscernible]. Do you care at all about those?
>> Emad Soroush: These are up to the user, I mean. We haven't cared about it in our system.
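For context on the questioner's point, the classic contrast, which the system described here leaves to the user: the textbook sum-of-squares formula can lose all precision when the mean is much larger than the spread, while Welford's one-pass update does not. This is only an illustrative sketch, not part of the system.

```python
import numpy as np

def naive_std(xs):
    n = len(xs)
    s, sq = sum(xs), sum(x * x for x in xs)
    var = sq / n - (s / n) ** 2          # subtracts two nearly equal huge numbers
    return max(var, 0.0) ** 0.5          # var can even come out negative here

def welford_std(xs):
    # One-pass, numerically stable update of mean and sum of squared deviations.
    mean, m2 = 0.0, 0.0
    for i, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / i
        m2 += delta * (x - mean)
    return (m2 / len(xs)) ** 0.5

# Large offset, tiny spread: the naive formula collapses, Welford's does not.
xs = [1e9 + v for v in (0.0, 0.1, 0.2, 0.3)]
print(naive_std(xs), welford_std(xs), float(np.std(xs)))
```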
And the other contribution that we had was pushing some of those incremental computations
into the storage manager -- for example, when we are merging the delta aggregates with the final
aggregates, pushing some of those computations not into the query processing layer but into the
storage manager. So these are the results that we got on our 1 terabyte of data in the LSST use
case, and basically, we want to show, as a proof of concept, that we can run much faster than in
the non-incremental case. And we also show that, if we push some of the computation into the
storage manager, we gain about 25% in performance. So instead of defining an operator that
does this merge, we can somehow tell the storage manager that, as you're storing these updates,
add them to or subtract them from the previous values. So we had other
optimizations, as well. We had the support for the overlap iterative processing. The challenge
here is that the overlap data that is valid at one iteration is not valid anymore at the next iteration.
It may change. So we need to refresh this overlap data. Then how can we systematically do it?
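A toy illustration of why the overlap has to be refreshed, assuming a 1-D array split into two chunks with a one-cell overlap (the SciDB chunking machinery is not modeled; the smoothing step and halo size are assumptions): after each iteration updates the chunk interiors, the ghost cells are copied again from the neighbouring chunk before the next iteration reads them.

```python
import numpy as np

def smooth(chunk):
    """One 3-point averaging step; only cells with both neighbours are updated."""
    out = chunk.copy()
    out[1:-1] = (chunk[:-2] + chunk[1:-1] + chunk[2:]) / 3.0
    return out

# A 1-D array split into two chunks with a one-cell overlap (ghost cells).
full = np.arange(10, dtype=float) ** 2
halo, mid = 1, 5
left, right = full[:mid + halo].copy(), full[mid - halo:].copy()

for _ in range(3):                      # three iterations of the window operation
    left, right = smooth(left), smooth(right)
    # The ghost copies are now stale: refresh them from the neighbour's freshly
    # updated interior before the next iteration reads them.
    left[-halo:] = right[halo:2 * halo]
    right[:halo] = left[-2 * halo:-halo]
    full = smooth(full)                 # the serial reference computation

# The distributed interiors match the serial result only because of the refresh.
print(np.allclose(np.concatenate([left[:-halo], right[halo:]]), full))   # True
```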
The other case is what we call multi-resolution optimization. This is what we talked about at the
beginning of the talk: often, scientific applications have many resolution levels of the data, and
all of those levels are scientifically meaningful, so we try to leverage this fact and run the
iterative computation first on the low-resolution data sets, and then push the result that we find
back into the final resolution of our data. So, basically, what we are doing here is trying to find
the outlines of the structures first at the low resolution, and once we find those outlines, we
resolve their details at the final resolution. I will refer you to the paper for more details about
these optimizations.
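A small sketch of the multi-resolution idea under simplifying assumptions (plain NumPy, a single refinement level, and a stand-in detection step; none of this is the SciDB implementation): run the expensive detection on a downsampled copy first, then revisit only the flagged regions at the full resolution.

```python
import numpy as np

def downsample(image, factor=4):
    """Block-average the image by `factor` in each dimension."""
    h, w = image.shape
    trimmed = image[:h - h % factor, :w - w % factor]
    return trimmed.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def detect(image, threshold):
    """Stand-in for the expensive per-pixel detection: just bright pixels."""
    return np.argwhere(image > threshold)

# Pass 1: find the rough outlines of structures on the low-resolution copy.
rng = np.random.default_rng(0)
image = rng.normal(0.0, 1.0, size=(256, 256))
image[100:108, 40:48] += 50.0                       # one bright structure
factor = 4
coarse_hits = detect(downsample(image, factor), threshold=5.0)

# Pass 2: refine only the flagged regions at the full ("final") resolution.
refined = []
for cy, cx in coarse_hits:
    y0, x0 = cy * factor, cx * factor
    patch = image[y0:y0 + factor, x0:x0 + factor]
    refined.append(np.argwhere(patch > 5.0) + [y0, x0])   # full-resolution hits in this block
print(len(np.concatenate(refined)))   # detections found without scanning every full-res pixel
```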
So, in conclusion, we had the AscotDB system, the data analysis and exploration platform for
astronomers, and I described each of the components in the AscotDB system that are built on
SciDB: the ArrayStore, TimeArray and ArrayLoop work. Do we have time for the future
research directions? So in the future, I
want to do more research about the data visualization, how to support data visualization as the
first-class citizen in the cycle of the analysis. So, here, what I observed by working with a lot of
scientists is that the pipeline of analysis is often a cycle: you fetch some result when you don't
have any clue what you are looking for yet, then you explore the result, you annotate the result,
and then you push the computation again to the data engine. And then you fetch some other
result, you explore it, and then you do another analysis. So there is often a
cycle of the analysis, and the visualization is a very important component in the exploration
phase. So the challenge is to have seamless integration with the visualization tasks that we are
doing: if our visualization task is integrated into the query that we are asking of the data engine,
then we can do visualization and computation together. For example, if I am just creating some
histograms, if I'm creating some pixelated data, then I can hint to the data engine that I'm just
creating a histogram, so it can do some approximate computation, or it can push the aggregation
to the back end and stop computing those aggregations at the front end. So we want to push as
much computation as possible into the back-end engine: if I'm just creating a pixelated view of
the data, then I can hint to the back-end engine that I'm creating the data at this resolution, so it
should try to return a result at that coarser, lower granularity. And the back-end engine can then
push some aggregation operation into my query plan and generate a smaller result for the client.
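Purely as a hypothetical illustration of that hint (nothing here is an existing interface of the system described; the function names and binning scheme are assumptions): if the client knows it will render at a given resolution, the engine can aggregate down to that resolution and ship only the summary.

```python
import numpy as np

def binned_mean(data, bins_y, bins_x):
    """'Server side': aggregate a large 2-D array down to a bins_y x bins_x grid,
    so only the summary, not the raw cells, is shipped to the client."""
    h, w = data.shape
    trimmed = data[:h - h % bins_y, :w - w % bins_x]
    return trimmed.reshape(bins_y, h // bins_y, bins_x, w // bins_x).mean(axis=(1, 3))

# Client side (hypothetical): the query carries the target resolution as a hint,
# and the engine returns the pre-binned result instead of the full array.
big = np.random.normal(size=(1024, 1024))
summary = binned_mean(big, 64, 64)           # the client only renders 64 x 64 pixels
print(big.nbytes // summary.nbytes)          # data reduction factor (here 256)
```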
The other approaches that I want to take are the approach for larger-scale interactive analytics
and the support for large-scale data analytics as a service. I'm happy to talk about each of those
offline. So this is the list of the collaborators that I had during these three projects, and I'm
happy to answer any questions. Any questions?
>>: Thank you.