>> Badrish Chandramouli: So, it is my pleasure to introduce Ahmed Eldawy from the University of Minnesota. He's Mohamed Mokbel's student. He has been working on SpatialHadoop for some time now, and this work has had phenomenal impact across the open source community, with many downloads, real customers, and so on. So it's very interesting. And apart from that, he had two successful internships at MSR as well as an internship at IBM Watson. So, let's get started. Thank you.
>> Ahmed Eldawy: Okay. Thanks, Badrish, for the introduction. And thank you, everyone, for taking the time to attend the talk. My talk will be mainly about the work that I have been doing during my Ph.D., which is around SpatialHadoop and MapReduce for big spatial data. Before getting into this, I'd like first to give an overview of my Ph.D. journey. Actually, since my work is mostly related to spatial data, I prefer to show it on a map like this. I got my bachelor's and master's at Alexandria University. At that time I also cofounded BadrIT, which is a software company headquartered in Alexandria, and which is still, by the way, running and growing since I started it. Then I moved to Minnesota, where I did my master's and started my Ph.D. program there. In the first year, I worked with other senior students in my group on problems related to recommendation systems. I also got the chance to participate in the SIGMOD programming contest that year; I was actually selected as one of the finalists and presented my work at SIGMOD that year. After this, I had a summer internship at IBM, where I worked on clustering of streaming graphs. After this, I returned to Minnesota for just one semester, where I worked on Sindbad, which was a location-based social network, a project that we started in Minnesota at that time. After this, in the spring, I moved to QCRI, where I had an internship working on a data cleaning system called NADEEF. And then I came here to Microsoft Research; that was my first internship here, where I worked on Hekaton with Paul and Justin. Then I went back to Minnesota, where I spent one year, and actually, in that year, I released the first version of SpatialHadoop and published a couple of papers about different aspects and components of SpatialHadoop. After this, I went to the GIS Innovation Center in Saudi Arabia, where I spent one year helping to start the center [indiscernible]. There were a couple of projects there related to my Ph.D. topic, and they are still working on these projects now. After this, I came back here to Microsoft Research, where I did my second internship, and this time I worked with Badrish and Jonathan on Quill, which is a distributed processing engine for streaming data. Then I finally returned back home to Minnesota, where I continued the work I started in SpatialHadoop and published other work about it and other extensions to SpatialHadoop, such as the visualization [indiscernible] I'm also going to describe in this talk. During this time, I also had a short trip to the University of California, Irvine; this was like a one-month visit to Mike Carey's group there, working on AsterixDB. Actually, the final outcome of my Ph.D. was quite prolific from all these trips and different visits. From my Google Scholar stats, I have more than 550 citations for my work, and I have a number of journal papers and conference papers; these were in top data conferences and journals. I also conducted some tutorials about the work that I have been doing and had the chance to collaborate with more than 35 collaborators from six different institutes. So without further ado, let me start describing the work I have been doing in my Ph.D.
Before getting into the technical details, I would like first to tell a story. This story is actually about the people in computer science and the people in geography, and it started a long time ago, when people started to create maps. They continued to create different maps, bigger maps for the whole world, with different projections and for different applications such as, for example, medical applications or urban planning, and this went on for a long time. Until, early in the last century, something happened: the computer was invented. And as computer [indiscernible] technology grew, it changed the way people think about their work. So then we had these two parallel worlds for some time, and each one worked within its own, until at some point this guy from geography thought about using the computer. He said, I have big data and I need help. It was big data at that time, though not as big as we know it today. So he spoke to the computer science guy and told him, cool computer technology. Can I use it in my application? And of course, people in computer science are known to be very nice. So he said, my pleasure, here it is. But it was not that easy. There was a problem, actually. He said, it's not made for me and I can't make use of it as is. So we have a gap between these two worlds, and someone needs to step in
and fill in this gap. So this is Jack Dangermond, who founded ESRI. He talked to the computer science guys and told them, kindly let me take the technology you have, and to the geographers, let me understand your needs. And then he sat with his friends and founded ESRI as one of the first companies to work with geographical data on computers. Now we have three parallel worlds, and as time went on, each one started to grow on its own. So this is Jim Gray, who was one of the main founders of System R, which grew into the database management system field, and [indiscernible] was very successful, used in many open source and commercial software systems and many applications as well. At the same time, Jack Dangermond continued his work at ESRI and released the first version of ArcGIS running on PCs. And in the geography world here is [indiscernible], who became a professor at Santa Barbara, and he made use of these new technologies to advance the field of geography. Again, this went on for some time, until at some point this guy from geography started to scream out again. He said, I have big data again, and the old technology is not helping anymore. So Jack [indiscernible] his connection with the computer science world, so he said, let me check with my good friend there. So he talked to this guy and told him, cool database technology. Can I use it in my application? Again, the nice guy said, my pleasure, here it is. But the same problem happened again. He said, it's not meant for me and I can't make use of it as is. So now, again, we have this gap between these two worlds, and someone needs to step in and fill in this gap. So this is [indiscernible], who is a professor at Maryland. He talked to the database guys and told them, let me take the technology you have, and again to the geographers, let me understand your needs. And then he founded the area of spatial databases. And now we have four parallel worlds. And again, at this
time, each one of these guys had his own success story. Jim Gray got the Turing Award for his work in databases, and it turned into many commercial and open source software systems. [Indiscernible] became a very successful professor at Maryland, and most of the work in spatial databases was used in many of these commercial software systems. Also, Jack Dangermond is currently a multimillionaire, and his company is the number one company in the area of GIS and geographic processing on computers. Mike Goodchild is currently an emeritus professor at Santa Barbara, and he founded and directed the first national center for GIS for 20 years. And he, again, made use of all this technology to advance the field of geography. This again went on for some time, until earlier this century something happened in the computer world here. So this is Jeff Dean from Google, who invented MapReduce. He actually became very successful; he got a lot of recognition for this work, and it was used in many other software systems and research projects in different areas and in other companies. At the same time, with the [indiscernible] technology, smartphones, and the new satellites, people in geography started to have unprecedented amounts of data that they want to process. So, again, this guy from geography started to scream out again. He said, again, I have big data. This technology is not helping me anymore. At this time, Jack knew the limitations of traditional DBMSs, so he said, it seems like this generation of DBMSs cannot scale anymore for these applications, but he still maintained his connection with the computer world. So he said, let me check with my other good friends there. Now, he talked to Jeff Dean. He told him, cool big data technology. Can I use it in my application? Now, what do you think Jeff would say at this point? He said, my pleasure, here it is. But the same problem happened again. He said, it's not made for me and I can't make use of it as is. Now, what do you think should happen now? Someone needs to step in and fill in the gap between these two worlds. Someone needs to talk to the big data guys and tell them, let me take the technology you have, and to the people in geography, let me understand your needs. And we believe that SpatialHadoop is the system that fills in this gap. And this is what I'm going to describe to
you in this talk. So the story that I have just described is all about geographical data. However, there are many other examples of spatial data that need to be processed. For example, think about [indiscernible] blogs, or think about tweets. These are available as terabytes of data, or huge numbers of points or records, and we need this big data technology to process them. Or medical data, for example, like brain images or x-ray images, which again are [indiscernible] bytes of data, and we need a way to process them efficiently with this big data technology. Smartphone data, sensor data, satellite data: all of these are very big data sets that we need to process. So when we started SpatialHadoop, we thought about using Hadoop, because it was actually the state of the art at that point. So we thought about using Hadoop to process spatial data, and we asked ourselves a question: can we actually use Hadoop to process spatial data? The simple answer is yes. For example, you can express a range query in Hadoop like this, and it works fine. So, for example, if you have 60 gigabytes of data and you have 20 machines, it can process it within about 200 seconds.
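As a rough sketch of what such a brute-force job could look like (this is an illustration, not the code shown on the slide; the mapper name, the query rectangle values, and the CSV record layout are assumptions), a map-only Hadoop job simply tests every record against the query rectangle:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only job: every record in the file is parsed and tested against the
    // query rectangle, so the whole data set is scanned regardless of the query.
    public class RangeFilterMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

      // Query rectangle (xmin, ymin, xmax, ymax); in a real job it would be
      // passed in through the job configuration. Values here are placeholders.
      private final double qxmin = -10, qymin = 30, qxmax = 5, qymax = 45;

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Assume a CSV line of the form "id,x,y,...".
        String[] parts = value.toString().split(",");
        double x = Double.parseDouble(parts[1]);
        double y = Double.parseDouble(parts[2]);
        if (x >= qxmin && x <= qxmax && y >= qymin && y <= qymax) {
          context.write(value, NullWritable.get());  // record is inside the range
        }
      }
    }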
However, the real question is: can we do better than this? So what we ended up doing is we [indiscernible] spatial data into Hadoop to get SpatialHadoop, where we can express the same query in a more efficient way. We have a better way to express the query, but it's not only about the language: it also employs efficient query processing and efficient indexing, which allows it to run the same query in just two seconds, which is two orders of magnitude better performance. You have a question?
>> [Indiscernible] especially given that there's a lot of prior work on building parallel spatial databases, and almost all commercial databases [indiscernible] spatial support. Why not just use a parallel database version of that? Are you going to talk a little more about the difference between that and Hadoop?
>> Ahmed Eldawy: I can't really compare Hadoop to a parallel database, but that's a good question. Like, why not just use a parallel database for this? The same question could be asked for Hadoop itself, right? Why is Hadoop out there at all? The answer is that the use cases of Hadoop and parallel databases are a little bit different. In a parallel database, you have structured data, and you spend time loading the data into the parallel database, build some indexes on it, and then run SQL queries on it. That's one use case. The other use case is Hadoop, where you don't have to pay the overhead of loading the data, but you have other types of queries that you run. For example, in Hadoop, you write MapReduce programs, and with MapReduce programs you can express other types of queries that usually cannot be expressed in SQL. So it's a little bit different use case. What we're providing here is: if you are already using Hadoop and you are processing spatial data and you are looking for a way to improve your queries, then you can use SpatialHadoop. So it comes for people who are already used to Hadoop, [indiscernible] Hadoop rather than a parallel database, and who want to process spatial data. Yeah?
>> [Indiscernible]?
>> Ahmed Eldawy: HadoopDB, for example, right, where you can use instances of a database inside the Hadoop system. Yeah, so it depends on the application or the use case you're looking for. This works for some use cases, but there are other use cases where you want to write a MapReduce program. If you use HadoopDB, for example, you still write a SQL query, right? It's just another way to build a parallel database that relies on Hadoop. So, depending on your use case: if you are interested in writing SQL programs for spatial data, then you should go for a spatial parallel database. If you are interested in writing MapReduce programs that analyze big data, then it's better to use SpatialHadoop. So depending on your use case, you should decide which system to use. Now, what we propose here is SpatialHadoop, which is like [indiscernible] of Hadoop that works with spatial data in a much more efficient way. And again, it's not just about the high-level interface; it actually modifies the core of Hadoop itself to work better with spatial data, and this is what I'm going to describe in this talk. At a high level,
SpatialHadoop is an open source project. We actually released it in February 2013 as an open source project, and within one year it was downloaded more than 80,000 times and got the attention of both industry and academia. There are many big companies that approached us and showed their interest in SpatialHadoop, and there are also many students, both undergraduate and graduate students, in universities worldwide who are interested in SpatialHadoop; they are using it to do their research projects and their dissertations. We also had the chance to collaborate with different universities on research projects related to SpatialHadoop. I'm just going to mention one example, which is the Cornell Lab of Ornithology. Although these are not computer science people, they still approached us and showed their interest in using SpatialHadoop in a research problem they are working on. Other than this, I have also given more than five keynotes, tutorials, and invited talks about SpatialHadoop and my work in big spatial data. And besides the open source project that we released, we also released more than 500 gigabytes of public data sets which people can use for benchmarking and testing their applications and their systems. Also, most recently, and after a year-long process with the Eclipse Foundation, the Eclipse Foundation incubated SpatialHadoop as an open source project. It will continue as an open source project, but it will have additional community support and more recognition, so that big companies can use it. It has been renamed to GeoJinni for legal purposes, but it's essentially the same project. So [indiscernible] SpatialHadoop, and if you are interested, I'll be happy to answer your questions later. Besides providing the open source code, we're providing tutorials and instructions on how to install it and how to use it. In this talk, I start by giving the motivation for the work on big spatial data and SpatialHadoop. After this, I'm going to go into the internal system design of SpatialHadoop, so I'm going to describe the core components of SpatialHadoop. After this, I'm going to describe some applications that we used SpatialHadoop to build, and then I'm going to go over some related work and some experimental results of SpatialHadoop. Then I'm going to go over other research projects that I have been working on and my future research plans. So let's
start with the overall picture of SpatialHadoop. We've got layers inside the system that are designed to work efficiently with spatial data. At the lowest level we have an indexing layer, and in the indexing layer we are concerned with storing big spatial data sets efficiently in the [indiscernible] file system. So we have a big file that contains spatial data, we need to store it in a [indiscernible] environment, and we want to find the best way to store and process it. On top of this, we have the MapReduce layer, and in the MapReduce layer, we extend the MapReduce query processing engine so that it can access the underlying indexes. So you still write a MapReduce program, but you have some access methods to the underlying spatial indexes. On top of this we provide an operations layer, and in the operations layer, we provide a wide range of spatial operations that are built efficiently on top of the indexing and MapReduce layers to run much more efficiently than in traditional Hadoop. Then, on top of this, to hide the complexity of the system, we provide a high-level language called Pigeon, which has spatial constructs and hides the complexity of the system; it actually makes the system usable by non-technical users. We also recently added the visualization layer, where users can interactively explore big spatial data sets efficiently in a kind of visual interface. Also, on the side, for spatio-temporal data, we add ST-Hadoop, which is an extension that goes across all these layers and adds efficient support for spatio-temporal data. Also in the system we have a wide range of applications that use SpatialHadoop as a backbone to process big spatial data. In this talk, I'm going to focus on the indexing, operations, and visualization layers, and I'm going to go quickly over some applications that we built using SpatialHadoop. So let's start with the indexing layer. For the indexing layer,
we have a big spatial data set and we need to store it efficiently in a
distributed file system. So let's first see how traditional Hadoop loads a big
data set in HDFS or the Hadoop file system. So if you have a big file like
this, it chops it down into blocks of 64 megabytes and then puts each block on one of the machines. However, when it does this chopping, it doesn't take the spatial locations of records into account, which means that spatially nearby records, such as related records, may end up on two different machines, which will reduce the performance of query processing. And if you want to use one of the traditional spatial indexes, such as the R-tree or the quad tree, they cannot be applied as is in HDFS, because HDFS is too restrictive compared to a traditional file system: once you write a file in HDFS, you cannot modify it. This is a limitation by design in HDFS which is totally different from a traditional file system. So we cannot use the traditional indexes as is, and at the same time, the data sets we work with are really huge, so we cannot just load it all on one machine and let it partition or spatially index the data efficiently. So what we ended up providing in SpatialHadoop is a two-layer index design which overcomes these limitations. In the first layer, we have a global index, which again partitions the data into blocks of 64 megabytes, but it puts nearby records in the same block. So each block here will contain records that are spatially relevant or spatially nearby, which will improve the query performance, as we are going to show shortly.
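A minimal sketch of this two-level design might look like the following (GlobalIndex, Partition, and MBR are hypothetical names used for illustration, not SpatialHadoop's actual classes): the global index is just the list of partition boundaries, and each roughly 64-megabyte block carries its own local index.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of a two-level spatial index over HDFS blocks.
    class MBR {
      double xmin, ymin, xmax, ymax;
      MBR(double xmin, double ymin, double xmax, double ymax) {
        this.xmin = xmin; this.ymin = ymin; this.xmax = xmax; this.ymax = ymax;
      }
      boolean overlaps(MBR o) {
        return xmin <= o.xmax && o.xmin <= xmax && ymin <= o.ymax && o.ymin <= ymax;
      }
    }

    // One entry of the global index: the boundary of one ~64 MB HDFS block.
    class Partition {
      MBR boundary;        // spatial extent of the records stored in this block
      String blockPath;    // where the block lives in HDFS
      // The block itself also stores a local index (e.g. a small R-tree)
      // over its own records, built independently on each machine.
    }

    // The global index is small enough to keep in memory: it is simply the list
    // of partition boundaries, used to decide which blocks a query must touch.
    class GlobalIndex {
      List<Partition> partitions = new ArrayList<>();

      List<Partition> overlapping(MBR query) {
        List<Partition> result = new ArrayList<>();
        for (Partition p : partitions)
          if (p.boundary.overlaps(query)) result.add(p);
        return result;
      }
    }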
>> Can I ask a question? You only consider two dimensions, right? The geo data is always two-dimensional.
>> Ahmed Eldawy: It's not always two dimensions. What we propose, if you talk about the concepts, can actually apply to higher dimensions as well, since we're using concepts like the R-tree and the quad tree, which can handle higher dimensions, while the current version of SpatialHadoop is mainly two-dimensional, because this is the type of data set that we have access to. Yeah?
>> [Indiscernible]?
>> Ahmed Eldawy: So, we just have two dimensions and we deal with it as [indiscernible] space.
>> [Indiscernible].
>> Ahmed Eldawy: So this is like, let's say I have a spatial attribute, which can contain not just points but also rectangles or any polygons or lines, and we process this attribute to [indiscernible]. So, Ravi, you had a question?
>> [Indiscernible] question. So the classic use case of Hadoop is: I load the file, I don't think about it. It looks like you're replicating a database here: [indiscernible] query language, there is an indexer, there is... So my question is, does the user have to worry about [indiscernible] index? Does it vary for different spatial data sets, or is it obvious? It's one of the questions we've been asking. With Hadoop, you don't normally have to think about indexing, right?
>> Ahmed Eldawy: Right.
>> So does it do something like auto indexing? Does it understand which attributes are relevant and index them automatically?
>> Ahmed Eldawy: No. The question is how to find which attribute to index, right? What is the spatial attribute? Because typically when you load a file into Hadoop, it doesn't really understand the format of the underlying file. So in this case, if you are loading data into SpatialHadoop, it has to understand the underlying format, and [indiscernible] has to define which attribute to process. It could be, like, a point or a polygon. So typically we support different file formats. For example, if you have CSV files and we have a column [indiscernible], which is like a polygon or a line, then SpatialHadoop can use this column to index the data.
>> So as part of the data loading process, you essentially define the schema that's going to result from building the index, and it's assumed that the data is in a format that will work for that.
>> Ahmed Eldawy: Right. So it's [indiscernible], like, defined for each record how to get the spatial attribute out of it, which is basically just the MBR of this record. This is the main thing that we need to do. So it could be a different format; SpatialHadoop can work with different formats, but it at least needs to understand the spatial attribute of each record to be able to process the data like this, to partition the data efficiently like this. So, yes, at the first level, [indiscernible] blocks of 64 megabytes, but we take the spatial attribute, or the location, of each record into account, and then again we put each block on one of the machines. At the second level, because each block is like 64 megabytes or 128 megabytes of data, there is still room for improvement, for better optimizing each partition. So we build a second level, which is the local indexes, where we organize the records inside each partition efficiently. Now, the main challenge is actually how to find these boundaries of partitions, because, as we mentioned, the data is really huge, so we cannot let one machine look at all the data and decide how to partition it, and at the same time, we cannot use the traditional ways of indexing or partitioning the data because of the limitations of HDFS. So when we started on this, we first thought about using a uniform grid to partition the data, which is very easy to implement: just scan the data and assign each record to the overlapping grid cell. But it doesn't work well with highly [indiscernible] data, which is the normal thing with spatial data; it only works with [indiscernible] distributed data. So we had to think of a better way of indexing the data. So we started with an R-tree index, which I'm going to describe for you now: how to build the R-tree index with these limitations, and then we will quickly see how we can extend this work to
support other spatial indexes. In the first step, we read a sample out of the data. So we have a very big file, and we typically read a one percent sample of the data. For this one percent sample, we take only the location of each record into account; so if you have a complex polygon, we just take its [indiscernible], which actually [indiscernible] the size of the data that we need to work on. One percent is the typical number. For example, if you have 100 gigabytes of data, with this sampling and [indiscernible] points, it can be reduced down to something like 14 megabytes, which can be easily loaded onto a single machine. In the next step, one machine [indiscernible] the data, looks at it, and then partitions the space based on this sample. This happens on a single machine, but it typically takes only a fraction of a second, because the sample is very small compared to the original data. What this machine does is load all this sample data into an in-memory R-tree, but while building this R-tree, it adjusts the leaf capacity using this equation. The idea of using this equation is that it tries to adjust the number of leaf nodes in the R-tree that it builds to be equal to the number of HDFS blocks in the index that we are going to build, which means we end up with a number of leaf nodes equal to the number of HDFS blocks. The next step is actually very interesting: we throw away the sample, because we don't need it anymore, and we also throw away the tree that we built. So we build this tree and then throw it away, but we keep only the leaf level. We take the leaf nodes, and take the boundaries of each leaf node as the MBR, or the boundary, of the corresponding block in the index that we want to build, which means that we can now easily scan all the data in parallel and assign each record to the overlapping partition. So we converted the complex index-building technique to just a scan of the data, which can be done easily in parallel, and we don't have to worry about the limitations of HDFS anymore, because each record will know a priori which partition it will go to. As a final step, we can build a local index in each of these partitions, and that's very easy, because each block is typically 64 megabytes, so we can build all these local R-trees in parallel. So this is actually how we build the R-tree, and I'll show you shortly that this is actually a very efficient way to build an R-tree. You have a question? Yeah.
>> [Indiscernible]?
>> Ahmed Eldawy: We cannot guarantee this, of course, because we were just reading a sample. A partition can be slightly more or less than this. If it's less than 64 megabytes, then we'll have a partially filled block. If it goes over 64 megabytes, then we'll write multiple blocks for this partition. So some partitions can be slightly more than one block, and some of them can be half filled or partially filled blocks. In reality, they are all very close to 64 megabytes of data. So, this is actually [indiscernible] R-tree.
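Here is a simplified sketch of the boundary-computation step described above, assuming point samples and an STR-style split; rather than materializing an in-memory R-tree, it goes straight to the leaf-level tiling, and the class name and the exact leaf-capacity arithmetic are assumptions, not SpatialHadoop's actual code.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Simplified sketch: compute partition boundaries from a small sample using
    // an STR-style (sort-tile-recursive) split. Runs on a single machine because
    // the sample (about 1% of the input) is tiny compared to the full data set.
    public class SamplePartitioner {

      public static double[][] computeBoundaries(List<double[]> sample,
                                                  long inputSizeBytes,
                                                  long blockSizeBytes) {
        // Target roughly one leaf (partition) per HDFS block of the final index.
        int numPartitions = (int) Math.ceil((double) inputSizeBytes / blockSizeBytes);
        int numStrips = (int) Math.ceil(Math.sqrt(numPartitions));
        int perStrip = (int) Math.ceil((double) sample.size() / numStrips);

        // STR step 1: sort the sample by x and cut it into vertical strips.
        sample.sort(Comparator.comparingDouble(p -> p[0]));
        List<double[]> boundaries = new ArrayList<>();  // each is {xmin,ymin,xmax,ymax}
        for (int s = 0; s < numStrips; s++) {
          int from = s * perStrip, to = Math.min(sample.size(), from + perStrip);
          if (from >= to) break;
          List<double[]> strip = sample.subList(from, to);
          // STR step 2: sort each strip by y and cut it into cells (leaf nodes).
          strip.sort(Comparator.comparingDouble(p -> p[1]));
          int cells = (int) Math.ceil((double) numPartitions / numStrips);
          int perCell = (int) Math.ceil((double) strip.size() / cells);
          for (int c = 0; c < cells; c++) {
            int cf = c * perCell, ct = Math.min(strip.size(), cf + perCell);
            if (cf >= ct) break;
            double xmin = Double.MAX_VALUE, ymin = Double.MAX_VALUE;
            double xmax = -Double.MAX_VALUE, ymax = -Double.MAX_VALUE;
            for (double[] p : strip.subList(cf, ct)) {
              xmin = Math.min(xmin, p[0]); xmax = Math.max(xmax, p[0]);
              ymin = Math.min(ymin, p[1]); ymax = Math.max(ymax, p[1]);
            }
            boundaries.add(new double[]{xmin, ymin, xmax, ymax});
          }
        }
        // Each boundary becomes the MBR of one partition; a parallel scan of the
        // full data set then assigns every record to the overlapping partition.
        return boundaries.toArray(new double[0][]);
      }
    }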
And now, let me ask you a question. If we want to modify this algorithm to build a quad tree index instead, which steps do we need to redefine in this algorithm? If you look closely, you find there are only two spots, two places, where we have R-tree-specific logic, which are the bulk loading step and the local indexing step, which means that if we take these two R-trees and put quad trees instead of them, we end up with a bulk-built quad tree index. You can similarly build a wide range of spatial indexes using this technique. And this [indiscernible]. So within the wide range of spatial indexes, [indiscernible].
>> What is constraining you to use the same partitioning strategy as what you're [indiscernible]?
>> Ahmed Eldawy: Nothing. You don't have to use the same. Actually, you don't have to use both of them.
>> The first part is very well formulated. Why not figure out the [indiscernible]?
>> Ahmed Eldawy: [Indiscernible]. Sorry?
>> You can formulate a problem, an algorithmic problem: I want to find [indiscernible] where N is your number of nodes, where [indiscernible] when scaled up, where something [indiscernible].
>> Ahmed Eldawy: Right, right. And this actually --
>> Like maybe there's a better way to do it.
>> Ahmed Eldawy: Right. So what you're mentioning is totally right. The problem is how to find these boundaries of partitions; the index is just a solution to this problem. So what I'm showing here is actually one solution to the problem, but what you are mentioning is correct. The problem is actually [indiscernible], actually the same problem as spatial indexing. The main problem of spatial indexing is that you have to partition the data into smaller parts and group nearby records in each one.
>> [Indiscernible].
>> Ahmed Eldawy: So basically, the basic idea of spatial indexing is to put nearby records together so that when you do query processing, you can quickly find which partitions you want to process, or which parts of the data you need to process. If you use an R-tree or a quad tree, they are just trying to solve this problem in one way. Of course, in traditional indexing you also have to worry about things like [indiscernible], which we're not worried about here. So I'm just worried about this part, where you just partition the data into smaller parts.
>> So did you use like the STR [indiscernible] technique?
>> Ahmed Eldawy: Yeah. So for bulk loading the R-tree, we used the STR bulk loading technique. However, yeah, this is just our solution; you can definitely use any other kind of bulk loading technique. You could use an R*-tree, for example, or a [indiscernible] tree. Yes?
>> So the spatial world has not converged on [indiscernible], why is the system supporting so many [indiscernible]? Is it still an open problem, like, are there two good -- are there [indiscernible] index structures?
>> Ahmed Eldawy: Yes. So the question is why we support more than one index structure. I would say that it's still an open problem, and we are still looking into other spatial index structures as well. There are many index structures, and there's no clear winner between them, because once you have more than one dimension, you don't have a linear order for the records. So some index structures are good for specific queries, and other index structures are good for other queries. [Indiscernible] so we provide this wide range of spatial indexes, and we had a paper with an experimental evaluation that compares these different index structures, and what we found is that there is no clear winner between them. So depending on the shape of your data sets, how they look, and the type of query you are running, you can choose one of them. Think of it like a [indiscernible]: you have a hash index and a B-tree index and maybe a [indiscernible] index, right? So we have all these indexes in a database, and then the user, or the administrator, needs to decide which index to use based on the queries they want to run. It's something similar here. SpatialHadoop is a big system that supports all these different kinds of spatial indexes, and depending on the queries that users want to run, they can choose the most appropriate or most suitable index for this. Yeah, so basically, this is the generic way of building the indexes, and users can easily add their own spatial indexes to this. So [indiscernible] add more spatial indexes to SpatialHadoop. Now, my next slide is actually my favorite
slide in the indexing part, because it somehow summarizes the whole indexing story of SpatialHadoop. This actually shows you the R-tree index that we built in SpatialHadoop for a 400-gigabyte road network data set. You can see here that all the black rectangles represent 64-megabyte blocks of data. So, for example, in the ocean here, because the data is very sparse, we get a very large partition, while in this area, like in Europe, for example, we get very small partitions to get 64 megabytes of data. There are actually two points that I want to make on this figure. First, the idea of sampling and bulk loading actually works: you can see here that just with a one percent sample of the data, we can still get a very efficient way to index the data. That's one point. The second point is that you can easily see now how SpatialHadoop can be much faster than traditional Hadoop. If you process this data in traditional Hadoop, and let's say your query is focusing on this area, you have to process all the data, right? This file contains like 10,000 partitions, and you have to process all 10,000 partitions, because you can never know which blocks you want to process. While in SpatialHadoop, because we have these boundaries of partitions, we can limit the number of partitions that we want to process, and we can still parallelize them over multiple machines. This is an example of the R-tree, but we can also build a quad tree, which will look like this, or if you have uniformly distributed data, you can build a uniform grid index. And if you wonder how this data would look if [indiscernible] into traditional Hadoop, I have this slide just for you. You can see here that if you load this data into traditional Hadoop, almost all the partitions will cover the whole input space, which shows you how SpatialHadoop can be much more efficient: if you are [indiscernible], you almost have to process all the partitions. So to summarize the indexing part, we mentioned the limitations of traditional Hadoop for building indexes, showed how SpatialHadoop can overcome these limitations, and provided a wide range of spatial indexes in SpatialHadoop. Question? Yeah?
>> [Indiscernible]?
>> Ahmed Eldawy: The global indexing part, yes. It's built by bulk loading the R-tree and then taking only the leaf level, so it ends up as a kind of flat partitioning. This is the global index, so for each record, it decides which partition to put it in. The second level, which is the local indexing, can actually have a hierarchical index structure.
>> [Indiscernible].
>> Ahmed Eldawy: Exactly. So the global index is typically stored in memory, and it is basically the boundaries of each partition. So without having to process all the data, it can prune all the partitions that are outside the query range. So this [indiscernible]. [Indiscernible] other queries that we can run also with these indexes. So the next part is also very interesting, and it is tied to spatial indexing: how to use these indexes in spatial operations. If you are writing a MapReduce program for a specific spatial operation, how do you make use of these indexes in it? SpatialHadoop ships with a wide range of operations. We started with basic operations such as range query and [indiscernible] neighbor, and then added spatial join operations and a wide range of computational geometry operations such as convex hull, polygon [indiscernible] construction, or [indiscernible] triangulation. And this layer is actually extensible: if users are interested in adding their own operation in MapReduce, they can just write their own operation in the same way that we wrote these operations. So let me mention some examples of these operations. Let's start with the range query, which is a simple operation: you have a set of records and a query range here, and you want to find the records that overlap with this query range. In this case, we can use the global index to prune the partitions that are outside the query range, and then, for the remaining partitions, we can find the matching records efficiently using the local indexes in the matching partitions. So this is actually a very simple query, but it shows you how we can make use of the two levels of indexes to run this query efficiently in MapReduce.
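Reusing the hypothetical GlobalIndex and MBR classes sketched earlier, the range query then boils down to pruning with the global index and searching only the surviving blocks; the local search below is a plain filter standing in for the local R-tree lookup, and none of these names are SpatialHadoop's actual API.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of range-query processing with the two-level index:
    // 1) the global index prunes blocks whose MBR misses the query range, and
    // 2) only the surviving blocks are searched (in parallel, one map task per
    //    block), normally using the local index stored inside each block.
    public class RangeQuery {

      // Which partitions become map tasks; everything else is pruned without
      // ever being read from HDFS.
      static List<Partition> plan(GlobalIndex index, MBR query) {
        return index.overlapping(query);
      }

      // Stand-in for the local search inside one block: here just a filter
      // over the block's records, each record being a point {x, y}.
      static List<double[]> searchBlock(List<double[]> records, MBR query) {
        List<double[]> result = new ArrayList<>();
        for (double[] p : records)
          if (p[0] >= query.xmin && p[0] <= query.xmax
              && p[1] >= query.ymin && p[1] <= query.ymax)
            result.add(p);
        return result;
      }
    }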
Another operation is the spatial join operation. Here we have two sets of records and we need to find the overlapping records between the two files. This case is actually more challenging, because you have two data sets and each one is indexed separately, so you might have two different indexes. In the best case, if you have the same index used for the two files, then you end up with a kind of one-to-one correspondence between the partitions of the two files, so you just have this [indiscernible] correspondence and then you can find the matching records in each pair of partitions efficiently. However, in reality, this case almost never happens, because [indiscernible] you have two files and each one is indexed separately, so you end up with two different types of indexes. For example, even if you are using a [indiscernible] formulated index, you will end up with, say, a three-by-three grid and a four-by-four grid for these two files, depending on their sizes. Now we have two ways we can continue the spatial join query processing in this case. One way is to do the join directly. In this case, we find the overlap between the two sets of partitions; for example, in this case, we find a total of 56 overlapping pairs between the two indexes, so each pair of blocks, or each pair of partitions, would be processed on one of the machines to find the overlapping records. That's one way. The other way is to do a partition join. In this case, we repartition one of the files; for example, we will repartition the three-by-three grid that was used here to be a four-by-four grid, which exactly matches the other file. This goes back to the previous case, where you have a one-to-one correspondence, and we end up with only 16 overlapping pairs of partitions
between the two files. Now, the challenge is which one of these two algorithms we actually want to use. If we do the join directly, we don't have to pay the overhead of repartitioning, but we might end up with a huge number of overlapping pairs. While in the partition join, we have to pay the overhead of repartitioning one of the files, but we minimize the number of overlapping pairs. So, based on this, we actually equip SpatialHadoop with a cost model which compares the cost of the two algorithms and then decides which one to use for a specific query. It takes into consideration the number of partitions in each file and the number of overlapping pairs of partitions between the two files, and the cost model actually ends up being a very simple rule-based model.
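The exact formulas of the cost model are not spelled out here, so the following is only an illustrative rule-based sketch, with an assumed constant for the relative cost of rewriting a partition versus joining a pair of partitions:

    // Illustrative rule-based choice between the two join strategies; the
    // constants and the cost formula are assumptions, not SpatialHadoop's model.
    public class JoinPlanner {

      /**
       * @param pairsDirect             overlapping partition pairs if joined as-is (e.g. 56)
       * @param pairsAfterRepartition   overlapping pairs after repartitioning (e.g. 16)
       * @param partitionsToRepartition number of partitions that must be rewritten
       * @param repartitionFactor       assumed relative cost of rewriting one
       *                                partition vs. joining one pair of partitions
       */
      static boolean shouldRepartition(int pairsDirect,
                                       int pairsAfterRepartition,
                                       int partitionsToRepartition,
                                       double repartitionFactor) {
        double costDirect = pairsDirect;
        double costRepartition = pairsAfterRepartition
            + repartitionFactor * partitionsToRepartition;
        return costRepartition < costDirect;
      }
    }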
So, another [indiscernible] area of operations is computational geometry. This actually all started with a polygon union operation. There was a [indiscernible] company that approached us and showed their interest in using SpatialHadoop to speed up this query, the polygon [indiscernible] query, and [indiscernible] continued on this and provided a complete library of computational geometry operations in SpatialHadoop. So we added a skyline operation, convex hull, farthest pair, closest pair, basic [indiscernible], Voronoi diagram construction, and Delaunay triangulation. And all of them are done efficiently using the spatial indexes that we build in SpatialHadoop. In all these operations, we start with a single machine as the baseline. Then we provided a Hadoop implementation for each of these algorithms, which does not use the spatial indexes but [indiscernible] using MapReduce, and we get up to [indiscernible]X better performance; this is actually for the skyline operation over 128 gigabytes of data using 20 machines. You have a question?
>> [Indiscernible]?
>> Ahmed Eldawy: So skyline, for example: say you have billions of points and you need to find the skyline of this data set. Think about a real application that can use this. I'll give you just one example, the Delaunay triangulation. The Delaunay triangulation is used to triangulate the space based on the distance between points, and this is actually used to model the surface of the earth. So you have LiDAR data, which tells you the altitude of different points over the whole earth surface, and to [indiscernible] a model for the whole earth surface, you have to find the triangulation of all this data. We have huge data sets for this; we have billions of points, and we compute the Delaunay triangulation of all of them to find a good model for the earth's surface. So this is one example of this.
>> So this would be so that you can do, for instance, a more accurate spatial
join or something like that. So what's the value of having the triangulated
surface? It's really almost like a 3D model topology or something like that.
>> Ahmed Eldawy: Exactly. So you have the points, like you have the altitude of different points on the earth's surface, and you want to [indiscernible] estimate the altitude at some other point. The way we do it is [indiscernible] triangulation, and now let's say you [indiscernible] need the altitude at this point. You take into account the three points that form the triangle around it. So, for example, you have a point here; you don't have the altitude of this specific point, but you can estimate it by taking the altitudes of those three points into account.
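As one concrete way to do that estimation (the talk does not give the exact formula; barycentric interpolation over the enclosing triangle is a standard choice and is only an illustration here):

    // Estimate the altitude at (x, y) from the three vertices of the enclosing
    // triangle of the triangulation, using barycentric interpolation.
    public class AltitudeInterpolation {

      // Each vertex is {x, y, altitude}.
      static double estimate(double x, double y,
                             double[] a, double[] b, double[] c) {
        double det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1]);
        double wa = ((b[1] - c[1]) * (x - c[0]) + (c[0] - b[0]) * (y - c[1])) / det;
        double wb = ((c[1] - a[1]) * (x - c[0]) + (a[0] - c[0]) * (y - c[1])) / det;
        double wc = 1.0 - wa - wb;
        // Weighted average of the three known altitudes.
        return wa * a[2] + wb * b[2] + wc * c[2];
      }

      public static void main(String[] args) {
        double[] a = {0, 0, 100}, b = {10, 0, 200}, c = {0, 10, 300};
        System.out.println(estimate(3, 3, a, b, c)); // a point inside the triangle
      }
    }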
>> So if you want to know that Denver is a mile high or something like that.
>> Ahmed Eldawy: Right. Yeah.
>> Okay. I get it.
>> Ahmed Eldawy: Yeah.
>> So with the 20 machines you got 29 X speedup?
>> Ahmed Eldawy: Yeah. Actually, [indiscernible]. Yeah. Yeah. So we start with this, which is a parallelization of the existing algorithms, but we didn't stop here. We continued by providing more efficient algorithms that take the spatial indexes into account, and achieved up to 260X better performance for the same application, or for the same query in this case. Yeah?
>> [Indiscernible].
>> Ahmed Eldawy: Right. Right. Yeah.
>> [Indiscernible].
>> Ahmed Eldawy: We tried -- well, not with the [indiscernible], but for other parts, we had up to five terabytes of data. Basically, we were limited by the resources we had, because we had to run this on an intranet cluster that we have at the university, and it was not that big at the time. We had 20 machines, and each machine had something like 52 to 100 gigabytes of storage, so it was not really huge, because you also have to have room for replication and all this stuff.
>> But you mentioned that some of the users are [indiscernible]; is that
right?
>> Ahmed Eldawy: Yeah. That's actually one of the things that [indiscernible]
visualization part, so with visualization we tried up to like, yeah, petabytes
of data and [indiscernible] of points basically.
>> In a sense, what you're saying here is [indiscernible] pointing out earlier,
you are doing a lot of the preprocessing and indexing up front, just like you
would in a [indiscernible], right?
>> Ahmed Eldawy: Right.
>> And so, either your cost has to be way lower in this setting, or your
performance has to be higher, right? How do you compare?
>> Ahmed Eldawy: That's a good point. So we didn't actually compare to a parallel database, but we compared to a single-machine database. For indexing, for example, we [indiscernible] compare in an experimental setting, but this was more motivation for our work. We had, for example, a 128-gigabyte data set, and we tried to index it using a single machine, which seemed like it should be possible, right? This is not really huge, but it took a very long time.
>> [Indiscernible].
>> Ahmed Eldawy: Sorry?
>> What is running on the single machine?
>> Ahmed Eldawy: In this case, we are running [indiscernible] queries on a single machine. For example, let's say the [indiscernible] triangulation: we have a set of points and we compute the [indiscernible] triangulation for all of them.
>> [Indiscernible] single node DBMS.
>> Ahmed Eldawy: Not a DBMS, just a Java program. We just load all the data in memory and do all the query processing in memory. Yeah.
>> Is there [indiscernible] DBMS with spatial support? Is there [indiscernible]?
>> Ahmed Eldawy: I think Oracle supports this, like Oracle Spatial.
>> [Indiscernible].
>> Ahmed Eldawy: I think it is, yeah. I didn't use it, but I think so, yeah. It's built on the DBMS, so if you can parallelize Oracle, then you can parallelize Oracle Spatial as well. Although we didn't [indiscernible] try to compare with these systems. There is other [indiscernible] that compares Hadoop to parallel databases, so here we tried more to compare Hadoop to SpatialHadoop, to show the effect of the things that we added to Hadoop; we just wanted to show this. Yeah?
>> [Indiscernible]?
>> Ahmed Eldawy: Right. So I'll [indiscernible] one example now, which is the convex hull, and I'm going to compare how it is [indiscernible] in Hadoop and SpatialHadoop; this will answer your question. In the convex hull operation, we have a set of points and we need to find the minimal convex polygon that contains all these points. Now, let me show you how we implement this in both Hadoop and SpatialHadoop. If we use traditional Hadoop, then the first step is partitioning the data, and traditional Hadoop uses the default [indiscernible] way of partitioning the data, which does not take the spatial attributes into account, while in SpatialHadoop, we use the spatial partitioning, as you see here. The next step is the pruning step, which can only be applied in SpatialHadoop because it relies on the spatial partitioning. In this case, we can prune this partition, which is completely outside the answer because the convex hull will actually go around it; we have a formal rule for pruning partitions here, and we can prune the partitions that will not contribute to the answer. After this, we end up with a smaller number of partitions in SpatialHadoop here, and then both systems will compute the convex hull inside each partition. This part is parallelized in each system; however, SpatialHadoop will be much more efficient because it can reduce the number of partitions that it needs to process, thanks to the [indiscernible] spatial partitioning that we proposed. Then the final step is to take the answers of all these local convex hulls and put them on a single machine to get the final answer. So we can actually parallelize the query in both Hadoop and SpatialHadoop, but SpatialHadoop is much faster because it can prune all the partitions that do not contribute to the final answer.
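In code, the parallel part of this reduces to something like the following sketch: a local convex hull per surviving partition, then a final hull over the union of the local hulls. The pruning rule itself and SpatialHadoop's actual classes are omitted, and the monotone-chain routine here is just one standard way to compute a hull.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Sketch of the distributed convex hull: each non-pruned partition computes
    // a local hull in parallel (the "map" side), then a single machine computes
    // the hull of the union of the local hulls, which is small (the "reduce" side).
    public class ConvexHull {

      // Monotone-chain convex hull of a set of points {x, y}.
      static List<double[]> hull(List<double[]> pts) {
        if (pts.size() < 3) return new ArrayList<>(pts);
        pts.sort(Comparator.<double[]>comparingDouble(p -> p[0])
                           .thenComparingDouble(p -> p[1]));
        List<double[]> h = new ArrayList<>();
        // First pass builds the lower hull, second pass the upper hull.
        for (int pass = 0; pass < 2; pass++) {
          int start = h.size();
          for (double[] p : pts) {
            while (h.size() >= start + 2 && cross(h.get(h.size() - 2),
                                                  h.get(h.size() - 1), p) <= 0)
              h.remove(h.size() - 1);
            h.add(p);
          }
          h.remove(h.size() - 1);             // drop last point: it starts the next chain
          java.util.Collections.reverse(pts); // second pass walks right-to-left
        }
        return h;
      }

      static double cross(double[] o, double[] a, double[] b) {
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
      }

      // Merge step: hull of the union of the local hulls from all partitions.
      static List<double[]> merge(List<List<double[]>> localHulls) {
        List<double[]> all = new ArrayList<>();
        for (List<double[]> h : localHulls) all.addAll(h);
        return hull(all);
      }
    }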
>> [Indiscernible] the final answer should be small enough that it can
[indiscernible].
>> Ahmed Eldawy: Right. Right. For convex hull, this is typically the case. In other algorithms, such as [indiscernible], for example, the answer is typically larger than the input, so we have to use other techniques to do this work. I can discuss it with you offline if you are interested. So, to summarize the operations part, we showed how we can make use of the spatial indexes to speed up different types of queries, which are all implemented as MapReduce programs in SpatialHadoop. Yeah?
>> [Indiscernible].
>> Ahmed Eldawy: We just [indiscernible] in this case, yeah. So another part that we added recently -- yeah, a question?
>> [Indiscernible]. So are these now available as libraries, so that people can actually compose their larger MapReduce workflows when they're doing this, and then maybe feed it to a different part of the pipeline? Is that something?
>> Ahmed Eldawy: Yeah. Right. So typically in MapReduce, you run multiple phases of MapReduce programs. So we have the MapReduce functions and the MapReduce program to construct the [indiscernible] diagram; once you have the output, it's still stored in HDFS, so you can run other queries on top of this, or after this. Yeah. So all the operations are actually expressed as MapReduce, which means you can have multiple phases out of this. Something we [indiscernible] is actually the visualization part. In visualization, we have a huge data set and we want to generate an image that describes how this data set looks. So, for example, if you have a set of live data, we can visualize it like this; if you have a road network, you can visualize it a different way. And actually, all these different ways of visualization are already available; there are existing algorithms for these types of visualization. The main limitation is that these existing techniques cannot scale to the amounts of data that we work with in SpatialHadoop. So what we propose here is a framework, HadoopViz, which can scale out these existing visualization techniques. In HadoopViz, we are not trying to propose a new visualization technique; we are trying to scale out existing techniques so that they can work with the huge amounts of data that we can work on in SpatialHadoop. For example, this is an example of visualizing [indiscernible] points which represent the temperature in the whole world for over 60 years. You have a daily snapshot, each snapshot contains 500 billion points for the whole world, and then we have these available daily for over six years. In this case, we generate 72 frames; each frame represents the data of one month as one image, and then it shows you how the temperature has changed over time. Although you can do this on a single machine, it would take around 60 hours to generate these 72 frames, while in HadoopViz we can do it in three hours only, using ten nodes of the intranet cluster; these are actually quad-core machines. Now let me show you how we actually visualize, or generate, each one of these frames efficiently in HadoopViz. The basic idea is that we have a huge data set; we break it down into smaller parts, process each part separately, and then combine the results together to get the final image. Now, there are actually two ways we can partition the data: either the default Hadoop partitioning, which ships with traditional Hadoop, or the spatial partitioning that we add in SpatialHadoop. If we use the default Hadoop partitioning, we end up overlaying the intermediate images to get the final image. Yes, question?
>> [Indiscernible].
>> Ahmed Eldawy: So it's not like a new indexing technique; what we are showing is that these are available as MapReduce programs. This is basically how to write MapReduce programs that will generate these huge images efficiently, and we are actually using the same indexing, or partitioning, technique that we provided.
>> [Indiscernible].
>> Ahmed Eldawy: Yeah.
>> You're moving up the stack.
>> Ahmed Eldawy: Right. So this is actually at the top of the stack, so we are actually making use of the underlying components to build these visualization techniques. So, if we use the default Hadoop partitioning, we end up overlaying the intermediate images. Think of this as some transparencies: we have multiple transparencies, each one contains part of the image, and when you put them on top of each other, you get the final image. While if you use the spatial partitioning, we end up stitching the intermediate images to get the final image. Think of this as a jigsaw puzzle: you have small tiles and then you put them side by side to get the final image. Now, the challenge
is actually which one of these is more efficient. We support both in SpatialHadoop, or in HadoopViz, and actually, you can see that there is a difference between these two partitionings. The default Hadoop partitioning technique is much more efficient than the spatial partitioning, because it doesn't have to take the spatial attributes into account, but the overlay process is much more costly than the stitch process, because the intermediate images are very large, so it has to process all these big images to get the final image. So what we ended up doing is that we proposed a cost model that compares these two techniques, so that SpatialHadoop, or HadoopViz, can automatically decide which one to use based on the image size. What we found is that as the image size increases, it is better to go with the stitch process, while for small images the overlay process will be just fine.
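The actual cost model and threshold are not given in the talk, so the selection logic can only be sketched as an assumed rule of the form:

    // Illustrative decision rule; the real cost model and the threshold value
    // in HadoopViz are not specified here, so both are assumptions.
    public class PartitionChooser {
      enum Strategy { DEFAULT_PARTITION_THEN_OVERLAY, SPATIAL_PARTITION_THEN_STITCH }

      static Strategy choose(long imageWidthPx, long imageHeightPx, long thresholdPx) {
        long pixels = imageWidthPx * imageHeightPx;
        // Small image: cheap default partitioning wins, overlay cost is modest.
        // Large image: overlaying many large intermediate images dominates, so
        // spatial partitioning plus stitching wins.
        return pixels <= thresholdPx ? Strategy.DEFAULT_PARTITION_THEN_OVERLAY
                                     : Strategy.SPATIAL_PARTITION_THEN_STITCH;
      }
    }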
Based on this, we could implement an algorithm for each of these visualization types, and if you want to add a new visualization type, you would reimplement the algorithm for it. [Indiscernible], this is not really good, because we'll end up with a huge number of implementations of the same algorithm, which is not good for a product, because we cannot maintain all these implementations. So what we ended up doing is that we proposed just one algorithm that can support all these different types of visualizations, and if you add a new visualization type, you don't have to redefine this algorithm. The second thing is that we provide a visualization abstraction. This abstraction contains five functions, and these five functions can be defined by the user, or by the image designer, to define the visualization logic. So you can define these five functions in one way to visualize [indiscernible] data, and then you can define them in a different way to visualize [indiscernible]. So now we have a separation of roles: the image designer will focus on defining these five functions, while the role of HadoopViz is to generate the image at scale. So [indiscernible] is just the scalability issue; we don't have to worry about the visualization logic.
Now I will show you how the abstract algorithm looks at a high level, and then I'm going to show some examples of how we can define these five functions. The first step is to partition the data, and we mentioned the two ways we can partition the data. After this, we call the smooth abstract function, which is defined by the user; this basically takes nearby records in each partition and tries to fuse them together to get a better-looking image, depending on the visualization logic. Then we call the create canvas abstract function, which initializes an image structure that can be used for plotting or visualization. Then we call the plot function, which takes one record at a time and updates this canvas to draw, or plot, the record on the canvas. Again, this is an abstract function defined by the user, and we call it in parallel to visualize all the records. After this, we call the create canvas method again to create the final canvas that will hold the final image. Then we call the merge function, which is defined by the user, to merge all these partial images in parallel into the final canvas. And finally, we call the write function, which takes this canvas and writes the final image that is displayed to the user.
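As a sketch of what such an abstraction could look like in Java (the interface name, type parameters, and signatures are my own illustration, not HadoopViz's actual API; MBR is the hypothetical rectangle class from the earlier sketches):

    import java.util.List;

    // Illustrative visualization abstraction: the image designer implements these
    // five functions, and the framework runs them at scale over the partitions.
    public interface Visualizer<R, C> {     // R = record type, C = canvas type

      // Fuse nearby records in one partition before plotting (e.g. interpolate
      // missing values); may simply return its input if no smoothing is needed.
      List<R> smooth(List<R> partitionRecords);

      // Initialize an empty canvas covering the given region at the given size.
      C createCanvas(int widthPx, int heightPx, MBR region);

      // Draw one record onto the canvas.
      void plot(C canvas, R record);

      // Merge a partial canvas (from one partition) into the final canvas.
      void merge(C finalCanvas, C partialCanvas);

      // Encode the final canvas as an image and write it out.
      void write(C canvas, java.io.OutputStream out) throws java.io.IOException;
    }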
So let me show you some examples of how we have defined these five functions. Let's say we want to visualize satellite data. The first function is the smooth function, which will take part of the data and try to estimate missing points. As you can see here, there are missing points due to the clouds, and this function applies a two-dimensional interpolation technique to estimate these missing points. Then the create canvas function will initialize a 2D matrix with all zeros; its entries represent the different pixels of the image we want to generate. Then the plot function will take one record and, depending on the location of this record and its temperature, update some entries in this matrix. The merge function is as simple as matrix addition, so it just adds the two matrices up, entry by entry. And finally, the write function will map each entry in this matrix to a color in the image and then write the final image to the output. So, once we define these five functions, we can plug them into the abstract algorithm to generate this image at scale.
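Staying with the hypothetical Visualizer interface sketched above, the temperature example might be filled in roughly like this; the record layout {x, y, temperature} and the crude color mapping are assumptions.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.List;
    import java.awt.image.BufferedImage;
    import javax.imageio.ImageIO;

    // Rough sketch of the temperature example: the canvas is a 2D matrix of
    // temperatures, plot fills entries, merge adds matrices, and write maps
    // each entry to a color.
    public class TemperatureVisualizer implements Visualizer<double[], float[][]> {

      private final MBR world;               // region covered by the image
      public TemperatureVisualizer(MBR world) { this.world = world; }

      public List<double[]> smooth(List<double[]> records) {
        return records;   // a real version would interpolate cloud-covered gaps
      }

      public float[][] createCanvas(int w, int h, MBR region) {
        return new float[h][w];              // all zeros initially
      }

      public void plot(float[][] canvas, double[] r) {
        int h = canvas.length, w = canvas[0].length;
        int px = (int) ((r[0] - world.xmin) / (world.xmax - world.xmin) * (w - 1));
        int py = (int) ((r[1] - world.ymin) / (world.ymax - world.ymin) * (h - 1));
        px = Math.min(w - 1, Math.max(0, px));
        py = Math.min(h - 1, Math.max(0, py));
        canvas[py][px] = (float) r[2];       // store the temperature at that pixel
      }

      public void merge(float[][] finalCanvas, float[][] partial) {
        for (int y = 0; y < finalCanvas.length; y++)
          for (int x = 0; x < finalCanvas[y].length; x++)
            finalCanvas[y][x] += partial[y][x];      // matrix addition
      }

      public void write(float[][] canvas, OutputStream out) throws IOException {
        int h = canvas.length, w = canvas[0].length;
        BufferedImage img = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++)
          for (int x = 0; x < w; x++) {
            int red = Math.min(255, Math.max(0, (int) canvas[y][x])); // crude color map
            img.setRGB(x, y, red << 16);
          }
        ImageIO.write(img, "png", out);
      }
    }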
Similarly, if we need to visualize a road network, which is a totally different
type of visualization, we can still express it using these five functions. And
if users want to add a new visualization technique, all they have to do is
define these five functions, and then HadoopViz runs them automatically to
generate the image at scale. So users don't have to worry about scalability
anymore. All of what I have described so far is a way to generate a single
frame, one image. Another kind of visualization is what we call multilevel
images. With multilevel images, you actually generate an image where users can
zoom in and out to see more details of a specific area. This is similar to
Bing Maps, for example, where you have a very high-level image and then, as you
zoom in, you can see more details of a specific area. However, in our case, we
are generating these images for a specific data set. So you have your own data
set and we want to generate this image for it. In this example, we have a data
set of about two gigabytes, and although this data set is not really huge, due
to the huge size of the underlying image, it might take one hour on one machine
to generate it, while in HadoopViz we can do it in only two minutes.
I'm not going to go into all the details here, but the basic idea is to
generate this pyramid of image tiles; by generating more zoom levels, we allow
the user to zoom in further to get more details. We partition this pyramid
into smaller parts and then generate each part separately.
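To give a feel for the pyramid structure, here is a small sketch of how a record's location could be mapped to a tile at each zoom level, assuming the data space is normalized to the unit square and level z is a 2^z by 2^z grid of tiles; this is just standard tile arithmetic for illustration, not the actual HadoopViz partitioning code.

    // Map a normalized point to its tile at each zoom level of the pyramid (illustrative sketch).
    public final class TileId {
        public final int z, x, y;   // zoom level and tile column/row
        public TileId(int z, int x, int y) { this.z = z; this.x = x; this.y = y; }
        @Override public String toString() { return "tile-" + z + "-" + x + "-" + y; }

        // px, py are coordinates normalized into [0, 1); level z has 2^z x 2^z tiles.
        public static TileId tileOf(double px, double py, int z) {
            int n = 1 << z;
            int tx = Math.min((int) (px * n), n - 1);
            int ty = Math.min((int) (py * n), n - 1);
            return new TileId(z, tx, ty);
        }

        public static void main(String[] args) {
            double px = 0.73, py = 0.21;    // a sample normalized point
            for (int z = 0; z <= 3; z++)    // one tile per zoom level
                System.out.println(tileOf(px, py, z));
        }
    }

A MapReduce job can then group records by tile id and render each group of tiles independently.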
So we have a cost model to compare the different algorithms for generating this
pyramid, and then we can parallelize the work using multiple -- yeah?
>> Can you realize [indiscernible]?
>> Ahmed Eldawy: Sorry?
>> Can I pose a query [indiscernible]?
>> Ahmed Eldawy: Yeah. So you can definitely do this. If the result has some
spatial attribute that you want to visualize, you can definitely do this, yeah.
So, as I mentioned, all the queries that we run in SpatialHadoop run as
MapReduce programs, and visualization is the same case: the visualization is
just a MapReduce program. So the output of one MapReduce program can be fed as
the input to the next MapReduce program.
>> Basically all of the -- in some sense, all of these blocks of code are
composing through HDFS, through reading and writing --
>> Ahmed Eldawy: Right, right. So the input is HDFS and the output is also
HDFS, which means you can feed the output of one query as the input to another
query.
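To make the composition concrete, here is a minimal sketch of chaining two Hadoop jobs through HDFS, where the output directory of the first job becomes the input directory of the second; the paths and job names are placeholders, and the mapper and reducer classes for the actual operations are omitted.

    // Chaining two MapReduce jobs through HDFS (paths and job names are placeholders).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Pipeline {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path("hdfs:///data/points");
            Path intermediate = new Path("hdfs:///tmp/query-result");
            Path output = new Path("hdfs:///out/visualization");

            // First job, e.g., a spatial query; its output is written to HDFS.
            Job query = Job.getInstance(conf, "spatial-query");
            FileInputFormat.addInputPath(query, input);
            FileOutputFormat.setOutputPath(query, intermediate);
            if (!query.waitForCompletion(true)) System.exit(1);

            // Second job, e.g., visualization; it reads the first job's output from HDFS.
            Job viz = Job.getInstance(conf, "visualize");
            FileInputFormat.addInputPath(viz, intermediate);
            FileOutputFormat.setOutputPath(viz, output);
            System.exit(viz.waitForCompletion(true) ? 0 : 1);
        }
    }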
>> [Indiscernible].
>> Ahmed Eldawy: You can still do it. So, let's say you build a
[indiscernible] diagram, for example, and then index this [indiscernible]
diagram. That's something you can do.
>> You just have to pay the cost of the indexing in the pipeline, however long
it takes to build the index.
>> Ahmed Eldawy: Right. Right. Yeah. So if you are expecting to query, let's
say, the [indiscernible] diagram or the [indiscernible] multiple times, maybe
you can build an index on it so that you have the data indexed.
>> How expensive is indexing the data compared to like doing a scan of the
data?
>> Ahmed Eldawy: In terms of query processing? Or --
>> The latency of the end-to-end time. Suppose I want to issue some sort of
query that looks at the entire data set in Hadoop -- maybe the query isn't
fundamentally complicated, I just want to look at all the data -- versus the
cost of producing an index. Is the index much more expensive to compute, or
about the same, or a little more expensive?
>> Ahmed Eldawy: So if we are talking about a map-only program, where you can
just scan the data and write the output, it will be much more efficient than
indexing, because indexing involves shuffling the data from mappers to
reducers. But if you are writing a complete MapReduce program, it's not much
different, because at the end indexing is just a scan of the data that has to
reshuffle the records.
>> So it's the shuffling part that's expensive, not really the index
construction.
>> Ahmed Eldawy: Right. Exactly. Yeah. So I'm a little bit over time, so I'm
going to go quickly over the applications that we built using SpatialHadoop.
Basically, what I want to show here is that SpatialHadoop can be used in real
applications, and if nothing else, we are actually using it ourselves to build
these applications. There are also examples of other universities who are
using SpatialHadoop to build their own applications. Our first application is
called Shahed. It's a system for querying and visualizing spatio-temporal
satellite data. You can select any area in the world at this web URL and then
visualize it or run some aggregate or selection queries on it.
Another application uses OpenStreetMap data. It extracts data out of the
500-gigabyte OpenStreetMap data set and indexes it, and this data set becomes
available for users to query with just a few clicks. Another application is
called MNTG, which is a traffic generator: you can go to this URL, select any
area in the world, and the system will simulate moving objects on the
underlying area, which can be used for testing and benchmarking your
applications. And now, in the next part, I'm going to go
over some related work to SpatialHadoop and some experimental results. So,
actually, we are not the only player in the area of big spatial data. There
are many other systems that are trying to extend existing big data frameworks
to support spatial data. And if I want to differentiate between SpatialHadoop
and all these systems in just one statement, I'll say that SpatialHadoop is the
only system among all of these that extends the core of Hadoop to make it
available to MapReduce developers. All the other systems are somewhat rigid in
terms of the types of indexes they support or the types of queries they
provide, while SpatialHadoop is very extensible: you can add your own spatial
indexes, your own spatial operations, or your own visualization, which makes it
very attractive for researchers and developers to add their own techniques,
because you don't have to stick with the set of operations or indexes that
ships with SpatialHadoop.
Yeah?
>> Do you [indiscernible] indexes or do you support extensible indexing? These
are very different.
>> Ahmed Eldawy: We support extensible indexes.
>> So you have [indiscernible] people can build their own indexes.
>> Ahmed Eldawy: Right. There are actually existing projects where people are
adding spatial indexes to SpatialHadoop. So they don't have to go through the
whole pipeline of writing the whole algorithm for indexing. You just have to
define an abstract interface, and this defines how the index will be built. So
it's like one class that you will implement.
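To illustrate what such a one-class extension point could look like, here is a hypothetical sketch of a spatial partitioning interface; the real SpatialHadoop class names and method signatures may differ.

    // Hypothetical sketch of an extensible spatial-partitioning interface.
    // The actual SpatialHadoop interface may use different names and signatures.
    import java.util.List;

    public interface SpatialPartitioner<R> {
        // Compute partition boundaries from a sample of the data, given the
        // desired number of partitions (e.g., grid cells, R-tree leaves, ...).
        void buildFromSample(List<R> sample, int numPartitions);

        // Return the id(s) of the partition(s) a record belongs to; a record
        // with extent (e.g., a polygon) may overlap several partitions.
        List<Integer> overlappingPartitions(R record);

        // Number of partitions produced, so the framework can create one
        // reducer and one output file per partition.
        int numPartitions();
    }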
>> Maybe a related question. So if you modify the core of Hadoop and the core
of Hadoop evolves, do you have to adjust SpatialHadoop? Because the advantage
of only sitting on top is that when the open source community produces new
versions of Hadoop, you don't have to --
>> Ahmed Eldawy: Well, yeah. So that's a good point. So actually, our current
implementation ships as an extension to Hadoop. However, it overrides some
internal classes of Hadoop, which means that, yeah, if they change the
underlying architecture totally, in a way that's not compatible with the
current architecture, then SpatialHadoop will have to be modified somehow.
However, we still mostly extend or expand existing classes, so it can actually
run with different distributions of Hadoop and we don't have to stick with a
specific distribution. There are examples of people running it on Hortonworks,
which is another distribution of Hadoop, or on Cloudera Hadoop. So it can run
across different distributions, although we base it on Apache Hadoop.
>> So you're saying that enough people have taken a dependency on these things
that you have extended that it's probably difficult to change them at this
point. It would break a lot of people's hearts.
>> Ahmed Eldawy: Right. Yeah.
[Laughter]
>> Ahmed Eldawy: I think with every version of this we are trying to make it
easier for other people. Yeah. So, let me -- okay. Yeah, question?
>> [Indiscernible]?
>> Ahmed Eldawy: That's a good point. Currently, SpatialHadoop sticks with the
limitations of HDFS, which means that you cannot modify files in HDFS. So
currently the indexes are static: once you write an index, you cannot modify
it. You can of course rebuild it if you have new data, but you can't modify
it. We have an extension, which I didn't describe here, called ST-Hadoop,
which works only for spatio-temporal data. So if you are expecting more data
to arrive with recent timestamps -- for example, Twitter data, where each day
you have a new batch of records with a new timestamp -- we support some kind of
update for the existing index, but this is limited to that kind of
spatio-temporal data. In general, if you have a spatial index and need to add
more records to it, this is currently not supported. This is one of the open
problems that we are thinking of approaching in the future. Yeah. So let me
quickly show some examples of
experimental results. This shows the performance of range query. As we
increase the input size from one gigabyte to 128 gigabytes, the throughput of
Hadoop keeps decreasing because it has to scan the whole file, while
SpatialHadoop can limit the amount of data that is processed using the spatial
indexes, so it can keep its performance. Another example is spatial join, and
this actually gives some motivation for the different spatial indexes that we
support in SpatialHadoop. You can see how the same query runs with different
performance numbers depending on which spatial indexes are used for the two
files, which motivates people to add more indexes that can be more efficient
for specific queries. This part is the computational geometry work, showing
how we can speed up different kinds of spatial or [indiscernible] operations in
SpatialHadoop. Finally, this is the visualization part, showing how we can
speed up different kinds of visualization types by just defining these five
abstract functions. In the
next part, I'm going to go very quickly over other research projects that I
have been working on. One of them is Hekaton, which was here at Microsoft
Research with Paul Larson and Justin. Basically, it was motivated by the huge
drop in memory prices; I'm pretty sure you are familiar with it. The project
that I did was how to differentiate between hot data and cold data and migrate
the cold data to a cold store, so that we can reduce the memory footprint and
keep the performance. This was actually integrated with SQL Server. Another
project is Quill, which I did with Badrish and Jonathan. In this project we
use Trill as a single-machine stream processing engine and provide an API that
can parallelize this work in the cloud. Another project is NADEEF, which I did
at QCRI. This is an extensible data cleaning system: you have some dirty data,
you define cleaning rules, and the system applies these user-defined rules to
clean the data. Another system is called Sphinx, which I did very recently.
You can think of it as a spatial Impala. Impala is a parallel database
designed by Cloudera, and Sphinx reuses the spatial indexes that we built in
HDFS by pushing the spatial query processing into Impala and extending it to
support these kinds of spatial indexes. So this is mainly for SQL processing
of big spatial data.
So in the last part, let me go quickly over some future plans for my research.
Some of them are actually open problems from the current work that I did. For
example, in the visualization part, we can think of extending the abstraction
to support other types of visualization, such as 3D visualization. We can also
migrate the image generation process to other systems such as [indiscernible]
or Storm so that we can support real-time visualization. Another direction is
to expand the query processing engine so that we can make use of the spatial
indexes that we built in SpatialHadoop, and this will allow us to support
[indiscernible] queries. Another is to extend the spatial indexes so that they
can support dynamic data, where we can add more data -- related to the question
you asked. The current indexes are static, but in the future we can modify or
expand these indexes so that we can
add more data to the index. For long-term plans, we actually have an idea of
building a unified big data interface on top of the big data frameworks. This
is motivated by the huge number of open source big data frameworks such as
Hadoop, Spark, and Impala. A question that I usually get from users is which
one of these systems they should use for a specific application, and it's very
challenging to answer. First, there is an overlap in the functionality of
these systems; for example, Impala, Spark SQL, and [indiscernible] can all
support SQL query processing. And there's no clear winner in terms of
performance between these systems. For example, the paper that introduced
Spark SQL last year at SIGMOD compared it to Impala and showed that Impala can
be faster than Spark SQL. Another paper to be published [indiscernible] later
this year compares Hadoop to Spark and shows that, although Hadoop is very old
and thought to be quite bad, there are still cases where Hadoop can be faster
than Spark. So it's very challenging to decide which system to use. But the
good point is that we don't have to choose just one system, because we can
actually run all these systems together in the same physical cluster: all of
them support HDFS as a file system and support [indiscernible] as a resource
manager, which means all of them can coexist in the same cluster. So my plan
is to build a unified abstraction on top of them, where users express their
queries in a unified language and the system applies a cost model or some rules
to choose which underlying system should be used for each specific query. For
example, if it's a [indiscernible] query, it can run on Spark; if it's
[indiscernible], on Hadoop. We can also apply query optimization, taking the
query and optimizing it for the specific application or for the specific
underlying framework we choose, or we can apply some advanced query execution
techniques. For example, we can take a huge or complex query, break it down
into smaller parts, run each part on a different framework, and combine the
results back together. With this, we can actually provide a better experience
for users to choose between all of these different systems.
Another long-term project that I can work on is geographically distributed
clusters. This is motivated by systems that generate a huge amount of user
content. Think about Facebook or Twitter, where people all around the world
create a huge amount of content. If you store all this data in one datacenter,
it cannot scale with the amount of data being stored. So what is currently
done in these systems is that there are different datacenters geographically
distributed around the globe, and each one stores data from nearby users. This
works fine for storing and retrieving the data, but when it comes to analyzing
it, what they currently do is ship all the data they want to analyze into one
datacenter and do all the query processing there. This can work in some cases,
but in other cases it's better to do part of the query processing at the
different datacenters and then combine the partial results together in one
datacenter to get the final answer.
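As a toy illustration of that idea for a distributive aggregate such as a count, each datacenter can compute a small partial result locally and only those partial results cross the wide-area network; the datacenter names and the local counting step are assumptions for illustration.

    // Toy sketch: combine partial aggregates computed in each datacenter.
    import java.util.Map;

    public class GeoDistributedCount {
        // Each datacenter counts its matching records locally; only these small
        // partial counts are shipped to the coordinating datacenter.
        public static long combine(Map<String, Long> partialCountsByDatacenter) {
            long total = 0;
            for (long partial : partialCountsByDatacenter.values()) total += partial;
            return total;
        }

        public static void main(String[] args) {
            Map<String, Long> partials = Map.of("us-west", 1_200_000L, "europe", 950_000L, "asia", 2_400_000L);
            System.out.println("total = " + combine(partials));
        }
    }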
The challenge is that if you want to run Hadoop or Spark in a geographically
distributed environment like this, these systems are not designed for that kind
of query processing. They might keep moving or shuffling data between machines
that are far apart geographically -- shuffling data from Europe to the U.S.,
for example -- which will reduce the query performance. And what makes it even
more challenging is that there is a [indiscernible] between the cost and the
processing time. So what we want to do here is to take one of the big data
frameworks, it could be Spark or Hadoop, and modify it so that it takes into
consideration the geographic locations of the machines; we try to optimize the
query and balance the cost and the processing time. So this is the last part
of my talk.
Let me quickly summarize the talk and then I will be happy to get your
questions. So I've described the different spatial indexes that we support in
SpatialHadoop. Then I described different operations such as range query,
spatial join, and computational geometry operations. Then I described the
visualization part, where we can visualize data in both single-level and
multilevel images. I also described some applications that we built using
SpatialHadoop, went over some other projects that I worked on, and described my
short-term and long-term research plans, such as the unified big data API and
geographically distributed clusters. So, with this, I'll end my talk here
and I'll be happy to get your questions. Thank you all for listening.
[Applause]
>> Ahmed Eldawy: Any questions?
>> [Indiscernible] data set that you have tried?
>> Ahmed Eldawy: So I'll give you two examples of the largest data sets that
we are working with. For vector data, where we have points and polygons, we
worked with up to five terabytes of data. For the raster data, which is the
satellite data, we worked with up to 30 terabytes of data. The whole
[indiscernible] archive is about one petabyte of data, but we didn't have the
capacity of machines to store all of it, so I took a sample of this data,
around 50 terabytes, and worked with it. It is the largest I could get on the
machines that I have.
>> And what did you find? Did it scale well in those cases?
>> Ahmed Eldawy: Yeah. Up to these numbers, it scales very well. So almost
linear with these data sets.
>> And at the [indiscernible]-terabyte level, did you see the performance
scaling you'd expect -- you'd hoped for? I shouldn't say expect, but that you
hoped for when you go from tens of terabytes to a petabyte.
>> Ahmed Eldawy: I didn't try, but it would be very interesting to try. Like,
I didn't have the resources to try the one petabyte archive. I don't have the
resource to do this, but it would be very interesting to try it. Yeah.
>> [Indiscernible].
[Laughter]
>> Do these systems run IO bound or CPU bound or message bound or what exactly
[indiscernible] performance here?
>> Ahmed Eldawy: That's a good point, a good question, because it actually
depends on the query that we are running. Some queries are I/O bound, such as
spatial join, for example; it depends on how much data we need to read from
disk. Other queries, such as computational geometry operations, especially
[indiscernible] triangulation, which is very complex, are actually CPU bound;
most of the time is spent computing [indiscernible] and connecting them
together. So it depends on the query. For visualization, we found that it's
actually network bound; it depends on the communication between the different
parts.
>> So I could ask a Trill question, which is 30 terabytes, you might run on a
single machine.
>> Ahmed Eldawy: If you have --
>> [Indiscernible] scale, yeah.
>> Ahmed Eldawy: If you have, yeah.
>> Yeah, so there are systems like that where you run 30 terabytes. Would
you -- what would happen if you tried that, number one? And number two, does
that suggest perhaps other approaches to the problem other than sort of
[indiscernible]?
>> Ahmed Eldawy: That's a good point. So, again, it depends on the type of
query. Some queries can run by scanning the data; in this case, you could run
it on a single machine if you have 30 terabytes of data. Other queries
actually require all the [indiscernible] to be in memory for processing, right?
So it needs to look at all the data. For example, if you're computing a
[indiscernible] diagram and you use the typical single-machine algorithm, you
actually need to fit it in memory because it has to keep merging the data
together. For the type of query processing that is done in Trill, well, let me
think about it. If we use Trill as is, I think it can work with spatial data.
But the good question is how we can improve Trill to work better with spatial
data, right? That could be a good direction. For example, storing the data
itself: in Trill, you have to store the data first and then read it, so it has
to be stored somehow. If you have control to store it more efficiently, then
we can actually reuse some of the parts that we did here, for example the
spatial indexing or spatial partitioning. It can still be applied in a system
like Trill or Quill because it's distributed. So, while you store the data in
Azure blobs, we can make use of the spatial partitioning that we do, so that
when we do the query processing we can take advantage of it. One more thing I
can think of for Trill is that it somehow scans the data, right, so you have
batches and then you process each batch separately. A good direction is that
if you can have spatial boundaries for each batch, then this can significantly
speed up the query processing: instead of just getting a batch of records and
processing them, you get a batch and you also know the spatial boundaries of
this batch. This can significantly speed up the query processing. So parts of
this work can actually be migrated to a system like Trill or Quill.
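As a sketch of that last point, if each batch carries a spatial bounding box, a range query can skip entire batches whose box does not intersect the query rectangle; this is only an illustration of the idea, not Trill's or Quill's actual API.

    // Illustrative sketch: prune whole batches using a per-batch bounding box.
    import java.util.ArrayList;
    import java.util.List;

    public class BatchPruning {
        public static class Rect {
            final double x1, y1, x2, y2;
            Rect(double x1, double y1, double x2, double y2) { this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2; }
            boolean intersects(Rect o) { return x1 <= o.x2 && o.x1 <= x2 && y1 <= o.y2 && o.y1 <= y2; }
        }

        public static class Batch {
            final Rect mbr;              // minimum bounding rectangle of all records in the batch
            final List<double[]> points; // records, here just (x, y) pairs
            Batch(Rect mbr, List<double[]> points) { this.mbr = mbr; this.points = points; }
        }

        // Range query: look inside a batch only if its bounding box overlaps the query.
        public static List<double[]> rangeQuery(List<Batch> batches, Rect query) {
            List<double[]> result = new ArrayList<>();
            for (Batch b : batches) {
                if (!b.mbr.intersects(query)) continue;   // skip the whole batch
                for (double[] p : b.points)
                    if (p[0] >= query.x1 && p[0] <= query.x2 && p[1] >= query.y1 && p[1] <= query.y2)
                        result.add(p);
            }
            return result;
        }
    }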
Any other questions?
>> You talk about spatial [indiscernible]?
>> Ahmed Eldawy: Yeah. So far we work with just planar data. So if we are
using geographical data like [indiscernible], we just assume that it's flat on
a map. We didn't take into account, for example, that the data is on a sphere.
But this could be an open problem that we work on in the future. For example,
in this case we could use a hierarchical triangular [indiscernible] instead of
the normal partitioning, because it's more efficient or more suitable for this
kind of data. But so far, we just work on planar data.
>> [Indiscernible] that I have to support.
>> Ahmed Eldawy: We didn't have this situation. But it's very interesting. We
can work with it in the future.
>> Badrish Chandramouli: All right. Let's thank the speaker.
[Applause]