>> Badrish Chandramouli: So, it is my pleasure to introduce Ahmed Eldawy from the University of Minnesota. He's Mohamed Mokbel's student. He has been working on SpatialHadoop for some time now, and this work has had phenomenal impact across the open source community, with multiple downloads and real customers. So it's very interesting. And apart from that, he had two successful internships at MSR as well as an internship at IBM Watson. So, let's get started. Thank you. >> Ahmed Eldawy: Okay. Thanks, Badrish, for the introduction. And thank you, everyone, for taking the time to attend the talk. My talk will mainly be about the work that I have been doing during my Ph.D., which is around SpatialHadoop and MapReduce for big spatial data. Before getting into this, I'd like to first give an overview of my Ph.D. journey. Since my work is mostly related to spatial data, I prefer to show it on a map like this. I got my bachelor's and master's at Alexandria University. At that time, I also cofounded BadrIT, a software company headquartered in Alexandria, which is, by the way, still running and still growing since I started it. Then I moved to Minnesota, where I did my master's and started my Ph.D. program. In the first year, I worked with other senior students in my group on problems related to recommendation systems. I also got the chance to participate in the SIGMOD programming contest that year; I was actually selected as one of the finalists and presented my work at SIGMOD that year. After this, I had a summer internship at IBM, where I worked on clustering of streaming graphs. After this, I returned to Minnesota for just one semester, where I worked on Sindbad, which was a location-based social network project that we started in Minnesota at that time.
After this, in the spring, I moved to QCRI, where I had an internship working on data cleaning, on a system called NADEEF. Then I came here to Microsoft Research for my first internship, where I worked on Hekaton with Paul and Justin. Then I went back to Minnesota, where I spent one year. Actually, in that year, I released the first version of SpatialHadoop and published a couple of papers about its different aspects and components. After this, I went to the GIS Innovation Center in Saudi Arabia, where I spent one year helping start the center [indiscernible]. There were a couple of projects there related to my Ph.D. topic, and they are still working on these projects now. After this, I came back to Microsoft Research for my second internship, and this time, I worked with Badrish and Jonathan on Quill, which is a distributed processing engine for streaming data. Then I finally returned home to Minnesota, where I continued the work I started in SpatialHadoop and published other work about it, including the visualization [indiscernible] that I'm also going to describe in this talk. Also during this time, I had a short trip to the University of California, Irvine. This was a one-month visit to Mike Carey's group there, working on AsterixDB. The final outcome of my Ph.D. was quite prolific from all these trips and visits. From my Google Scholar stats, I have more than 550 citations for my work, and I published a number of journal and conference papers in top data conferences and journals, conducted some tutorials about the work I have been doing, and had the chance to collaborate with more than 35 collaborators from six different institutes.
So without further ado, let me start describing the work that I have been doing in my Ph.D. Before getting into the technical details, I would like to first tell a story. This story is between the people in computer science and the people in geography, and it started a long time ago, when people started to create maps. Geographers created different maps, bigger maps for the whole world, with different projections, for different applications such as medical applications or urban planning, and this went on for a long time. Until last century, something happened: the computer was invented. With computer technology, it changed the way people think about their work. Then we had these two parallel worlds for some time, and each one worked on its own, until at some point this guy from geography thought about using the computer. So, he said, I have big data and I need help. It was big data at that time, though not as big as we know it today. So he spoke to the computer science guy and told him, cool computer technology. Can I use it in my application? And of course, people in computer science are known to be very nice. So he said, my pleasure, here it is. But it was not that easy. There was a problem. He said, it's not made for me and I can't make use of it as is. So, we have a gap between these two worlds, and someone needs to step in and fill this gap. This is Jack Dangermond, who founded ESRI. He talked to the computer science guys and told them, let me take the technology you have; and to the geographers, let me understand your needs. Then he sat with his friends and founded ESRI as one of the first companies to work with geographical data on computers. And now we have three parallel worlds, and as time went on, each one started to grow on its own.
This is Jim Gray, who was one of the main founders of System R, which seeded the database management system field, and [indiscernible] successfully used in many open source and commercial software products and many applications as well. At the same time, Jack Dangermond continued his work at ESRI and released the first version of ArcGIS running on PC. In the geography world here, [indiscernible], who became a professor at Santa Barbara, made use of these new technologies to advance the field of geography. Again, this went on for some time, until at some point this guy from geography started to scream out again. He said, I have big data again, and the old technology is not helping anymore. So Jack [indiscernible] his connection with the computer science world, so he said, let me check with my good friend there. He talked to this guy and told him, cool database technology. Can I use it in my application? Again, the nice guy said, my pleasure, here it is. But the same problem happened again. He said, it's not meant for me and I can't make use of it as is. So now, again, we have this gap between these two worlds, and someone needs to step in and fill this gap. This is [indiscernible], a professor at Maryland, who talked to the database guys and told them, let me take the technology you have; and to the geographers, again, let me understand your needs. And then he founded the area of spatial databases. And now we have four parallel worlds. Again, over time, each one of these guys had his own success stories. Jim Gray got the Turing Award for his work in databases, which was used in many commercial and open source software products. [Indiscernible] became a very successful professor at Maryland, and much of the work in spatial databases was used in many of these commercial software products. Also, Jack Dangermond is currently a multimillionaire, and his company is the number one company in the area of GIS and geographic processing on computers.
Also, Mike Goodchild is currently an emeritus professor at Santa Barbara, and he founded and directed the first national center for GIS for 20 years. He, again, made use of all this technology to advance the field of geography. This again went on for some time, until earlier this century, something happened in the computer world. This is Jeff Dean from Google, who invented MapReduce. He became very successful; he did the initial engineering for this work, and it was used in many other software and research projects in different areas and in other companies. At the same time, with [indiscernible] technology, smartphones, and the new satellites, people in geography started to have unprecedented amounts of data that they wanted to process. So, again, this guy from geography started to scream out. He said, again, I have big data. This technology is not helping me anymore. This time, Jack knew the limitations of traditional DBMSs, so he said, it seems like this version of the DBMS cannot scale anymore for these applications. But he still maintained his connection with the computer world, so he said, let me check with my other good friends there. Now, he talked to Jeff Dean. He told him, cool big data technology. Can I use it in my application? Now, what do you think Jeff would say at this point? He said, my pleasure, here it is. But the same problem happened again. He said, it's not made for me and I can't make use of it as is. Now, what do you think should happen? Someone needs to step in and fill the gap between these two worlds. Someone needs to talk to the big data guys and tell them, let me take the technology you have; and to the people in geography, let me understand your needs. We believe that SpatialHadoop is the system that fills this gap, and this is what I'm going to describe to you in this talk. So, the story that I have just described is all about geographical data.
However, there are many other examples of spatial data that need to be processed. For example, think about [indiscernible] blogs or tweets. These are terabytes of data, of points or records, and we need this big data technology to process them. Or medical data, for example, like brain images or X-ray images, which again is [indiscernible] bytes of data, and we need a way to process it efficiently with this big data technology. Smartphones, sensor data, satellite data: all of these are very big data sets that we need to process. So when we started SpatialHadoop, we thought about using Hadoop because it was the state of the art at that point. We thought about using Hadoop to process spatial data, and we asked ourselves a question: can we actually use Hadoop to process spatial data? The simple answer is yes. For example, you can express a range query in Hadoop like this, and it works fine. If you have 60 gigabytes of data and 20 machines, it can process it within 200 seconds. However, the real question is, can we do better than this? What we ended up doing is we [indiscernible] spatial data support into Hadoop to get SpatialHadoop, where we can express the same query in a more efficient way. But it's not only about a better way to express the query; it's not only about the language. It also employs efficient query processing and efficient indexing, which allows it to run the same query in just two seconds, which is two orders of magnitude better performance. You have a question? >> [Indiscernible] especially given that there's a lot of prior work on building parallel spatial databases, and almost all commercial databases [indiscernible] spatial support. Why not just use a parallel database version of that? Are you going to talk a little more about the difference between that and Hadoop? >> Ahmed Eldawy: I can compare Hadoop to parallel databases. That's a good question.
Like, why not just use a parallel database for this? The same question could be asked for Hadoop itself, right? Why was Hadoop out there? The answer is that the use cases of Hadoop and parallel databases are a little bit different. In a parallel database, you have structured data, and you spend time loading the data into the parallel database, build some indexes on it, and then run SQL queries on it. That's one use case. The other use case is Hadoop, where you don't have to pay the overhead of loading the data, but you run other types of queries. For example, in Hadoop, you write MapReduce programs, and with MapReduce programs, you can express other types of queries that usually cannot be expressed in SQL. So it's a little bit different use case. What we're providing here is: if you are already using Hadoop and you are processing spatial data and you are looking for a way to improve your queries, then you can use SpatialHadoop. So it comes for people who are already used to Hadoop [indiscernible] Hadoop rather than a parallel database, and who want to process spatial data. Yeah? >> [Indiscernible]? >> Ahmed Eldawy: HadoopDB, for example, right, where you can use instances of a database instead of the internal file system. Yeah, so it depends on the application or the use case you're looking at. That works for some use cases, but there are other use cases where you want to write a MapReduce program. If you use HadoopDB, for example, you still write a SQL query, right? It's just another way to build a parallel database that relies on Hadoop. So, depending on your use case: if you are interested in writing SQL programs for spatial data, then you should go for a spatial parallel database. If you are interested in writing MapReduce programs that analyze big data, then it's better to use SpatialHadoop. Depending on your use case, you should decide which system to use.
So what we propose here is SpatialHadoop, which is a [indiscernible] of Hadoop that works with spatial data in a much more efficient way. Again, it's not just about the high-level interface; it actually modifies the core of Hadoop itself to work better with spatial data, and this is what I'm going to describe in this talk. At a high level, SpatialHadoop is an open source project. We released it in February 2013, and within one year, it was downloaded more than 80,000 times and got the attention of both industry and academia. There are many big companies that approached us and showed their interest in SpatialHadoop, and there are also many students, both undergraduate and graduate, in universities worldwide who are interested in SpatialHadoop and are using it in their research projects and dissertations. We also had the chance to collaborate with different universities on research projects related to SpatialHadoop. I'm just going to mention one example, which is the Cornell Lab of Ornithology. Although these are not computer science people, they still approached us and showed their interest in using SpatialHadoop for a research problem they are working on. Other than this, I have also conducted more than five keynotes, tutorials, and invited talks about SpatialHadoop and my work on big spatial data. And beyond the open source project that we released, we also released more than 500 gigabytes of public data sets, which people can use for benchmarking and testing their applications and systems. Also, most recently, and after a year-long process with the Eclipse Foundation, the Eclipse Foundation incubated SpatialHadoop as an open source project. It will continue as an open source project, but it will have additional community support and more credibility, so big companies can use it.
It has been renamed to GeoJinni for legal purposes, but it's essentially the same project. So [indiscernible] SpatialHadoop, and if you are interested, I'll be happy to answer questions later. For the open source project, we provide tutorials and instructions on how to install it and how to use it. In this talk, I started by giving the motivation for the work on big spatial data and SpatialHadoop. Next, I'm going to go into the internal system design of SpatialHadoop, so I'm going to describe its core components. After this, I'm going to describe some applications that we built using SpatialHadoop, and then I'm going to go over some related work and some experimental results for SpatialHadoop. Then I'm going to go over other research projects that I have been working on and my future research plans. So let's start with the overall picture of SpatialHadoop. We have layers inside the system that are designed to work efficiently with spatial data. At the lowest level, we have an indexing layer, where we are concerned with storing big spatial data sets efficiently in the [indiscernible] file system. We have a big file that contains spatial data, we need to store it in a [indiscernible] environment, and we want to find the best way to store and process it. On top of this, we have the MapReduce layer, where we extend the MapReduce query processing engine so that it can access the underlying indexes. You still write a MapReduce program, but you have access methods to the underlying spatial indexes. On top of this, we provide an operations layer, in which we encapsulate a wide range of spatial operations that are built efficiently on the indexing and MapReduce layers so they run much more efficiently than in traditional Hadoop.
Then, on top of this, to hide the complexity of the system, we provide a high-level language called Pigeon, which has spatial constructs. It actually makes the system usable by non-technical users. We also recently added the visualization layer, where users can interactively explore big spatial data sets efficiently through a visual interface. Also, on the side, we have ST-Hadoop, which is an extension that goes across all these layers and adds efficient support for spatio-temporal data. On top of the system, we have a wide range of applications that use SpatialHadoop as a backbone to process big spatial data. In this talk, I'm going to focus on the indexing, operations, and visualization layers, and I'm going to go quickly over some applications that we built using SpatialHadoop. So let's start with the indexing layer. In the indexing layer, we have a big spatial data set and we need to store it efficiently in a distributed file system. Let's first see how traditional Hadoop loads a big data set into HDFS, the Hadoop file system. If you have a big file like this, it chops it into blocks of 64 megabytes and then puts each block on one of the machines. However, when it does this chopping, it doesn't take the values of the records into account, which means that spatially nearby records, such as related records, will end up on two different machines, which will reduce query processing performance. And if you want to use one of the traditional spatial indexes, such as the R-tree or the quad tree, it cannot be applied as is in HDFS, because HDFS is too restrictive compared to a traditional file system. Once you write a file in HDFS, you cannot modify it. This is a limitation by design in HDFS, which makes it totally different from a traditional file system.
So we cannot use the traditional indexes as is, and at the same time, the data sets we work with are really huge, so we cannot just load it all on one machine and let it partition or spatially index the data. What we ended up providing in SpatialHadoop is a two-layer index design which overcomes these limitations. In the first layer, we have a global index which, again, partitions the data into blocks of 64 megabytes, but it puts nearby records in the same block. So each block here will contain records that are spatially relevant or spatially nearby, which will improve query performance, as I'm going to show shortly. >> Can I ask a question? You only consider two dimensions, right? The geo data is always two-dimensional. >> Ahmed Eldawy: It's not always two dimensions. What we propose, if you talk about the concepts, can actually apply to higher dimensions as well. We're using concepts like the R-tree and the quad tree, which can handle higher dimensions, while the version [indiscernible] SpatialHadoop is mainly two-dimensional, because this is the type of data sets that we have access to. Yeah? >> [Indiscernible]? >> Ahmed Eldawy: So, we just have two dimensions, and I deal with it [indiscernible] space. >> [Indiscernible]. >> Ahmed Eldawy: So, let's say I have a spatial attribute which can contain not just points but rectangles or polygons or lines, and we process this attribute to [indiscernible]. So, Ravi, you had a question? >> [Indiscernible] question. Because the classic use case of Hadoop is: I load the file, I don't think about it. It looks like you're replicating a database: [indiscernible] query language, there is an indexer, there is... So my question is, does the user have to worry about [indiscernible] index? Does it vary for different spatial data sets, or is it obvious? It's one of the questions we've been asking. With Hadoop, you don't normally have to think about indexing, right? >> Ahmed Eldawy: Right.
>> So does it do something like auto indexing? Does it understand which attribute is relevant and automatically index it? >> Ahmed Eldawy: No. The question is how to find which attribute to index, right? What is the spatial attribute? Typically, when you load a file into Hadoop, it doesn't really understand the format of the underlying file. So in this case, if you are loading data into SpatialHadoop, we have to understand the underlying format, and [indiscernible] has to define which attribute to process. It could be a point or a polygon. We support different file formats; for example, if you have CSV files and there is a column [indiscernible], which is a polygon or a line, then SpatialHadoop can use this column to index the data. >> So as part of the data loading process, you essentially define the schema that's going to result from building the index, and it's assumed that the data is in a format that will work for that. >> Ahmed Eldawy: Right. So [indiscernible] defines, for each record, how to get the spatial attribute out of it, which is basically just the MBR of the record. This is the main thing we need. SpatialHadoop can work with different formats, but it at least has to understand the spatial attribute of each record to be able to partition the data efficiently like this. So, yes, at the first level, [indiscernible] blocks of 64 megabytes, but we take the spatial attribute, the location of each record, into account, and then we load each block onto one of the machines. At the second level, because each block is 64 or 128 megabytes of data, there is still room for better optimizing each partition. So we build a second level of local indexes, where we organize the records inside each partition efficiently.
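To make that "record to MBR" step concrete, here is a minimal sketch in Python (SpatialHadoop itself is Java, and its real parsers differ); the column layout and the semicolon-separated coordinate format are assumptions for illustration only.

```python
# Hypothetical record format: "id,name,x1 y1;x2 y2;..." -- the geometry column
# holds a semicolon-separated list of coordinates. This only illustrates the
# "record -> MBR" extraction described in the talk, not SpatialHadoop's API.
def record_mbr(csv_line, geom_col=2, delim=","):
    """Return the (xmin, ymin, xmax, ymax) bounding box of one record."""
    geom = csv_line.split(delim)[geom_col]
    coords = [tuple(map(float, pt.split())) for pt in geom.split(";")]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))
```

Whatever the input format, this per-record MBR is the only spatial information the partitioner needs, which is why SpatialHadoop can stay agnostic to the rest of the record.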
The challenge is actually how to find these boundaries of partitions, because, as we mentioned, the data is really huge, so we cannot let one machine look at all the data and decide how to partition it; and at the same time, we cannot use the traditional ways of indexing or partitioning the data because of the limitations of HDFS. When we started on this, we first thought about using a uniform grid to partition the data, which is very easy to implement: just scan the data and assign each record to the overlapping grid cell. But it doesn't work well with highly [indiscernible] data, which is the normal situation with spatial data; it only works with [indiscernible] distributed data. So we had to think of a better way of indexing the data. We started with an R-tree index, and I'm going to describe for you now how to build the R-tree index under these limitations, and then quickly show how we can extend this work to support other spatial indexes. In the first step, we read a sample of the data. We have a very big file, and we read a one percent sample of it. For this one percent sample, we take only the location of each record into account; if you have a complex polygon, we just take its [indiscernible], which actually [indiscernible] the size of the data that we need to work on. One percent is the typical number. For example, if you have 100 gigabytes of data, with this sampling and [indiscernible] points, it can be reduced down to around 14 megabytes, which can be easily loaded onto a single machine. In the next step, one machine [indiscernible] the data, looks at it, and then partitions the space based on this sample. This happens on a single machine, but it typically takes only a fraction of a second, because the data is very small compared to the original data.
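The uniform-grid baseline mentioned above can be sketched in a few lines; the fixed 0-to-1 space and the (col, row) cell ids are assumptions for illustration, not SpatialHadoop's code.

```python
# One pass over the data: each record's MBR is assigned to every grid cell it
# overlaps. Simple, but with skewed data most records land in a few cells.
def grid_cells(mbr, grid=4, lo=0.0, hi=1.0):
    """Return (col, row) ids of the cells overlapping an (xmin, ymin, xmax, ymax) MBR."""
    step = (hi - lo) / grid
    clamp = lambda i: max(0, min(grid - 1, i))
    c0, c1 = clamp(int((mbr[0] - lo) / step)), clamp(int((mbr[2] - lo) / step))
    r0, r1 = clamp(int((mbr[1] - lo) / step)), clamp(int((mbr[3] - lo) / step))
    return [(c, r) for c in range(c0, c1 + 1) for r in range(r0, r1 + 1)]
```

Note that nothing here looks at the data distribution, which is exactly why the grid only behaves well for uniformly distributed data.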
What this machine does is load all this data into an in-memory R-tree, but while building this R-tree, it adjusts the leaf capacity using this equation. The idea of the equation is that it tries to make the number of leaf nodes in the R-tree equal to the number of HDFS blocks that we are going to build in the end. The next step is actually very interesting: we throw away the sample, because we don't need it anymore, and we also throw away the tree that we built. So we build this tree and then throw it away, but we keep the leaf level. We take the leaf nodes, and we take the boundaries of each leaf node as the MBR, the boundary, of the corresponding block in the index that we want to build. This means that we can now easily scan all the data in parallel and assign each record to the overlapping partition. So we converted the complex index building technique into just a scan of the data, which can be done easily in parallel. And we don't have to worry about the limitations of HDFS anymore, because each record will know a priori which partition it will go to. As a final step, we can build local indexes in each of these partitions, and that's very easy because each block is typically 64 megabytes, so we can build all these local R-trees in parallel. So this is how we build the R-tree, and I'll show you shortly that this is actually a very efficient way to build an R-tree. You have a question? Yeah. >> [Indiscernible]? >> Ahmed Eldawy: We cannot guarantee this, of course, because we are just reading a sample. A partition can be slightly more or less than this. If it's less than 64 megabytes, then we have partially filled blocks. If it goes over 64 megabytes, then we write multiple blocks for this partition. So some partitions can be slightly more than one block.
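The whole pipeline just described (sample, pack, keep only the leaf boundaries, then assign records) can be sketched roughly as follows. This is a simplified STR-style sort-tile pack standing in for the real bulk loading; all names, ratios, and the fallback rule are illustrative assumptions, not SpatialHadoop's implementation.

```python
import math, random

def partition_boundaries(points, num_blocks, sample_ratio=0.01):
    """Pack a sample into roughly num_blocks tiles; return only the tile MBRs."""
    sample = [p for p in points if random.random() < sample_ratio] or list(points[:1])
    sample.sort()                                   # STR first pass: sort by x
    slices = int(math.ceil(math.sqrt(num_blocks)))  # number of vertical slices
    per_slice = int(math.ceil(len(sample) / slices))
    mbrs = []
    for i in range(0, len(sample), per_slice):
        chunk = sorted(sample[i:i + per_slice], key=lambda p: p[1])  # second pass: by y
        per_tile = int(math.ceil(len(chunk) / slices))
        for j in range(0, len(chunk), per_tile):
            tile = chunk[j:j + per_tile]
            xs = [x for x, _ in tile]; ys = [y for _, y in tile]
            mbrs.append((min(xs), min(ys), max(xs), max(ys)))
    return mbrs  # sample and tree are discarded; only the leaf boundaries survive

def assign(point, mbrs):
    """Each record knows a priori which partition it goes to."""
    for i, (x0, y0, x1, y1) in enumerate(mbrs):
        if x0 <= point[0] <= x1 and y0 <= point[1] <= y1:
            return i
    # a record outside every sampled boundary falls back to the nearest partition
    return min(range(len(mbrs)),
               key=lambda i: (mbrs[i][0] - point[0]) ** 2 + (mbrs[i][1] - point[1]) ** 2)
```

With the boundaries fixed up front, the actual assignment is an embarrassingly parallel scan, which is exactly what lets this step run as a plain MapReduce job over an append-only HDFS.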
Some of them can be half filled or partially filled blocks. In reality, they are all very close to 64 megabytes of data. So, this is actually [indiscernible] R-tree. Now, let me ask you a question. If we want to redefine this algorithm to build a quad tree index, do you know which steps we would redefine? If you look closely, you find there are only two spots, two places, where we have R-tree-specific logic, which are the bulk loading step and the local indexing step. That means if we take these two R-trees and put quad trees in their place, we end up with a bulk-built quad tree index. You can similarly build a wide range of spatial indexes using this technique. And this [indiscernible]. So we end up with a wide range of spatial indexes, [indiscernible]. >> What is constraining you to use the same partitioning strategy as what you're [indiscernible]? >> Ahmed Eldawy: Nothing. You don't have to use the same. Actually, you don't have to use both of them. >> The first problem is very well formulated. Why not figure out the [indiscernible]? >> Ahmed Eldawy: [Indiscernible]. Sorry? >> You can formulate it as an algorithmic problem: I want to find [indiscernible] where N is your number of nodes, where [indiscernible] when scaled up, where something [indiscernible]. >> Ahmed Eldawy: Right, right. And this actually... >> Like maybe there's a better way to do it. >> Ahmed Eldawy: Right. What you're mentioning is totally right. The problem is how to find these boundaries of partitions; the index is just a solution to this problem. What I'm showing here is one solution to the problem, but what you are mentioning is correct. The problem is actually [indiscernible], actually the same problem as indexing, spatial indexing. The main problem of spatial indexing is that you have to partition the data into smaller parts and group nearby records in each one. >> [Indiscernible].
>> Ahmed Eldawy: So basically, the basic idea of spatial indexing is to put nearby records together so that when you do query processing, you can quickly find which partitions, which parts of the data, you need to process. If you use an R-tree or a quad tree, they are just trying to solve this problem in one way. Of course, in traditional indexing you also have to worry about [indiscernible], which we're not worried about here. I'm just worried about this part, where you partition the data into smaller parts. >> So did you use the STR [indiscernible] technique? >> Ahmed Eldawy: Yeah. For bulk loading the R-tree, we used the STR bulk loading technique. However, this is just our solution; you can definitely use any other kind of bulk loading technique. You could use the R*-tree, for example, or the [indiscernible] tree. Yes? >> So the spatial world has not converged on [indiscernible]; why is the system supporting so many [indiscernible]? Is it still an open problem, like, are there two good... are there [indiscernible] index structures? >> Ahmed Eldawy: Yes. So the question is why we support more than one index structure. I would say that it's an open problem, and we are still looking at other spatial index structures as well. There are many index structures and there is no clear winner among them, because once you have more than one dimension, you don't have a linear order for the records. Some index structures are good for specific queries, and other index structures are good for other queries. [Indiscernible] so we provide this wide range of spatial indexes, and we had a paper with an experimental evaluation that compares these different index structures. What we found is that there is no clear winner among them. Depending on the shape of your data sets, what they look like, and the type of query you are running, you can choose one of them.
Think of it as in a database, where you have a hash index and a B-tree index and maybe a [indiscernible] index, right? We have all these indexes in a database, and then the user or the administrator needs to decide which index to use based on the queries they want to run. It's something similar here. SpatialHadoop is a big system that supports all these different kinds of spatial indexes, and depending on the queries that users want to run, they can choose the most appropriate index. This is the generic way of building the indexes, and users can easily add their own spatial indexes to it. So [indiscernible] add more spatial indexes to SpatialHadoop. Now, my next slide is actually my favorite slide in the indexing part, because it somehow summarizes the whole indexing story of SpatialHadoop. This shows the R-tree index that we built in SpatialHadoop for a 400-gigabyte data set of road networks. You can see here that all the black rectangles represent 64-megabyte blocks of data. For example, in the ocean here, because the data is very sparse, we get a very large partition, while in this area, in Europe for example, we get very small partitions to reach 64 megabytes of data. There are two points that I want to make about this figure. First, the idea of sampling and bulk loading actually works: you can see that with just a one percent sample of the data, we still get a very efficient index of the data. That's one point. The second point is that you can now easily see how SpatialHadoop can be much faster than traditional Hadoop. If you process this data in traditional Hadoop and, let's say, your query focuses on this area, you have to process all the data, right? This file contains around 10,000 partitions.
You have to process all 10,000 partitions, because you can never know which blocks you want to process, while in SpatialHadoop, because we have these partition boundaries, we can limit the number of partitions that we need to process and still parallelize them over multiple machines. This was an example of an R-tree, but we can also build a quad tree, which will look like this. Or if you have uniformly distributed data, you can build a uniform grid index. And if you wonder how this data would look if loaded into traditional Hadoop, I have this slide just for you. You can see here that if you load this data into traditional Hadoop, almost all the partitions cover the whole input space, which tells you how SpatialHadoop can be much more efficient: with traditional Hadoop you almost always have to process all the partitions. So to summarize the indexing part, we mentioned the limitations of traditional Hadoop for building indexes, and how SpatialHadoop can overcome these limitations and provide a wide range of spatial indexes. Question? Yeah? >> [Indiscernible]? >> Ahmed Eldawy: The global indexing part, yes. It's built by bulk loading an R-tree and then taking only the leaf level, so it ends up as a kind of flat partitioning. This is the global index, so for each record, we put it in one of the partitions. The second level, which is the local index, can actually be a hierarchical index structure. >> [Indiscernible]. >> Ahmed Eldawy: Exactly. So the global index, which is the boundaries of each partition, is typically stored in memory. So without having to process all the data, we can prune all the partitions that are outside the query range. So this [indiscernible]. [Indiscernible] other queries that we can run also with these indexes. So the next part, which is also very interesting and related to spatial indexing, is how to use these indexes in spatial operations.
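The in-memory global-index pruning just described can be illustrated with a small sketch. The function names and the dictionary layout are hypothetical, not SpatialHadoop's API; the point is only that a rectangle-overlap test against partition boundaries selects the partitions to process.

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap; each rect is (x1, y1, x2, y2)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def prune_partitions(global_index, query):
    """Keep only the partitions whose boundary rectangle (MBR) overlaps the
    query range. global_index maps partition id -> MBR, kept in memory as
    described in the talk, so no data is read for pruned partitions."""
    return [pid for pid, mbr in global_index.items() if overlaps(mbr, query)]
```

Only the surviving partitions are then processed in parallel, each one consulting its local index.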
So if you are writing a MapReduce program for a specific spatial operation, how do you make use of these indexes? SpatialHadoop ships with a wide range of operations. We started with basic operations such as range query or [indiscernible], and then added more, like spatial join operations and a wide range of computational geometry operations such as convex hull, polygon [indiscernible] construction, and [indiscernible] triangulation. And this layer is extensible. So if users are interested in adding their own operation in MapReduce, they can just write it in the same way that we wrote these operations. So let me mention some examples of these operations. Let's start with range query, which is a simple operation. You have a set of records and a query range, and you want to find the records that overlap with this query. In this case, we can use the global index to prune the partitions that are outside the query range, and then, for the remaining partitions, we can find the matching records efficiently using the local indexes in the matching partitions. This is actually a very simple query, but it shows you how we use the two levels of indexes to run this query efficiently in MapReduce. Another operation is the spatial join operation. We have two sets of records and we need to find the overlapping records between the two files. This case is actually more challenging, because you have two data sets and each one is indexed separately, so they might have two different indexes. In the ideal case, if you have the same index used for the two files, then you end up with a one-to-one correspondence between the partitions of the two files, so you just have this [indiscernible] correspondence, and then you can find the matching records in each pair of partitions efficiently.
However, in reality, this case almost never happens, because you have two files and each one is indexed separately, so you end up with two different indexes. For example, even if you are using a [indiscernible] index for both, you might end up with a three-by-three grid and a four-by-four grid for these two files, depending on their sizes. Now we have two ways we can continue the spatial join query processing in this case. One way is to do the join directly. In this case, we find the overlaps between the partitions of the two files. For example, here we find a total of 56 overlapping pairs between the two indexes, and each pair of overlapping partitions is processed on one of the machines to find the overlapping records. That's one way. The other way is to do a partition join. In this case, we repartition one of the files. For example, we repartition the file with the three-by-three grid into a four-by-four grid, which exactly matches the other file. This takes us back to the previous case where we have a one-to-one correspondence, and we end up with only 16 overlapping pairs of partitions between the two files. Now, the challenge is which of these two algorithms we want to use. If we do the join directly, we don't have to pay the overhead of repartitioning, but we might end up with a huge number of overlapping pairs. In the partition join, we have to pay the overhead of repartitioning one of the files, but we minimize the number of overlapping pairs. So based on this, we actually equipped SpatialHadoop with a cost model which compares the cost of the two algorithms and then decides which one to use for a specific query. It takes into consideration the number of partitions in each file and the number of pairs of overlapping partitions between the two files, and the cost model actually ends up being a very simple rule-based model.
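The choice between the direct join and the partition join might be sketched as follows. The rule and the cost weights here are illustrative assumptions, not SpatialHadoop's actual cost model; it only mirrors the trade-off described: direct join pays for every overlapping pair, while the partition join pays a repartitioning cost to get a one-to-one pairing.

```python
def mbr_overlap(a, b):
    """True if two rectangles (x1, y1, x2, y2) share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def choose_join_plan(parts_a, parts_b, repartition_cost=1.0):
    """Illustrative rule-based cost model: join directly if the number of
    overlapping partition pairs is already close to the one-to-one pair
    count we would get after repartitioning, otherwise repartition."""
    direct_pairs = sum(1 for a in parts_a for b in parts_b if mbr_overlap(a, b))
    one_to_one_pairs = max(len(parts_a), len(parts_b))
    smaller = min(len(parts_a), len(parts_b))
    # repartitioning reads and shuffles the smaller file once
    if direct_pairs <= one_to_one_pairs + repartition_cost * smaller:
        return "direct-join"
    return "repartition-join"
```

With two identical grids the direct join wins outright; with heavily overlapping partitions the repartition join is chosen despite its shuffle cost.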
So, another set of operations is computational geometry. This actually all started with the polygon union operation. There was a [indiscernible] company that approached us and showed their interest in using SpatialHadoop to speed up this query, the polygon [indiscernible] query, and we [indiscernible] continued on this and provided a complete library of computational geometry operations in SpatialHadoop. So we added a skyline operation, convex hull, farthest pair, closest pair, [indiscernible] Voronoi diagram construction, and Delaunay triangulation. And all of them are done efficiently using the spatial indexes that we build in SpatialHadoop. In all these operations, we start with a single machine as a baseline. Then we provide a Hadoop implementation for each of these algorithms, which does not use the spatial indexes but parallelizes the work using MapReduce, and we get up to [indiscernible]X better performance -- that's actually for the skyline operation -- over 128 gigabytes of data using 20 machines. You have a question? >> [Indiscernible]? >> Ahmed Eldawy: So skyline, for example: say you have billions of points and you need to find the skyline of these points. Think about a real application that can use this. I'll give you just one example, [indiscernible] triangulation. [Indiscernible] triangulation is used to triangulate the space based on the distance between points, and this is actually used to model the surface of the earth. So you have LiDAR data which tells you the altitude of different points on the whole earth's surface, and to build a model for the whole earth's surface, you have to find the triangulation of all this data. We have huge data sets for this. We have billions of points, and we compute the [indiscernible] triangulation of all of them to find a good model for the earth's surface. So this is one example.
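As a concrete picture of one of these operations, here is a standard single-machine skyline (maxima) computation for 2D points, the kind of baseline the talk compares against; this is not SpatialHadoop's distributed version, just an illustration of what the operation computes.

```python
def skyline(points):
    """2D skyline (maxima) sketch: a point is on the skyline if no other
    point is >= in both coordinates. Sort by x descending, then keep the
    points whose y strictly exceeds every y seen so far."""
    result = []
    best_y = float("-inf")
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            result.append((x, y))
            best_y = y
    return result
```

In the distributed setting, the same idea runs per partition, and partitions whose region is dominated by another partition can be pruned entirely before any point is read.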
>> So this would be so that you can do, for instance, a more accurate spatial join or something like that. So what's the value of having the triangulated surface? It's really almost like a 3D model, a topology or something like that. >> Ahmed Eldawy: Exactly. So you have the points, the altitude of different points on the earth's surface. [Indiscernible] estimate the points, the altitude at that point. The way we do it is [indiscernible] triangulation, and now let's say you want to [indiscernible] the altitude at this point. You take into account the three points that form the triangle around it. So, for example, you have a point here; you don't have the altitude of this specific point, but you can estimate it by taking the altitudes of those three points into account. >> So if you want to know that Denver is a mile high or something like that. Okay. Right. Yeah. I get it. >> Ahmed Eldawy: Yeah. >> So with the 20 machines you got a 29X speedup? >> Ahmed Eldawy: Yeah. Actually, [indiscernible]. Yeah. So we started with this, which is parallelization of the existing algorithms, but we didn't stop here. We continued by providing more efficient algorithms that take the spatial indexes into account, and achieved up to 260X better performance for the same query. Yeah? >> [Indiscernible]. >> Ahmed Eldawy: Right. Right. Yeah. >> [Indiscernible]. >> Ahmed Eldawy: We tried -- well, not with the [indiscernible], but for other parts we had up to five terabytes of data. Basically, we were limited by the resources we had, because we had to run this on an internal cluster that we have at the university, and it was not that big at the time. We had 20 machines, and each machine had like 52 to 100 gigabytes of storage, so it was not really huge, because we had to share it with other applications and all this stuff. >> But you mentioned that some of the users are [indiscernible]; is that right?
>> Ahmed Eldawy: Yeah. That's actually one of the things that [indiscernible] visualization part. With visualization we tried up to, yeah, petabytes of data and [indiscernible] of points, basically. >> In a sense, what you're saying here is, as [indiscernible] was pointing out earlier, you are doing a lot of the preprocessing and indexing up front, just like you would in a [indiscernible], right? >> Ahmed Eldawy: Right. >> And so, either your cost has to be way lower in this setting, or your performance has to be higher, right? How do you compare? >> Ahmed Eldawy: That's a good point. We didn't actually compare to a parallel database, but we compared to a single machine. Take indexing, for example. We didn't [indiscernible] compare in an experimental setting, but this was more motivation for our work. We had, for example, a 128 gigabyte data set. We tried to index this using a single machine, which seemed feasible, right? This is not really huge, but it took a very long time. >> [Indiscernible]. >> Ahmed Eldawy: Sorry? >> What was running on the single machine? >> Ahmed Eldawy: In this case, we were running [indiscernible] queries on a single machine. For example, let's say [indiscernible] triangulation: we have a set of points and we compute the [indiscernible] triangulation of all of them. >> [Indiscernible] single node DBMS. >> Ahmed Eldawy: Not a DBMS, just a Java program. We load all the data in memory and do all the query processing in memory. Yeah. >> Is there [indiscernible] DBMS with spatial support? [Indiscernible]? >> Ahmed Eldawy: I think Oracle supports this, Oracle Spatial. >> [Indiscernible]. >> Ahmed Eldawy: I think it is, yeah. I didn't use it, but I think so. It's built on the DBMS, so if you can parallelize Oracle, then you can parallelize Oracle Spatial as well. Although we didn't [indiscernible] try to compare with these systems.
But there is other [indiscernible] that compares Hadoop to parallel databases, so here we tried more to compare Hadoop to SpatialHadoop, to show the effect of the things that we added to Hadoop. Yeah? >> [Indiscernible]? >> Ahmed Eldawy: Right. So let me show one example now, which is the convex hull. I'm going to compare how it is [indiscernible] in Hadoop and SpatialHadoop; this will answer your question. In convex hull, we have a set of points and we need to find the minimal convex polygon that contains all these points. Now, let me show you how we implement this in both Hadoop and SpatialHadoop. The first step is partitioning the data. If you use traditional Hadoop, you get the default way of partitioning the data, which does not take the spatial attributes into account, while if you use SpatialHadoop, we use the spatial partitioning as you see here. The next step is the pruning step, which can only be applied in SpatialHadoop because it relies on the spatial partitioning. In this case we can prune this partition, which is completely outside the answer because the convex hull will go around it. So we have a formal rule for pruning the partitions that will not contribute to the answer. After this, we end up with fewer partitions in SpatialHadoop, and then both systems compute the convex hull inside each partition. This step is parallelized in both systems. However, SpatialHadoop will be much more efficient, because it can reduce the number of partitions that it needs to process using the spatial pruning that we proposed. Then the final step is to take the answers of all these local convex hulls and put them on a single machine to get the final answer.
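The convex hull plan just described, local hulls per partition followed by a single-machine merge, can be sketched as follows. Andrew's monotone chain is assumed here as the local algorithm; SpatialHadoop's actual implementation may differ, and the driver function is illustrative.

```python
def convex_hull(points):
    """Andrew's monotone-chain convex hull: a standard single-machine
    algorithm, used here for the per-partition (local) step."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h

    lower, upper = half(pts), half(reversed(pts))
    return lower[:-1] + upper[:-1]  # counterclockwise, no duplicate endpoints

def distributed_hull(partitions):
    """Sketch of the MapReduce plan: a local hull per partition (the
    parallel step), then one machine takes the hull of the union of all
    local hulls (the final merge step described above)."""
    local_hulls = [convex_hull(part) for part in partitions]
    return convex_hull([pt for hull in local_hulls for pt in hull])
```

The key property is that the hull of the union of local hulls equals the hull of all points, so only the (small) local hulls travel to the final machine.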
So we can parallelize the query in both Hadoop and SpatialHadoop, but SpatialHadoop is much faster because it can prune all the partitions that do not contribute to the final answer. >> [Indiscernible] the final answer should be small enough that it can [indiscernible]. >> Ahmed Eldawy: Right. Right. For convex hull, this is typically the case. In other algorithms such as [indiscernible], for example, the answer is typically larger than the input, so we have to use other techniques to do this work. I can discuss this with you offline if you are interested. So to summarize the operations part, we showed how we can use the spatial indexes to speed up different types of queries, which are all implemented as MapReduce programs in SpatialHadoop. Yeah? >> [Indiscernible]. >> Ahmed Eldawy: We just [indiscernible] in this case, yeah. Yeah, a question? >> [Indiscernible]. So are these now available as libraries that people can actually compose into their larger MapReduce workflows, doing this and then maybe feeding it to a different part of the pipeline? Is that something? >> Ahmed Eldawy: Yeah. Right. So typically in MapReduce, you run each phase as a MapReduce program. So we have the MapReduce functions and a MapReduce program to construct the [indiscernible] diagram. Once you have the output, it's still stored in HDFS, so you can run other queries on top of it, or after it. All the operations are expressed as MapReduce, which means you can have multiple phases out of this. So another part we [indiscernible] is the visualization part. In visualization, we have huge data sets and we want to generate an image that describes how the data set looks. So, for example, if you have a set of LiDAR data, we can visualize it like this. If you have a road network, you can visualize it a different way.
And all these different ways of visualization are already available; there are existing algorithms for these types of visualization. The main limitation of these existing techniques is that they cannot scale to the amounts of data that we work with in SpatialHadoop. So what we proposed here is a framework, HadoopViz, which can scale out these existing visualization techniques. In HadoopViz we are not trying to propose a new visualization technique; we are trying to scale out existing techniques so that they can work with the huge amounts of data that we can work on in SpatialHadoop. For example, this is an example of visualizing [indiscernible] points which represent the temperature in the whole world. You have a daily snapshot; each snapshot contains 500 billion points for the whole world, and we have these available daily for over six years. In this case we generate 72 frames; each frame represents the data of one month as one image, and it shows you how the temperature has changed over time. Although you can do this on a single machine, it would take around 60 hours to generate these 72 frames, while in HadoopViz we can do it in only three hours using ten nodes -- [indiscernible] cluster of quad core machines. So now let me show you how we generate each one of these frames efficiently in HadoopViz. The basic idea is that we have a huge data set; we break it down into smaller parts, process each part separately, and then combine them together to get the final image. Now, there are actually two ways we can partition the data: either the default Hadoop partitioning, which ships with traditional Hadoop, or the spatial partitioning that we added in SpatialHadoop. If we use the default Hadoop partitioning, we end up overlaying the intermediate images to get the final image. Yes, question? >> [Indiscernible].
>> Ahmed Eldawy: So it's not a new indexing technique; these are available as MapReduce programs. This is basically how to write MapReduce programs that generate these huge images efficiently. We are actually using the same indexing or partitioning techniques that we provided. >> [Indiscernible]. >> Yeah. You're moving up the stack. >> Ahmed Eldawy: Right. This is actually at the top of the stack, so we are making use of the underlying components to build these visualization techniques. So, if we use the default Hadoop partitioning, we end up overlaying the intermediate images. Think of this as transparencies: we have multiple transparencies, each one contains part of the image, and when you put them on top of each other, you get the final image. If we use spatial partitioning, we end up stitching the intermediate images to get the final image. Think of this as a jigsaw puzzle: you have small tiles and you put them side by side to get the final image. Now, the challenge is which one of these is more efficient. We support both in HadoopViz, and you can see that there is a difference between these two partitionings. The default Hadoop partitioning technique is much more efficient than spatial partitioning, because it doesn't have to take the spatial attributes into account. But the overlay process is much more costly than the stitch process, because the intermediate images are very large, so it has to process all these big images to get the final image. So what we ended up doing is proposing a cost model that compares these two techniques, and HadoopViz can automatically decide which one to use based on the image size. What we found is that as the image size increases, we better go with the stitch process, while for small images the overlay process will be just fine.
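The overlay-versus-stitch decision could be expressed as a simple threshold rule. The cost formulas and the threshold constant below are illustrative assumptions, not HadoopViz's published model; they only encode the trade-off just described: overlay merges many full-size canvases, stitch pays a fixed spatial-partitioning overhead but merges small disjoint tiles.

```python
def choose_partitioning(image_width, image_height, num_partitions,
                        partitioning_overhead=10_000_000):
    """Hypothetical cost rule: overlaying merges num_partitions full-size
    intermediate canvases, so its cost grows with image size; stitching
    merges disjoint tiles (about one canvas worth of pixels) but pays a
    spatial-partitioning overhead, modeled as a pixel-equivalent constant."""
    overlay_cost = image_width * image_height * num_partitions
    stitch_cost = image_width * image_height + partitioning_overhead
    return "stitch" if stitch_cost < overlay_cost else "overlay"
```

This reproduces the stated behavior: small images favor overlay, and as the image grows, stitch wins.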
Based on this, we could implement an algorithm for each of these visualization types, and if you wanted to add a new visualization type, you would have to reimplement the algorithm for it. [Indiscernible], this is not really good, because we would end up with a huge number of implementations of the same algorithm, which is not good for a product, because we cannot maintain all these implementations. So what we ended up doing is proposing just one algorithm that can support all these different types of visualization, and if you add a new visualization type, you don't have to redefine this algorithm. The second thing is that we provide a visualization abstraction. This abstraction contains five functions, and these five functions can be defined by the user, or by the visualization designer, to define the visualization logic. So you can define these five functions one way to visualize [indiscernible] data, and you can define them in a different way to visualize [indiscernible]. Now we have a separation of roles: the visualization designer focuses on defining these five functions, while the role of HadoopViz is to generate the image at scale. So [indiscernible] handles just the scalability issues and doesn't have to worry about the visualization logic. Now I will show you how the abstract algorithm looks at a high level, and then I'm going to show some examples of how we can define these five functions. The first step is to partition the data, and we mentioned the two ways we can partition the data. After this, we call the smooth abstract function, which is defined by the user. This basically takes nearby records in each partition and tries to fuse them together to get a better looking image, depending on the visualization logic. Then we call a create-canvas abstract function, which initializes an image structure that can be used for plotting or visualization.
Then we call a plot function, which takes one record at a time and updates this canvas to draw or plot the record on the canvas. Again, this is an abstract function defined by the user, and we call it in parallel to visualize all the records. After this, we call the create-canvas function again to create the final canvas that will hold the final image. Then we call the merge function, which is defined by the user to merge all these partial images in parallel into the final canvas. And finally, we call the write function, which takes this canvas and writes the final image that's displayed to the user. So let me show you some examples of how we have defined these five functions. Let's say we want to visualize satellite data. The first function is the smooth function. The smooth function will take part of the data and try to estimate missing points. As you can see here, there are missing points due to the clouds, and this function applies a two-dimensional interpolation technique to estimate these missing points. Then the create-canvas function will initialize a 2D matrix, all zeros. Its entries represent the different pixels of the image we want to generate. Then the plot function will take one record and, depending on the location of this record and its temperature, it will update some entries in this matrix. The merge function is as simple as matrix addition, just adding the matrices up. And finally, the write function will map each entry in the matrix to a color in the image and then write the final image to the output. Once we define these five functions, we can plug them into the abstract algorithm to generate this image at scale. Similarly, if we need to visualize a road network, which is a totally different type of visualization, we can still express it using these five functions.
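The five-function abstraction can be made concrete with a tiny heat-map example. The function names follow the talk (smooth, create-canvas, plot, merge, write); the bodies and the `render` driver are illustrative sketches written for this example, not HadoopViz code, and the satellite example's interpolation is replaced by a no-op smooth.

```python
def smooth(records):
    """Fuse nearby records; a no-op placeholder here (the satellite
    example interpolates missing points instead)."""
    return records

def create_canvas(width, height):
    """Initialize a 2D matrix of zeros, one entry per pixel."""
    return [[0.0] * width for _ in range(height)]

def plot(canvas, record):
    """Update the canvas with one record (x, y, temperature)."""
    x, y, temp = record
    canvas[y][x] += temp

def merge(final, partial):
    """Matrix addition of a partial canvas into the final canvas."""
    for y, row in enumerate(partial):
        for x, v in enumerate(row):
            final[y][x] += v

def write(canvas):
    """Map each entry to a color; here just scale to grayscale 0..255."""
    peak = max(max(row) for row in canvas) or 1.0
    return [[int(255 * v / peak) for v in row] for row in canvas]

def render(partitions, width, height):
    """Abstract algorithm: smooth and plot each partition (conceptually
    in parallel), then merge the partial canvases and write the image."""
    final = create_canvas(width, height)
    for part in partitions:                 # each iteration is one mapper
        canvas = create_canvas(width, height)
        for rec in smooth(part):
            plot(canvas, rec)
        merge(final, canvas)
    return write(final)
```

Redefining only these five functions, say plotting line segments instead of accumulating temperatures, yields a road-network visualization with the same driver.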
And then if users want to add a new visualization technique, all they have to do is define these five functions, and then it runs automatically in HadoopViz to generate the image at scale. So users don't have to worry about scalability anymore. All of what I described so far is a way to generate a single frame, one image. Another kind of visualization is what we call multilevel images. In multilevel images, you generate an image where users can zoom in and out to see more details of a specific area. This is similar to Bing Maps, for example, where you have a very high level image and then, as you zoom in, you can see more details of a specific area. However, in our case, we are generating these images for a specific data set. So you have your own data set and we want to generate this multilevel image. In this example, we have a two gigabyte data set, and although this data set is not really huge, due to the huge size of the underlying image it might take one hour on one machine to generate it, while in HadoopViz we can do it in only two minutes. I'm not going to get into the details here, but the basic idea is to generate this pyramid of image tiles, and by generating more zoom levels, we allow the user to zoom in further to get more details. We partition this pyramid into smaller parts and then generate each part separately. We have a cost model to compare the different algorithms for generating this pyramid, and then we can parallelize the work using multiple -- yeah? >> Can you realize [indiscernible]? >> Ahmed Eldawy: Sorry? >> Can I pose a query [indiscernible]? >> Ahmed Eldawy: Yeah. So you can definitely do this. If the result has a spatial attribute and you want to visualize it, you can definitely do this, yeah.
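Mapping a point to its pyramid tile at a given zoom level is the basic bookkeeping behind these multilevel images, and might look like this. This is a generic web-map-style sketch (Bing Maps style, where level z is a 2^z by 2^z grid of fixed-size tiles); the world extent and the function are assumptions for illustration.

```python
def tile_of(x, y, zoom, world=(0.0, 0.0, 1.0, 1.0)):
    """Return (zoom, tile_x, tile_y) for a point at the given zoom level.
    Level z covers the world extent (x1, y1, x2, y2) with a 2^z x 2^z
    grid of equally sized image tiles."""
    x1, y1, x2, y2 = world
    n = 1 << zoom                                    # tiles per axis
    tx = min(int((x - x1) / (x2 - x1) * n), n - 1)   # clamp the right edge
    ty = min(int((y - y1) / (y2 - y1) * n), n - 1)   # clamp the bottom edge
    return zoom, tx, ty
```

Grouping records by this key is what lets each part of the pyramid be generated independently on a different machine.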
So, yeah, as I mentioned, all the queries that we run in SpatialHadoop run as MapReduce programs, and visualization is the same case; the visualization is just a MapReduce program. So the output of one MapReduce program can be fed as the input to another MapReduce program. >> Basically, in some sense, all of these blocks of code are composing through HDFS, through reading and writing -- >> Ahmed Eldawy: Right, right. The input is HDFS and the output is also HDFS, which means you can feed the output of one query as the input to another query. >> [Indiscernible]. >> Ahmed Eldawy: You can still do it. So let's say you build a [indiscernible] diagram, for example, and then index this [indiscernible] diagram. That's something you can do. >> You just have to pay the cost of the indexing in the pipeline, however long it takes to build the index. >> Ahmed Eldawy: Right. Right. Yeah. So if you are expecting to query, let's say, the [indiscernible] diagram multiple times, maybe you build an index on it so that you have the data indexed. >> How expensive is indexing the data compared to doing a scan of the data? >> Ahmed Eldawy: In terms of query processing? Or -- >> The latency, the end to end time. Suppose I want to issue some sort of query that looks at the entire data set in Hadoop -- maybe the query isn't fundamentally complicated, I just want to look at all the data -- versus the cost of producing an index. Is the index much more expensive to compute, or about the same, or a little more expensive? >> Ahmed Eldawy: If we are talking about a MapReduce program with just a map phase, where you scan the data and write the output, it will be much more efficient than indexing, because indexing involves shuffling the data from mappers to reducers.
But if you are writing a complete MapReduce program, it's not much different, because in the end, indexing is just a scan of the data that has to reshuffle the records. >> So it's the shuffling part that's expensive, not really the index construction. >> Ahmed Eldawy: Right. Exactly. Yeah. So I'm a little bit over time, so I'm going to go quickly over the applications that we built using SpatialHadoop. Basically, what I want to show here is that SpatialHadoop can be used in real applications, and I will say that, if nothing else, we are actually using it ourselves to build these applications. There are also examples of other universities using SpatialHadoop to build their own applications. Our application is called Shahed. It's a system for querying and visualizing spatio-temporal satellite data. You can select any area in the world in this web interface and then visualize it or run some aggregate or selection queries on it. Another application uses OpenStreetMap data. It extracts the data out of the 500 gigabyte data set from OpenStreetMap and indexes it, and it becomes available for users to query with just a few clicks. Another application is called MNTG, which is a traffic generator. With this application, you can go to this URL, select any area in the world, and then the system will simulate moving objects on the underlying area, which can be used for testing and benchmarking your applications. Now, in the next part, I'm going to go over some work related to SpatialHadoop and some experimental results. So, actually, we are not the only player in the area of big spatial data. There are many other systems that are trying to extend existing big data frameworks to support spatial data. And if I want to differentiate between SpatialHadoop and all these systems in just one statement,
I'll say that SpatialHadoop is the only system among all these that extends the core of Hadoop to make it available for MapReduce developers. All the other systems are somehow rigid in terms of the types of indexes they support or the types of queries they provide, while SpatialHadoop is very extensible: you can add your own spatial indexes, your own spatial operations, or your own visualization, which makes it very attractive for researchers and developers to add their own techniques, because you don't have to stick with the set of operations or indexes that ships with SpatialHadoop. Yeah? >> Do you [indiscernible] indexes or do you support extensible indexing? These are very different. >> Ahmed Eldawy: We support extensible indexes. >> So you have [indiscernible] people can build their own indexes. >> Ahmed Eldawy: Right. There are actually existing projects where people are adding their own spatial indexes to SpatialHadoop. They don't have to go through the whole pipeline of writing the whole indexing algorithm. You just have to implement an abstract interface that defines how the index will be built. It's like one class that you implement. >> Maybe a related question. So if you modify the core of Hadoop, and the core of Hadoop evolves, do you have to adjust SpatialHadoop? Because the advantage, right, of only sitting on top is that when the open source community produces new versions of Hadoop, you don't have to -- >> Ahmed Eldawy: Well, yeah. That's a good point. Our current implementation actually ships as an extension to Hadoop. However, it overrides some internal classes of Hadoop, which means, yeah, if they change the underlying architecture totally, in a way that's not compatible with the current architecture, then SpatialHadoop will have to be modified somehow.
However, we still mostly extend existing classes, so it can actually run with different distributions of Hadoop; we don't have to stick with a specific distribution. There are examples of people running it on Hortonworks, which is another distribution of Hadoop, or on Cloudera Hadoop. So it can run across different distributions, although we built it on Apache Hadoop. >> So you're saying that enough people have taken a dependency on these things that you have extended that it's probably difficult to change them at this point. It would break a lot of people's hearts. >> Ahmed Eldawy: Right. Yeah. [Laughter] >> Ahmed Eldawy: We think about this with each version, and we are trying to make it easier for other people. Yeah. So, let me -- okay. Yeah, question? >> [Indiscernible]? >> Ahmed Eldawy: That's a good point. Currently, SpatialHadoop sticks with the limitations of HDFS, which means that you cannot modify files in HDFS. So currently the indexes are static: once you write an index, you cannot modify it. You can of course rebuild it if you have new data, but you can't modify it. We have an extension, which I didn't describe here, called ST-Hadoop. It works only for spatio-temporal data. So if you are expecting more data to come with recent timestamps -- for example, you have Twitter data and each day you have a new batch of records with that day's timestamp -- we support some kind of update of the existing index, but this is limited to this kind of spatio-temporal data. In general, if you have a spatial index and need to add more records to it, this is currently not supported. This is one of the open problems that we are thinking of approaching in the future. Yeah. So let me quickly show some examples of experimental results. This shows the performance of range query.
So as we increase the input size from one gigabyte to 128 gigabytes, the throughput of Hadoop keeps decreasing because it has to scan the whole file, while SpatialHadoop can limit the amount of data that is processed using the spatial indexes, so it can keep its performance. Another example is spatial join, and this actually gives some motivation for the different spatial indexes that we support in SpatialHadoop. So you can see how the same query runs with different performance numbers with different spatial indexes on the two input files. So it actually motivates people to add more indexes that can be more efficient for specific queries. The next part is the computational geometry work, where we show how we can speed up different kinds of spatial operations or [indiscernible] operations in SpatialHadoop. Finally, this is the visualization part, and we show here how we can speed up different kinds of visualization types by just defining these five abstract functions. So in the next part, I'm going to go very quickly over other research projects that I have been working on. One of them is Hekaton, here in Microsoft Research with Paul Larson and Justin. Basically, it was motivated by the huge drop in the prices of memory; I'm pretty sure you are familiar with it. So the project that I did was how to differentiate between hot data and cold data and try to migrate the cold data to a cold store, so that we can reduce the memory footprint and keep the performance. And this was actually integrated with SQL Server. Another part is Quill, which I did with Badrish and Jonathan. Basically, in this project we use Trill as a single-machine stream processing engine and then provide an API that can parallelize this work in the cloud. Another project is NADEEF, which I did at QCRI. This is an extensible data cleaning system.
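The range-query behavior mentioned above (plain Hadoop scans the whole file, while SpatialHadoop prunes partitions using the spatial index) can be illustrated with a small toy sketch. This is illustrative Python, not SpatialHadoop code; partitions are modeled as (MBR, points) pairs with invented layouts:

```python
def range_query_scan(points, query):
    """Baseline: touch every record, as a plain full-file scan must."""
    x1, y1, x2, y2 = query
    return [p for p in points if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]


def range_query_indexed(partitions, query):
    """With a spatial index: only open partitions whose minimum bounding
    rectangle (MBR) overlaps the query rectangle."""
    x1, y1, x2, y2 = query
    result = []
    for (px1, py1, px2, py2), pts in partitions:
        # Prune partitions that cannot contain any matching point.
        if px2 < x1 or px1 > x2 or py2 < y1 or py1 > y2:
            continue
        result.extend(p for p in pts if x1 <= p[0] <= x2 and y1 <= p[1] <= y2)
    return result
```

Both return the same answer, but the indexed version's I/O is proportional to the overlapping partitions rather than the whole file, which is why its throughput stays flat as the input grows.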
So we have some dirty data, the user defines the cleaning rules, and then the system applies these user-defined rules to clean the data. Another system is called Sphinx, which I did very recently. You can think of it as a spatial Impala. Impala is a parallel database designed by Cloudera, and Sphinx reuses the spatial indexes that we built in HDFS by pushing the spatial query processing into Impala and extending it to support these kinds of spatial indexes. So this is mainly for SQL processing of big spatial data. So in the last part, let me go quickly over some future plans for my research. Some of them are open problems in the current work that I did. For example, in the visualization part, we can think of extending the abstraction to support other types of visualization, such as 3D visualization. And we can also migrate the visualization process to other systems such as [indiscernible] or Storm so that we can support realtime visualization. Another part is to expand the query processing engine [indiscernible] so that we can make use of the spatial indexes that we built in SpatialHadoop, and this will allow us to support [indiscernible] queries. Another thing is to extend the spatial indexes so that they can support dynamic data, where we can add more data -- related to the question you asked. So the current indexes are static, but in the future we can modify or expand these indexes so that we can add more data to the index. For long-term plans, we have an idea of building a unified big data interface on top of the big data frameworks, and this is actually motivated by the huge number of open source big data frameworks such as Hadoop, Spark, and Impala. And actually, a question that I usually get from users is which one of these systems they should use for a specific application.
And it's very challenging because, first, there is an overlap in the functionality of these systems. For example, Impala, Spark SQL, and [indiscernible] can all support SQL query processing. And there's no clear winner in terms of performance between these systems. For example, the paper about Spark SQL last year at SIGMOD compared it to Impala and showed that Impala can be faster than Spark SQL in some cases. Another paper that will be published [indiscernible] later this year compares Hadoop to Spark and shows that Hadoop can be faster than Spark; although Hadoop is very old and assumed to be very bad, there are still case studies where Hadoop can be faster than Spark. So with this, it's very challenging to decide which system to use. But the good point is that we don't have to choose one system, because we can actually run all these systems together in the same physical cluster: all of them support HDFS as a file system and YARN as a resource manager, which means all of them can coexist in the same cluster. However, my plan is to build a unified abstraction on top of them, where users express their queries in a unified language, and the system applies some cost model or some rules to choose which underlying system should be used for each specific query. For example, if it's a [indiscernible] query, it can run on Spark; if it's [indiscernible], on Hadoop. We can also apply query optimizations, taking the query and trying to optimize it for the specific application or for the specific underlying framework that is chosen. Or we can apply some advanced query execution techniques. For example, we can take a huge or complex query, break it down into smaller parts, run each part on a different framework, and combine the results back together.
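In its simplest rule-based form, the framework-selection idea described above might look something like the following sketch. The framework names are from the talk, but the rules, query representation, and thresholds are invented purely for illustration; a real planner would use a cost model fitted to measured performance:

```python
def choose_framework(query):
    """Toy rule-based planner: map a query description to a backend.

    `query` is a plain dict of illustrative, made-up properties.
    """
    if query.get("iterative"):
        # Iterative (e.g. ML-style) workloads benefit from in-memory caching.
        return "Spark"
    if query.get("kind") == "sql" and query.get("interactive"):
        # Low-latency interactive SQL suits an MPP-style engine.
        return "Impala"
    # Default: large one-pass batch scans.
    return "Hadoop"
```

Because all three systems can share the same HDFS data and YARN cluster, routing a query this way requires no data movement, only choosing where to submit the job.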
And with this, we can provide a better user experience for users choosing between all of these different systems. Another long-term project that I can work on is geographically distributed clusters. This is motivated by the systems that generate a huge amount of user content. For example, think about Facebook or Twitter, where people create a huge amount of content. So we have these users all around the world generating huge content, and if you store all this data in one datacenter, it cannot scale with the amount of data that we store. So what is currently done in these systems is that we have different datacenters, geographically distributed around the globe, and each one stores data from nearby users. This works fine for storing the data and retrieving it, but when it comes to analyzing this data, what they currently do is ship all the data they want to analyze into one datacenter and do all the query processing in that datacenter. This can work in some cases, but in other cases it's better to do part of the query processing at the different datacenters and then combine the partial results together in one datacenter to get the final answer. And the challenge is that if you want to run Hadoop or Spark in a geographically distributed environment like this, these systems are not actually designed for this kind of query processing. So they might keep moving or shuffling data between machines that are totally separate in terms of geographic location -- for example, shuffle data from Europe to the U.S. -- which will reduce the query performance. And what makes it even more challenging is that there is a trade-off between the cost and the processing time. So, what we do here is to expand one of the big data frameworks.
It could be Spark or Hadoop, and then we try to modify its source so that it takes into consideration the geographic locations of the machines; we try to optimize the query and balance the cost and the processing time. So this is the last part of my talk. Let me quickly summarize the talk and then I will be happy to get your questions. I've described the different spatial indexes that we support in SpatialHadoop. Then I described different operations such as range query, spatial join, or computational geometry operations. Then I described the visualization part, where we can visualize data in both single-level and multilevel images. Then I described some applications that we built using SpatialHadoop, went over some other projects that I worked on, and described my short-term and long-term research plans, such as the unified big data API and geographically distributed clusters. So, with this, I'll end my talk here and I'll be happy to get your questions. Thank you all for listening. [Applause] >> Ahmed Eldawy: Any questions? >> [Indiscernible] data set that you have tried? >> Ahmed Eldawy: So I'll give you two examples of large data sets that we are working with. For vector data, where we have points and polygons, we worked with up to five terabytes of data. For the raster data, which is the satellite data, we worked with up to 30 terabytes of data. There is a [indiscernible] archive which is like one petabyte of data, but we didn't have the capacity of machines to store all that data. So I took a sample of this data, like 50 terabytes, and worked with it. It is the largest I could get on the machines that I have. >> And what did you find? Did it scale well in those cases? >> Ahmed Eldawy: Yeah. Up to these numbers, it scales very well. Almost linear with these data sets. >> And at the terabyte level, did you see the performance scaling you'd expect -- you'd hoped for?
I shouldn't say expect, but that you hoped for, when you go from terabytes to a petabyte. >> Ahmed Eldawy: I didn't try, but it would be very interesting to try. I didn't have the resources to try the one-petabyte archive, but it would be very interesting. Yeah. >> [Indiscernible]. [Laughter] >> Do these systems run IO bound or CPU bound or message bound, or what exactly [indiscernible] performance here? >> Ahmed Eldawy: That's a good question, because it actually depends on the query that we are running. Some queries are IO bound, such as spatial join, for example; it depends on how much data we need to read from disk. Other queries, such as computational geometry queries, especially [indiscernible] triangulation, which is very complex, are actually CPU bound. So most of the time is spent computing [indiscernible] and connecting them together. So it depends on the query. In the visualization, we found that it's actually network bound; it depends on the communication between the different parts. >> So I could ask a Trill question, which is, 30 terabytes you might run on a single machine. >> Ahmed Eldawy: If you have -- >> [Indiscernible] scale, yeah. >> Ahmed Eldawy: If you have, yeah. >> Yeah, so there are systems like that where you run 30 terabytes. What would happen if you tried that, number one? And number two, does that suggest perhaps other approaches to the problem other than sort of [indiscernible]? >> Ahmed Eldawy: That's a good point. So, again, it depends on the type of query. Some queries could run by scanning the data; in that case, you could run it on a single machine even if you have 30 terabytes of data. Other queries actually require all the [indiscernible] to be in memory for processing, right? So it needs to look at all the data.
For example, if you're computing a [indiscernible] diagram and you use the typical single-machine algorithm, you actually need to do it all in memory, because it does divide-and-conquer and keeps merging the data together. For the type of query processing that is done in Trill -- well, let me think about it. If we use Trill as is, I think it can work with spatial data. But the good question is, how can we improve Trill to work better with spatial data? Right? This could be a good direction. Take, for example, storing the data itself: in Trill, you have to store the data first and then read it, so it has to be stored somehow. If you have the control to store it more efficiently, then we can reuse some of the parts that we did here, for example, spatial indexing or spatial partitioning. It can still be applied in a system like Trill or Quill because it's distributed. So while you store the data in Azure blobs, we can make use of the spatial partitioning that we do, so that when we do the query processing, we can make use of this. One thing I can think of for Trill: it somehow scans the data, right, so we have batches and you run each batch separately. If you can have spatial boundaries for each batch -- so you get a batch of records and you also know the spatial boundaries of that batch -- then this can significantly speed up the query processing. So parts of this work can actually be migrated to a system like Trill or Quill. Any other questions? >> You talk about spatial [indiscernible]? >> Ahmed Eldawy: Yeah. So far we work with just planar data.
So if we are using geographical data like [indiscernible], we just assume that it's flat on a map. We didn't take into account, for example, that the data is on a sphere. But this could be an open problem that we work on in the future. For example, in that case we can use a hierarchical triangular [indiscernible] instead of the normal partitioning, because it's more suitable for this kind of data. But so far, we just work on planar data. >> [Indiscernible] that I have to support. >> Ahmed Eldawy: We didn't have this situation. We can work with it in the future. >> Badrish Chandramouli: All right. It's very interesting. Let's thank the speaker. [Applause]