>>: So the next speaker, Rob Gillen, needs no introduction, because he was just introduced in the previous thing, in the panel. But he's from Oak Ridge National Laboratory, and he's been doing some interesting things with Azure and looking at a lot of problems from the national laboratories' point of view. >> Rob Gillen: The original e-mail I got said this was a 30-minute talk. Then I got here on Thursday, and the slide said talks are limited to 20 minutes. I cringed, and then I realized that the panel preceded this one, so I'm going to count that as my introduction and we're just going to roll right in here. I want to talk about two main topics, or two topics under the umbrella of data and the cloud. To give a little bit of context, we've been working on a project specifically focused on this. If you don't know much about Oak Ridge National Laboratory, we do a lot of computation -- a lot, meaning we have both the first and third supercomputers on the Top500 list sitting on the floor, and we're currently putting in our third petascale machine. We do lots of computation. All of those bad things we heard about in the opening keynote -- MPI and Fortran and C++ -- that's all we do. So I'm probably the fish out of water in that group, but we do a lot of work with lots of tightly coupled simulation, though we also do lots of other areas of science. In other audiences I tend to say we cover all areas of science, and I've been told that's technically incorrect and I get in trouble for saying it. We do lots of science, from biology to neutron scattering -- take your pick. We do lots of different things. Our specific interest in cloud computing is how it fits into our computational profile, specifically what we define as mid-range computing. Because our definition may be different from yours: we define that as computing where the problems need between 256 and 1,000 nodes. So that's our middle band for computational profiles. It's a multi-year project, and the first year we're focusing specifically on data. These are some thoughts and comments that come out of that. Like I say, the first aspect we're going to talk about is movement. There are two main topics: movement and then data services. So for movement, we're focusing on three paradigms, very, very simple. One is if we have computation hosted locally at our facility and we're accessing data that's hosted in the cloud. The second is the reverse of that: we actually have some massive data sets, petascale size, and we're looking at the case of what if we want to make those available to the cloud for third parties or other people who are doing computation there. The third case, which is the easiest of the three, is scenarios where both computation and data live in the same cloud. While there are multiple cloud providers out there, we've been focusing our efforts on the two that are represented here, Amazon and Microsoft, and playing with different keys there. Probably unlike most, I'm more of a Microsoft coder by definition, so our samples will lean that way, though we've been working with both. We came across something I know is going to amaze you all: the Internet is slow. To put this in context, Oak Ridge has uniquely fortuitous network connectivity. 
There are a lot of national academic networks that make a loop around the country, and then there's this big fat pipe that goes from Atlanta to Chicago, and it bends directly through our building. So we have multiple ten-gig connections, a couple of 40-gig connections, and we're in the process of putting in a 100-gig connection to the Internet right through our building. So we have significant bandwidth. In that context, as we're studying things, we're trying to keep things in perspective, right. Things that may work for us may not work for other people. Things that work at our lab may not work for that typical scientist we were talking about earlier in the panel, or at a research institute or collegiate institution or things like that. So we're looking at different scenarios. One of the biggest things: we started off with some very simple baseline parameter sweeps. We started off by generating a bunch of data in different file sizes and doing parameter sweeps of those file sizes, both up and down, going to Amazon's cloud, going from our place to Azure's cloud, doing downloads and inter-cloud communications. Probably one of the most significant things that came up is more educational -- not so much from my standpoint, coming from industry and the Internet, but for our researchers. What caught them off guard a little bit, having spent most of their time working in local networks, was the variance that the Internet introduces. So the next two slides hint at this. This is duration by file size showing one standard deviation, comparing the same profile -- Microsoft against Amazon, coming from our lab. The key here is not to say, well, Amazon's better or Microsoft's better because it has less variance. That's irrelevant to the point. The point is there's a lot of variance, and there are a number of factors that play into it. This next slide illustrates it much more. This is a very non-intuitive variance pattern, and there's no big fancy scientific reason for why this happens. It's called network congestion, right. It has nothing to do with the science, nothing to do with the bits. It has everything to do with the fact that we ran our tests sequentially. We worked from one file size all the way up, and when we hit this file size, it happened to hit a blip. There happened to be other traffic on the network that interacted with it and caused things to slow down. That's the reason. It's a very key and important thing for researchers, if you're doing this sort of detached or segregated computation, where you've got computation disjoint from your data, say in the cloud, to take into account in your computation and in your patterns. So again, the obvious statement from this is that you want to put your data as close to your compute as you can. That's an obvious, known statement. But there are still cases where we think it may be too expensive to move data. As you can see, we did parameter sweeps from 2 K up to a gig. This is not just moving massive files; this is also moving small files simultaneously and in parallel. So we spent some time looking at this and came up with, again, a somewhat obvious but effectual statement: multithreaded downloads, or file transfers, make a significant impact, right. 
So again, something that may not be as intuitive to classical HPC people is that moving data across the Internet using multithreaded transfers can provide some interesting improvements. So essentially what we did is we adjusted the libraries and test harnesses we had earlier to say: rather than just opening up a socket and pulling a file down, first ascertain what the file size is, split it into N parts, open up that number of threads, and pull it down. And we did parameter sweeps for all of the file sizes as well as the number of concurrent threads. For downloads that's very easy, because both providers -- and most providers -- have HTTP interfaces that support byte ranges, so you can simply say give me this byte range, with different threads working on different byte ranges. That was very easy. The upload was actually much harder. So what we did is we looked at the different API styles and the things supported by the different providers, and we stole an implementation idea from Microsoft. With Amazon's S3, when you want to put up a blob, you open up a pipe, say here's my data, go, go, go, go, go, close it, and you finalize your data. Very standard upload. Microsoft does the exact same thing until you hit a limit of 64 megs. Anything above 64 megs, they require you to break up into blocks -- I think each block has to be less than four megs -- and you can issue a number of puts, and then, like they demonstrated the other day, once you get them all up there, you do a commit, say here's my block list, and it assembles that into a file. The beauty of the Microsoft approach is it allows us to do sub-file parallelization, so we can transfer one file and do parallel threads. What we did then, so we could compare things similarly, is we established a data proxy. We basically built our own data service and deployed it both to Azure and to EC2 in this case. The point of that data service was to allow it to behave very much like Microsoft's blob storage does, in that it allows block transfer. It does a couple of additional things: it allows us to do CRC checks on each block as well as adaptive compression on each block. As we're doing our transfers, we would do selective compression on each block and then do the reassembly on the client side or, excuse me, on the server side, and then optionally we would transfer that file from the compute location over to the storage location. And what we saw was, well, sort of what you'd expect. We saw some pretty amazing performance increases. This is actually the download direction. Even for a second thread, we saw improvements easily pushing double-digit percentages, all the way up to over 800% performance improvement for transferring stuff. Again, that's because we're taking advantage of, if you will, weaknesses in the Internet and the way things work in opening up pipes. So one of the topics we want to throw out is that this all works and this is great, but if I'm a scientist, or if I'm trying to consume one of these cloud services, we view this as an area that needs significant work. In neither of these cases were the local disks or the target disks the gating factor. It wasn't local network. It wasn't local memory or CPU, nor was it remote CPU or remote memory. It's simply the interconnects between them. 
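As a rough illustration of the byte-range technique described here, the sketch below shows a multithreaded download in Python rather than the .NET harness the team actually used; the object URL, part count, and output path are placeholders, and it assumes the server honors HTTP Range requests (as both S3 and Azure blob storage do).

```python
# Minimal sketch of a multi-threaded, byte-range download.
# Illustrative only -- not the ORNL test harness; URL and thread count are assumptions.
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "https://example-bucket.s3.amazonaws.com/big-file.bin"  # hypothetical object
PARTS = 8  # number of concurrent threads / byte ranges

def fetch_range(byte_range):
    start, end = byte_range
    # Each thread asks for its own byte range of the same object.
    resp = requests.get(URL, headers={"Range": f"bytes={start}-{end}"}, timeout=300)
    resp.raise_for_status()
    return start, resp.content

def parallel_download(path):
    # First ascertain the file size, then split it into PARTS ranges.
    size = int(requests.head(URL, timeout=30).headers["Content-Length"])
    step = size // PARTS
    ranges = [(i * step, size - 1 if i == PARTS - 1 else (i + 1) * step - 1)
              for i in range(PARTS)]
    with ThreadPoolExecutor(max_workers=PARTS) as pool:
        chunks = list(pool.map(fetch_range, ranges))
    with open(path, "wb") as out:
        for start, data in sorted(chunks):
            out.write(data)  # reassemble the parts in order

if __name__ == "__main__":
    parallel_download("big-file.bin")
```

The upload direction would follow the block-blob pattern described above (put many small blocks in parallel, then commit the block list), which is what makes sub-file parallelization possible on the Microsoft side.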
So we're focusing our own efforts internally, and also making a call forward: how do we solve this problem? If I've got a gig connection at my facility and Amazon or Microsoft has a gig connection at their place and neither is full, why can't I use as much as is available, right? Why can't I get closer to saturating that network pipe? That's an area of significant improvement we're looking to see, and we'd call upon the community to work on and focus on it. The second topic is services, and this sort of follows on to one of the points we made earlier. I think we need to step back and look: when we're talking about cloud computing, talking about what it's good for and how it augments our computational platform, one of the natural areas is data distribution and data sharing. And there are a lot of reasons and discussions as to why that should be. Our facility does a lot of data distribution. We host a bunch of the IPCC data for the climate stuff, and I did a little digging as to how that works. There's actually this big massive tape library, and there's a web page you go to where you can go through a catalogue and select things. You check a box, it submits a job, it goes to the tape library, moves it over to a computer, you get an e-mail with a download link that's good for an hour or something, you pull it down, and so forth. It works, but it feels very arcane, very difficult. And what I'm getting in that case is a bunch of NetCDF files, which, if you're a domain scientist or you work in that space, you know what that is and you know how to work with it, and that's fine. If you're not, you look at the file and you say, what's this? And if all you really wanted to know was, for a given point in time, what was the temperature for this lat/long combination, you've done an awful lot of work and you still have an awful long way to go to get that answer, right. You have to find libraries to interact with it -- certainly they're there -- but you've downloaded this gig file for roughly a K's worth of information that you need. So we saw earlier a conversation about Dallas, and there are some really interesting things there, not necessarily from a product standpoint but conceptually in what they're trying to do. They're taking a data service and saying one of their key selling points, if you will, is that they're going to provide commonality across all of these data services. They're going to provide a similar interface, whatever it may be. In their case, it's OData, which gives you JSON, optionally JSONP, and AtomPub. As we looked at that and were playing with it, we actually did some work with -- I don't know if it's technically accurate to say a predecessor to Dallas, but it's called OGDI, the Open Government Data Initiative, which is an SDK that was put together by Microsoft's DPE group out of Microsoft Federal in DC. Basically the same notion as Dallas, just less polished and without all of the corporate sponsorship behind it. Same ideas, though. We did some kicking around with it, and again, stating a brilliant observation: science-friendly formats often don't equal Internet-friendly formats. 
When we think about Internet-friendly formats, we're thinking about AtomPub, XML, JSON -- things that are easily consumed, things that people can, while they probably wouldn't want to make a steady diet of it, open up in a text browser, see, consume, and sort of grok what the data is. On the other hand, scientific formats -- FASTA maybe being an exception to this, though even that's a little weird to look at -- NetCDF in particular is this big hierarchy of binary data. Without the proper tools, there's no way to dig into it. So we did some experiments. We took a couple of NetCDF files and tried to publish them to the Internet, saying okay, what if we wanted to make this available as a service, ignoring by choice the whole OPeNDAP approach and what's accepted, because again that's domain-specific within a realm. We're trying to see what happens if you take a purely Internet-friendly approach all the way down the road. Again, we saw some amazingly blatant, or expected, results. Stepping back, we took a slice of data, which was all of the temperatures for a given point in time across the globe, based on their grid, which I think is a five-degree grid. It ended up being somewhere in the order of 8,200 data points, each data point consisting of a time stamp, lat/long, and a temperature value. So not much data. As aggregated within the larger NetCDF file, it was a K or a couple of Ks worth of data. When you flatten that to a CSV file, if you will, and stream it across, you're talking about a couple hundred K worth of data. When we moved it and exposed it as an AtomPub service, it got really big. That same 8,200 records -- if you look particularly at the first and third lines -- became almost nine megs of XML coming down the line. There's an obvious problem here in that the data services at that time -- and I don't think they do yet -- on Azure don't support compression natively or automatically, so you don't get HTTP compression on your transfers yet. I understand there's work coming there. Right now, it just gives you raw XML, which, as we all know, is very bloated. JSON is better, but it's still pretty significant compared to the original, which I actually don't have on the chart. So the payload got very, very big. So, again, stating future work here as well as a call to the community: as we, as an industry or as a research community, adopt cloud computing and cloud computing strategies for distribution of data and so forth, I'd like to see us move towards consistency across the protocols that we use to distribute. There's certainly a justification and a valid reason for binary or, if you will, native formats, but I think it's incumbent, at least in my semi-perfect view of the world, that if you're going to distribute data in the native formats for broad distribution, it should be accompanied by more Internet-friendly formats, so I can get it both ways, or either way I'd like. Ideally, it's exposed as a service that will adapt based on my call. So I use the same URI endpoints and basically pass flags that say maybe I want it as JSON or AtomPub. Maybe I'm playing with a small data set and have a very easy way to interact with it. 
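To give a feel for the payload blow-up being described, here is a toy comparison using a synthetic 8,200-point temperature slice rather than the actual IPCC NetCDF data; the Atom-like XML envelope is a simplified stand-in (real AtomPub entries carry even more per-entry metadata, so the real gap is larger).

```python
# Toy illustration of how the same ~8,200-record slice grows with the serialization format.
# The records are synthetic and the XML is a stripped-down Atom-style envelope.
import csv, io, json, random

records = [
    {"time": "2000-01-01T00:00:00Z",
     "lat": round(random.uniform(-90, 90), 2),
     "lon": round(random.uniform(-180, 180), 2),
     "temp": round(random.uniform(-60, 45), 2)}
    for _ in range(8200)
]

def as_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["time", "lat", "lon", "temp"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def as_atom_like_xml(rows):
    # One <entry> per record, each wrapping the same four values in element markup.
    entries = "".join(
        "<entry><title>temperature</title><content type='application/xml'>"
        f"<m:properties><d:time>{r['time']}</d:time><d:lat>{r['lat']}</d:lat>"
        f"<d:lon>{r['lon']}</d:lon><d:temp>{r['temp']}</d:temp>"
        "</m:properties></content></entry>"
        for r in rows
    )
    return f"<?xml version='1.0'?><feed xmlns='http://www.w3.org/2005/Atom'>{entries}</feed>"

for name, payload in [("csv", as_csv(records)),
                      ("json", json.dumps(records)),
                      ("atom-like xml", as_atom_like_xml(records))]:
    print(f"{name:>14}: {len(payload.encode('utf-8')) / 1024:.0f} KB")
```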
And as I'm moving down the chain and figure out what I want, I pass a different flag and get the same results -- larger results -- in a friendlier format for that particular thing. I think there needs to be work in the area of binary XML and some standards along that line, wherein we can express formats that have some of the richness, if you will, of AtomPub -- because AtomPub is great for a lot of things in that it's self-describing and it's richly and easily consumed by a number of clients -- so we need to shoot for formats that have those qualities without the encumbrance of being overly bloated. So that's the future work. The last statement here goes to the overall theme: I think it's incumbent on us as researchers, as we're using the cloud and looking at how it changes what we're doing, to be a little more open with the data that we're developing or publishing. Working through some papers I was editing for somebody else, they made some very audacious claims with very little data to back them up. One of the things the Internet provides us is the ability to link large data sets with those results and to publish them concurrently, allowing people to do that validation. We hinted at that in some of the other conversations. And I think we can do better as a community. I think we can provide that proof, provide that data, such that a broad community of people can consume and use it. That's it. [applause]. >>: Questions? >>: I think you sort of hit on it. One of the reasons why binary formats sort of split [inaudible] richer. >> Rob Gillen: Yes. >>: I think it's somewhat fundamental, because the atomic unit of work decided to do this [indiscernible] time stamp [indiscernible] can be made, right, so if you actually want to draw that locally, because that's what you're doing, is maybe [indiscernible] splitting up a gig because it's XML. So I guess I was wondering, is there a way to make science-friendly Internet-friendly, rather than take Internet-friendly and use it more for science? >> Rob Gillen: I guess that's the -- >>: [indiscernible]. >> Rob Gillen: If every scientific domain used NetCDF, sure. That's the point, right. The point I'm making is that much the same way AtomPub has become very widely adopted -- if you look at protocols like OData and GData, that approach is very broadly used, right, and it's sort of being used in science slightly backhanded, coming from industry and from the larger group -- I'd like to see us take that same notion, whether it's AtomPub or some variant, but provide it in a binary format, so it's domain- and discipline-agnostic. >>: I guess -- I'm not trying to defend the CS database version or whatever, but they were saying that they are [inaudible] just because there's not [indiscernible] or something. >> Rob Gillen: Right. >>: Like they're just arrays. >> Rob Gillen: Yes, but if you're not a scientist, do you have any idea what an [indiscernible] is? Not even a domain scientist. >>: If you're not a scientist do you know [indiscernible] that's what they would say that their data [indiscernible]. 
Anyway -- >> Rob Gillen: Whether it's the development of a new format, an adaptation of another one, or simply the popularization of an existing format, I think there's work that could be done in a number of different ways. Because I frankly don't care what the format is, so long as it's broadly available to a large populace, not just that 1% of people who know what MPI is. Yes? >>: Did you guys take a look at any sort of plan optimization technique, scientists or things like [indiscernible]? >> Rob Gillen: We did not. We acknowledged that they were there and [inaudible]. Again, thinking more of the generic consumer -- both assessing what's currently out there, but also looking at what that generic consumer would have exposure to or access to. >>: [Indiscernible]. >> Rob Gillen: Yeah, so, I didn't go into that very well. I really wanted people not to look at the fact that one was better than the other, because there are about 300 different reasons for that. I took the results and ran up to our network guys and said, look at this, why? We're connected both to Internet2 and ESnet, and we actually saw that because of the way Microsoft uses Akamai, Southern Crossing, which is one of the big network providers, does a lot of chatty advertising on the routers -- everything Microsoft this way -- which is not actually the best route for us to get to Azure, right. So our traffic to Azure was actually taking a performance hit because of other things on the net. That's one of the key things we took away from that. There's a plethora of things you have to work through to get that optimal network transfer that most people who traditionally live inside a given data center wouldn't think of. The point of that was to show the variance and that there were odd things happening. I've been very careful not to say this one is better, that one is better. It's the deltas between the two -- >>: Actually, we've had some experience where research [indiscernible] and it's totally different than how the stuff in the middle [indiscernible]. >> Rob Gillen: It does. >>: It's interesting, because I instantly thought of BitTorrent when you were talking about the multithreading. That was supposed to solve the bottleneck. But what you're saying is maybe the bottleneck has been in the network the whole time. If you could -- >> Rob Gillen: So Amazon actually has support for pulling data down via BitTorrent, which is novel in some ways. If, say, we've just done a big run of climate data and I'm tasked with distributing it across the world and I've got lots of people pulling the same data at the same time, then that model is really great, because in theory BitTorrent only works well if there are lots of people seeding. If you're the only person pulling the data, it doesn't matter, all right? So it's interesting in that what BitTorrent really relies on is that presumably some of the people who are seeding are closer to you than the actual source of the data, right. And if you actually are closer to the actual source, you're not going to see the benefit from the seeds coming from other people. >>: One of the significant things was your use of the parallel streams. The routing makes a huge difference. >> Rob Gillen: It does. >>: Is that piece of software available? >> Rob Gillen: It will be. 
The point there, though, is yes, we came up with a way to solve it, but in the naive but realistic view of the world, I shouldn't have had to do that, right. And that in and of itself is posing a problem to be solved: it's not a knock on any cloud provider so much as on the stuff in between, right. In theory, I should be able to open a pipe and say here's all my data, and go. >>: [Inaudible] been trying to do that for [inaudible] never would drive. Liability was an issue. It's very [inaudible] and it is the way [inaudible]. I think at this point, we can thank Rob. >> Jie Li: Hello, my name is Jie Li, and today I'll talk about some of our experiences using Windows Azure to process MODIS satellite data. This project has been a cooperation between the University of Virginia eScience Group, the University of California, Berkeley, the Lawrence Berkeley National Lab, and Microsoft Research. First I will give you some background information. eScience today is becoming more and more data-centric and data-intensive, which brings great opportunities to accelerate scientific discoveries. On one hand, the increasing data availability from both large scientific instruments and large-scale, inexpensive ground-based sensors has created an invaluable data repository to enable new science explorations. On the other hand, various computational models with increasing complexity and precision are being used today to produce better scientific results. So with these great opportunities, a natural question to ask is: how can scientists easily access these tremendous volumes of raw data and apply complex computational models to this data to produce meaningful scientific results? More specifically, do scientists have sufficient access to computational resources and enough application and tool support to enable them to easily manage this large-scale data and computation? If the answer is no, then what could we do as computer scientists to help resolve this problem? In this project, we have encountered a concrete example of this problem, and our goal is to build an application upon a scalable infrastructure to help environmental scientists easily access, manage, and analyze the large-scale remote sensing data from the MODIS satellites. To give you some background, MODIS is short for Moderate Resolution Imaging Spectroradiometer, and there are currently two MODIS satellites which are viewing the entire earth surface every one to two days and acquiring data in 16 spectral bands. This data is separated into multiple data products according to the surface types, such as atmosphere, land, and ocean. In general, the MODIS data are very important for understanding the global environment and various earth system models. However, currently, scientists face a number of barriers to using MODIS data in their research. The first barrier is data collection. The MODIS source data are currently published and maintained on multiple FTP sites. Though this data is publicly accessible, it's not easy to query or get a subset of the source data for a specific area or geographic tile, because the metadata for these sources are maintained separately and there is no useful interface to support such queries. And the second barrier is data heterogeneity. 
For the different data products, there are different time granularities and imaging resolutions, and before scientists can apply their computational models and do analysis work on this data, this heterogeneous data must be transformed into a uniform format for the different data fields. Even worse, these different products use two different projection types, called swath and sinusoidal, and these two projection types use two different geographic coordinate systems for mapping the entire earth. It is also an extremely expensive computing process to reproject one projection type to the other. But for many scientists who want to work on all these data products, this is a required step, so we cannot avoid the computation here. And finally, the data management and computational resource requirements are overwhelming. For example, we currently have a use case: there is a graduate environmental science student from UC Berkeley, his name is Yung Will, and he's using a scientific model to compute a science variable from ten years of data covering the continental U.S., and he needs those results in order to finish his Ph.D. dissertation before this July. However, the data management requirement is really huge. It involves in total five terabytes of source data, about 600,000 files, and after we do the reprojection in order to generate the harmonized data, there will be an extra two terabytes of data, and the whole process will take about 50,000 CPU hours of parallel computation. The fact that this computation can be largely parallelized is important, because we need to scale out the computation and use as many computational resources as we can. But still, it will be a long process to finish the computation. So we built AzureMODIS, which is a client-plus-cloud solution: a MODIS data processing framework we built on the Microsoft Windows Azure cloud computing platform. First, we leverage the high flexibility and scalability of the cloud infrastructure and the services provided by Windows Azure. Second, we leverage the unique dynamic, on-demand resource provisioning capability of the cloud infrastructure in order to finish the computation in a timely manner. In this data processing framework, we completely automated the data processing tasks, which would otherwise be done manually by scientists, to eliminate the barriers and complexities. And finally, we provide a generic reduction service in which scientists can run their own arbitrary analysis executables. Okay, next I'll give an overview of the AzureMODIS framework. I guess all of you have seen this slide many times, so I will just skip the basics of Windows Azure here. Okay. This graph shows the high-level overview of our data processing service. There are two main parts. The left side of the graph is the front-end data service web portal, which scientists, users, and developers can use to interact with the system. The right side of the graph is the background computing system, which includes three main stages: a data collection stage, a reprojection stage, and finally the scientific analysis and reduction stage. Actually, we currently have two sub-stages in this analysis step, so there is really a four-stage pipeline in the background system. A typical computation run includes a series of steps. 
First, the scientist will submit a request, in which he specifies the requirements for his computation, through the web portal. Second, the request will be sent to the job request queue in the Windows Azure system. It is then further processed and parsed by the service monitor, which will dispatch a large number of parallel tasks into the task queues. There are a number of service workers working in the back end, and each worker will keep pulling tasks from the queue. After they get a specific task, they will query the metadata in the Azure tables in order to locate the specific source data files on external FTP sites and then download them to local storage. The specified source data are then uploaded to Azure blob storage in order to cache the data for future use, which we'll talk about later. Then the heterogeneous source data will be reprojected into a uniform format before scientists can run analytical work on them. After the reprojection -- when the scientist submits his request, he also uploads his own executables to work on the data, and in this reduction stage the worker will invoke those executables to do the analysis. Finally, when the computation finishes, the system will send a download link to the results produced by these executables. Next I want to show you a short demo which shows our system live. This is the Azure web portal for our deployment. As you can see, there are a number of different roles. There's a web role which is hosting the web portal, called MODIS data service, with one instance. There's also a single-instance service monitor, which is the master of the computing system. Finally, we can see we currently have one worker, which is a service worker, because we currently have no computation ongoing, so we keep a minimum size of the service just to maintain availability. Now I will go to the web portal and submit a test request. So we can go to the data reduction service and specify the years and days. I will just choose two days. Also the satellites, each of them, and also the tiles, which are the strings for the geographic areas. Here we'll choose all the U.S. tiles. And also the request e-mail I will use, which is a test account. Here is the step to upload the reduction executable, and we do have a test executable we got from our members. Open it. Actually, there is an optional second-stage reduction, but here we will just enable a single-stage reduction. Then I will submit the request. Okay, it shows it has been sent successfully, and then we return to the main page, and hopefully, in a few seconds, I should receive a confirmation e-mail from the service -- okay, so I already got the notification e-mail saying the computation has been started by the background system. So now we just wait for the service to be done and then go back to the slides. After we submit our request, what actually happens behind the scenes? Our data processing framework is totally built on the three scalable storage services of Windows Azure. As we send a job request to the queue, there is a single service monitor which will pull the job and then parse it into a large number of parallel single tasks and dispatch them to the specific task queue. And then we have a number of generic worker roles running, and they will keep pulling the single tasks from the queue. 
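A rough sketch of the queue-driven worker pattern being described follows, written against the current azure-storage-queue Python SDK purely for illustration; the original AzureMODIS workers were .NET worker roles, and the connection string, queue name, and the stage_data/run_reduction helpers are hypothetical stand-ins for the real download, reprojection, and reduction steps.

```python
# Sketch of the generic worker loop: poll a task queue, stage the data a task
# needs, run the analysis, and delete the message only after the task succeeds.
import json
import time
from azure.storage.queue import QueueClient

CONN_STR = "<storage-account-connection-string>"  # placeholder, not a real account
queue = QueueClient.from_connection_string(CONN_STR, "modis-task-queue")

def stage_data(task):
    """Hypothetical helper: check local disk, then the blob cache, then the FTP source."""
    ...

def run_reduction(task):
    """Hypothetical helper: invoke the scientist's uploaded executable on the staged files."""
    ...

def worker_loop():
    while True:
        messages = queue.receive_messages(messages_per_page=1, visibility_timeout=3600)
        got_work = False
        for msg in messages:
            got_work = True
            task = json.loads(msg.content)
            stage_data(task)           # fetch/cache the source granules for this tile and day
            run_reduction(task)        # produce results and push them back to blob storage
            queue.delete_message(msg)  # remove the task only once it has fully succeeded
        if not got_work:
            time.sleep(10)             # back off when the queue is empty

if __name__ == "__main__":
    worker_loop()
```

Deleting the message only after success means a worker that is shut down mid-task simply lets the message become visible again, which is the kind of failure handling discussed later in the talk.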
When they get a specific task, they will stage in the necessary source data from blob storage, work on it, and then produce the scientific results and send them back to blob storage. Finally, if the whole service is done, a single download link is sent back to the user. Also, we can see that we have persisted the status information of each request, job, and single task into the Azure tables, and this is helpful both for logging our history and for diagnosing our computations, which I will show you later. Data reuse is a very common scenario in the processing of MODIS data, so currently in our system we have implemented two-level data caching. We use a unique global namespace, and each data file in blob storage has a globally unique identifier, so that we can either pre-download or download all the source files from the external FTP sites and then cache them in blob storage for all future computations to enable reuse. Also, we can compute or pre-compute all the reprojection results for future reuse across different runs of computations. On the local machine level, since each small-size instance has around 250 gigabytes of local storage, we currently choose to cache the large data files for reuse across tasks. And why don't we cache all the data? Because the 250 gigabytes is not sufficient to cache everything. There are some cost-related trade-offs to be made, but given the time limit, I will not go through them here. Okay. For the reduction service, as we saw, scientists can upload their analysis binary tools along with the request, and there are two main benefits. First, they can easily debug and refine their scientific models in their code. And second, we can cleanly separate the system code debugging from the science code debugging, and we have an optional second-stage reduction. This graph shows the scalability of our service; we've done a series of experiments. For this specific experiment we show performance using different numbers of instances working on a total of 1,500 tasks. We also ran the same set of tasks on a single desktop machine. Although the desktop machine's capacity is roughly double that of a single Azure instance, we can still get roughly 90X speed-ups using 150 instances. And second, we got almost linear scalability. Next, I will talk about two important features that we think are key capabilities provided by cloud infrastructure. The first is dynamic scalability. We currently use the Azure management API to dynamically scale instances up and down according to workload, so as to improve cost effectiveness and achieve better utilization. If we go back to the web portal, I think something interesting should be happening there, because as we just submitted a new request, the service monitor will have the information about the computational requirements, okay. So as we can see, it's updating the deployment. This is because the service monitor keeps monitoring the current workload, and it will estimate the total computational requirements and then start up new instances in order to finish the whole computation within the appropriate deadline. But there can be some problems. The most severe problem we observed is instance shutdown. Why is it a problem? Because currently, only the Azure system can decide which instance to shut down. 
That means instances may be chosen randomly by the Azure system and may be shut down during task execution. So we should either provide some failure recovery mechanisms or just wait for all instances to finish their jobs before we can invoke the instance shutdown API. There can be some tricky issues here. Also, currently, computing instance usage is charged by the hour, which means the CPU minutes will be rounded up to an integral number of hours. So it's definitely not very cost effective if we start a new instance, run it for ten minutes, and then shut it down. And very recently, we ran four test cases to test the performance of increasing instances dynamically. For each test case, we increased the number of generic worker roles from one to a larger number: 1 to 13, 1 to 25, 1 to 50, and then 1 to 98. There are two interesting points. First, it generally takes longer to start more instances dynamically. And second, in all four cases, all instances start up at almost the same time. This is different from what we previously observed, which was an incremental startup pattern. So that reminds us we should not only focus on the development of the application itself, but we should also keep an eye on the back-end cloud infrastructure. By the way, in contrast, the shutdown time for instances is relatively small, usually within three minutes. Okay. Another important feature is fault tolerance. Tasks can fail for many reasons. For example, there can be broken or missing source data files. Also, the reduction tool may crash due to code bugs. And finally, failures caused by system instability can be very common when the computation goes to a larger scale. The first two are unrecoverable failures, and the third is actually recoverable. So we have implemented customized task retry policies. In short, tasks with timeout failures will be retried, and tasks with exceptions caught during execution will be immediately retried too. Also, a task will be cancelled after a total of two retries, because if it's unrecoverable, it's not helpful to retry it forever. A question may be raised: why not just use the queue message visibility settings for failure recovery? Given the time limit, I cannot explain it fully here. In short, that can only handle timeout failures, not the other types of failures. Okay. To give some conclusions: we believe cloud computing provides new capabilities and opportunities for data-intensive eScience research. Also, dynamic scalability is a very powerful mechanism, but instance startup overhead is still not trivial currently. And finally, the built-in fault tolerance and diagnostic features are very important in the face of common failures in large-scale cloud applications and systems. And this is our list of future work. Do we still have more time? Okay. Thank you. We plan to scale up computations from the U.S. continent to the global scale, because with our current system, our UC Berkeley graduate student can not only finish the ten years of U.S. data processing, but he can also go beyond that before this July. Also, we plan to develop and evaluate a generic dynamic scaling mechanism with AzureMODIS. 
Finally, and particularly interesting, we want to evaluate the similarities and differences between our framework and generic parallel computing frameworks such as MapReduce. Thank you. Questions? >>: I have two questions. One is short and the other might be long. The first one is, I work with scientists that use MODIS to do analysis. Can I use your application, but with other kinds of analysis functions? That's the first question. And the second question is that they like to look at intermediate results. How would you go about doing this in this environment? >> Jie Li: Thank you for asking those two questions. I didn't have enough time, but now I can explain more. So yes, for the first question, I think the arbitrary reduction executable upload is really powerful, because with that, scientists can upload their own code using MATLAB and other [indiscernible] code, et cetera. So they can specify their own computational models and algorithms to work on the data. And the back-end system is capable of providing uniform, reprojected data, so they already have the same format for those data points. So yes, we can support any other customized code to do the analysis work. And for the second question -- I just skipped the slide here, which is a very important part of our system, and I can show you here. After the scientist sends his request to the system, in the scenario of a cloud computing infrastructure, the user usually doesn't know how many instances are currently working on his computation, so he must be curious what the progress is over that long time. So actually, we've implemented a separate component for status monitoring and diagnosis. We can just use this request: for example, we got a unique computation ID, and we can use this ID to monitor the progress in realtime, as we show here. From the left bar, we can see the total tasks inside the computation we just submitted: four have succeeded and 26 are processing. So this shows progressively, in realtime, how much of the computation has been finished. And we can even click into the details and see the output log and error log; this computation looks fine, right. But if we go to some really large-scale computations, failures can happen. I've switched to a previous request, a computation which includes almost 3,000 tasks, and we got a total of 37 failures, which are unrecoverable. So we finished this computation with these failures. >>: [Inaudible]. >> Jie Li: Yeah, there's also an interesting feature. We've estimated the total cost of the computation, also during runtime, and this specific computation takes 121 hours. The cost is currently based on the CPU hours only, so it doesn't take into account network bandwidth usage and the transactions, because that's an almost negligible part of the cost; CPU is the main component. >>: We have to do the due diligence to compare the billing? Would that -- and it's sufficiently close that it's round off [inaudible]. >>: So this was paid on Mario's credit card, which is why he was so concerned in the last panel session about costs. >>: No, it's my credit card. >>: Oh, your credit card. >>: That's my AmEx card. >>: So related to this, here you're counting the CPU kind of time. You cache all the original FTP files and the reprojected images in your blob storage. So how big is that amount of data, in terms of gigabytes, that you continue to store even when nothing is running in this scenario? 
>> Jie Li: So in total, for our use case -- ten years of data covering the continental U.S. -- it's five terabytes of source data from the FTP sites and two terabytes of reprojected data. So that's seven terabytes in total currently stored in blob storage, which costs around a thousand dollars per month. And our estimate for this use case is that there will be a time span of around six months, so we've estimated that the storage cost will be around $6,000. And the computation cost is around the same. But why did we choose data caching as opposed to recomputing or re-downloading the data? Because network bandwidth also costs money, and since reuse is really common in our computation pattern and the reprojection process is extremely compute-intensive, it's not cost effective to reproduce the data on the fly every time. >>: So the current system, for example, does not look at removing files from Azure storage that have not been used for a long time and then opportunistically just fetching them again, like in three months when you need them, or -- >> Jie Li: Given the project time span, right, it's six months, and our scientists are frequently using almost the whole scope of the data now. So it's not very economic. >>: At this point, I need to move on to the next one. >> Jie Li: Thank you. >>: We can take some questions offline. >> Jie Li: Yeah, sure, we can talk. >>: So the last speaker -- and we'll be finishing up a little bit early, because there's a short break we can have before the final round -- is Wen-Chih Peng from National Chiao Tung University, who will be giving the talk on monitoring and mining sensor data in cloud computing environments. And it's all there. >> Wen-Chih Peng: Okay, yes. My name is Wen-Chih Peng, and today I'll present our work on monitoring and mining sensor data in cloud computing environments. This is joint work with my colleague, Professor Yu-Chee Tseng, and we're from National Chiao Tung University, Taiwan. Here is my outline. First of all, I will tell you how we developed our projects; I will tell you about two platforms, the Tai-Chi platform and the cloud platform. And then I will tell you why we chose to put these applications into cloud computing. Then I will tell you about our implementations, and some observations and issues that arose through those implementations. And then finally, I will conclude the talk. So from the keynote and from the panel discussions, we find there are lots of sensors available. But we don't have a lot of money to buy those expensive sensors. We are focused on sensors like this one, much smaller, which can be deployed to monitor environments and traffic, okay, and these sensors can collect a large amount of data. What we are doing is using these sensors to sense the physical world and collect this data into our servers. Remember that I say servers, because in our [indiscernible] work, we only collected this data into one single server, but now we try to use [indiscernible] servers -- [indiscernible] cloud computing -- and try to monitor and mine knowledge from this sensor data. Okay. So in our school, we have one project called ABC. It means always best connected cloud service and access platform. There are some sub-projects. One sub-project, of course, is the service platforms. These service platforms would be multiple cloud computing platforms. 
For example, we use Hadoop, the open-source Hadoop versions. Of course, we have Microsoft solutions, provided by Microsoft Research. And in the project, we study the cloud device, and some professors study the low-power issues and connectivity issues of the cloud device. And you can see in these figures, of course, we have multiple wireless access interfaces. So some professors study how to provide the best connected wireless communications between the mobile clients and the service platforms. And, of course, we developed some applications, wireless sensing applications, and I will tell you about two of them in the next slides. So here are some wireless sensing applications we developed. One is what we call carbon dioxide monitoring around National Chiao Tung University, and these are the lines, small lines -- the lines will track your breathing becoming heavier and change accordingly. And then I will tell you about Tai-Chi. Tai-Chi is the slow-motion Chinese kung fu. It will keep you healthy and keep your mind at peace. So if you go to Taipei, you can find all these people in the mornings who play Tai-Chi in the park, okay. But how about young people, young people like this one? He will use the mouse and the keyboard to click and play a virtual game, okay. So we try to combine the physical world and the cyber world into one platform, what we call a cyber-physical platform. In this example, you can see students -- there is a Tai-Chi master and some students, and they can be around the world -- and they carry body sensor networks, and their physical information will be shown on websites; for example, Facebook. Okay. So this architecture shows you the Tai-Chi platform. As I told you, we have the body sensor networks. In total, there are nine sensors around the body, and they try to detect the movements of the user. Through the sink, the sink will collect the sensor data and transfer that sensor data to what we call the Tai-Chi engine here. And the Tai-Chi engine will map the motions of the user, and the user will see the other users via the Facebook client application. So here are the detailed components of each [indiscernible] which we developed. I want to skip this slide due to the time limit. Okay. This is the hardware -- the sensor nodes. We use this sensor, and we also have a sink, and the sink has a wireless communications interface. Maybe I should give you the video clip to show you how Tai-Chi works. >>: Three users are at different locations. >>: I want to play Tai-Chi. What are Ju and Hong doing now? Ju is in China and Hong is in America. Let me check whether they are online on Facebook. >>: I have not exercised for a long time. I want to play Tai-Chi. Let me check whether I can take a Tai-Chi course on Facebook. >>: I'm bored now, without anything to do. What are they doing now? Let me invite them to play Tai-Chi on Facebook. They are online. User C creates a Tai-Chi room on Facebook. >> Wen-Chih Peng: [indiscernible]. It's somewhat slow. >>: User A joins this room. User B also joins this room. Three users do Tai-Chi exercise at different locations. >> Wen-Chih Peng: There will be one master, and all the students will follow the master. They can find this on Facebook. >>: They can share their emotions with each other anywhere. >>: Let me check whether they are online on Facebook. >> Wen-Chih Peng: So there are only nine sensors, and [indiscernible]. 
If you like, you can download the application and run it on your Facebook. So this is the Tai-Chi platform. And we are now [indiscernible] our work. We are thinking about more interesting topics to let more people join the Tai-Chi communities. For example, we try to measure the similarities between different users. This one is four sensors for one person, and the other persons will also have these features, and we want to try to formulate the similarity measurement between users from their sensor readings. If you have the similarity measurements, then we can do a lot of recommendations. For example, we can recommend that the user should follow one master whose Tai-Chi behaviors are very similar. Or we can recommend that you join one Tai-Chi community which favors your style, okay. So this is our future work. And also, we want to use cloud computing for Tai-Chi. As you can see, the renderings of their motions are similar, so we try to use cloud computing to speed up the renderings. Another issue is that the similarity computations amongst users are very computation-intensive, so we want to use cloud computing to help us achieve these goals. But this is our future work. Now, I will tell you about another wireless sensor network application, an environmental monitoring service over cloud computing. As you can see in these figures, we have some vehicles with carbon dioxide sensors and [indiscernible] sensors. These cars have 3G wireless networks, and they can upload, or what we call publish, these sensor readings to the servers. And users from desktops or from other vehicles can subscribe to the service, and they can see, for example, in this one, the carbon dioxide map around National Chiao Tung University. So we proposed these concepts, and we implemented what we call the CarWeb platform. In the CarWeb platform, there are some vehicles, and they can use their smart phone or GPS logger to log the trajectories of the user and upload them to our server. Then the server will have the data points and road segment points, and we can use this sensor data to estimate or monitor the traffic status on the roads. We have implemented CarWeb for about two years, okay. This is the first version, so we used Windows Mobile. These are very ugly, ugly smart phones, and this is Windows Mobile 6, and we also implemented an Android version, like this one. This is the client version. You can download the applications and run them on your smart phones. As you can see in the two figures, this one can show you your location, and the user can upload their speed readings to the server. And this one shows you the traffic status around the nearby road segments, okay. The red line means a traffic jam is happening. Okay. Then think about having the GPS data points and trajectories of users, including the historical GPS data points and the realtime GPS data points. The problem we want to deal with is to estimate the traffic on the roads, okay. So here is the problem formulation, and before the problem formulation, we give you some assumptions. The first assumption is that we have a road network like this one. In our CarWeb platform, we have the road network of Taiwan, and also we have the GPS data, like this one. In this one, you can see one car and its location, speed, and upload time. 
The problem will be like this: the input is the traffic database and the queries, where a query is one road segment with a time, and the output is the speed on the queried road segment. For example, in this one, a user wants to know the traffic status of road segment E at time T4. This is our output, 50 -- 50 kilometers per hour, not miles. And what's the challenge behind this problem? The most challenging point is that little real-time traffic information is available, because users do not always want to share their GPS data points. So our prior work proposed a spatiotemporal weighted approach to estimate the traffic, published at SSTD 2009 and MDM 2009. And we observed two important factors. One factor is the temporal factor. This means the traffic always has [indiscernible]. The other factor is the spatial factor. This means that if sensors are nearby, their readings are almost the same. For example, you can see these two sensors, and their speed readings are almost the same, similar. And compared to this sensor, their readings, their colors, are very different. So if you have the temporal factor and the spatial factor, we can retrieve more GPS data points from the database and use these GPS data points to estimate the traffic. I will also give you one short demo, a video demo. >>: After logging into the CarWeb website, users can browse and manage their own trajectories collected by the CarWeb client from their smart phone or PDA. Users may click on any of them to show the whole trip on a map. By query, one can ask about a specific GPS point, and the corresponding information will be provided. We can also choose road segments from a map, and our system will report the estimated driving velocity. This is the entry point of our CarWeb client. The user may first log into the server, and the map will show up centered at the current location based on the GPS signal. By default, the CarWeb client will continuously track the user's trajectory, at one point every five seconds, and upload trajectories to the server periodically. Apart from the data collection, we also provide a traffic status estimation service based on the collected historical data. When the system uploads a user trajectory, it actually invokes the service query simultaneously. Later on, according to the query result, it draws different colors on nearby road segments, indicating different levels of expected driving velocity. You may have noticed from this demo that the estimation query result doesn't show up instantly as we drive. The reason is our server calculation is not fast enough. >> Wen-Chih Peng: So my student has told you the problem: you can drive the cars, but the query results for the road segments are delayed, okay? So our problem is, given a range query -- the circle -- how to efficiently estimate the traffic status of all road segments within the specified range. This is computation-intensive, okay? So we tried to use cloud computing to solve the problem. Similar to the Google or Microsoft map services, the whole space is divided into several grids. For each grid, we will use MapReduce to estimate -- for each grid, we will use one virtual machine and try to find the traffic status in that grid. So this is our first implementation, in which we tried to use HDFS and MapReduce with ten virtual machines, and the result is very bad, because it needs 20 minutes and 11 seconds. So I'm wondering why, and it's because there is no index structure there in HDFS. 
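The spatiotemporal weighting idea described above is, roughly, a weighted average of nearby historical and realtime readings. The sketch below is an illustrative toy, not the published SSTD/MDM 2009 formulation; the exponential weight form and the decay constants are assumptions made for the example.

```python
# Toy spatiotemporal weighting: estimate the speed on a queried road segment
# from GPS readings, weighting each reading by how close it is in time and space.
import math

# (segment_distance_km, minutes_before_query, speed_kph) for the available readings
readings = [
    (0.0, 2,  48.0),   # a realtime reading on the queried segment
    (0.5, 30, 55.0),   # a nearby segment, half an hour old
    (1.2, 90, 70.0),   # farther away and older; should count for less
]

def estimate_speed(readings, space_scale_km=1.0, time_scale_min=30.0):
    num = den = 0.0
    for dist_km, age_min, speed in readings:
        # Spatial factor: nearby sensors read almost the same value.
        w_space = math.exp(-dist_km / space_scale_km)
        # Temporal factor: recent readings matter more than old ones.
        w_time = math.exp(-age_min / time_scale_min)
        w = w_space * w_time
        num += w * speed
        den += w
    return num / den if den else None

print(f"estimated speed: {estimate_speed(readings):.1f} km/h")
```

The point of the weighting is that when few realtime readings exist for a segment, older and nearby readings can still be pulled from the database and contribute, just with smaller weights.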
So we tried another approach: we tried to construct index structures for the GPS data points and road segments, and we also increased the number of virtual machines. We introduced three measures. One measure is that we use HBase; HBase is the open-source database on the Hadoop platform. Another approach is that we use HBase plus five virtual machines. And finally, we do not use HBase; we only use HDFS, but we use some grid index data structures, also with five virtual machines. And this figure shows you that HBase is not good -- you can see that. And then if you use HDFS plus five virtual machines, the response time, the execution time, is much shorter. And this also shows you the same, a similar result. And in this figure, we show you that with different numbers of virtual machines, if you have more virtual machines, of course the execution time will be smaller. Also, different colors show you different query ranges. If you have a larger query range, the execution time, of course, will be larger. So here are some possible issues we found. When you use MapReduce, you need to divide the file into multiple files, but the problem is that the GPS data points are not uniformly distributed in the space, so we think we should devise good measures to divide the data file into smaller data files according to the data distribution. Okay. So if you use MapReduce, there are nine maps here, okay, but sometimes you will need to wait for those virtual machines with heavy load. Another possible issue we found is that index structures should be developed, okay. As we know, there are R-trees in traditional databases, but how about R-trees in cloud computing environments? And you can find that our keynote also mentioned realtime sensor data; this is why we also think of this as a challenging problem. So I conclude this talk. In our projects, we propose a new paradigm for cloud computing, what we call the sensor cloud: we use the sensor data to collect information about physical things and put all the sensor data into cloud computing environments. We have some preliminary implementations for these wireless sensor networks, and we found some possible issues -- for example, how to develop efficient data storage and retrieval measures, how to propose the partition scheme for MapReduce, and also how to deal with realtime sensor data. Of course, after this Cloud Futures workshop, I will have my students try Windows Azure. Yeah. >>: Thank you. I actually want to comment: I really like the use of the sensors, and I thought of taking Tai-Chi with Facebook. Questions? >>: I'm sure it was in the talk and I just missed it, I apologize. How big was the data for the second half of the talk, the spatial data? >> Wen-Chih Peng: The GPS data? >>: Yes. How big? >> Wen-Chih Peng: Five gigabytes, yeah. >>: So you did try just [indiscernible]? >> Wen-Chih Peng: Yeah, but the same properties. If you use a conventional database, the response times are very slow, yeah. >>: Not anywhere near 12 minutes, I'm sure? >> Wen-Chih Peng: No, yeah. >>: So you tested and it came out? >> Wen-Chih Peng: Yeah. And in the video clip, you can see that. That used a traditional MySQL -- we use the MySQL database. >>: Did you try a real database? [laughter] >>: A Microsoft database. >>: A Microsoft database is not MySQL. >> Wen-Chih Peng: Maybe I can try it. 
>>: Five-gigabyte scale, using MapReduce -- >> Wen-Chih Peng: I think the computation is very intensive. For the rendering of the map service on Google Maps or Microsoft's maps, you can find they use cloud computing. But our problem is more difficult, because you have the previous data points: we need to estimate the traffic of road segments within the query range. So the challenge is the computation. >>: [inaudible]. >>: Do I have time for one? There are lots of people in the spatial database community that keep coming out with new algorithms to speed up these nearest-neighbor queries and so on and so forth. >> Wen-Chih Peng: Yeah. >>: So do you believe that this will kind of disappear when you start running on the [indiscernible], so we won't need to have so many very, very small modifications of algorithms and performance once you start running on the cloud? >> Wen-Chih Peng: I just want to say that, of course, there is a lot of research work on traditional spatial queries in traditional databases, but the cost model of cloud platforms is different from the traditional database. So these traditional spatial queries should be redesigned, yeah. This is just a first try, and we will continue to work on it, and I hope that a cloud DB will be developed. Yeah. Just a try. Yeah. >>: Thank you very much. >> Wen-Chih Peng: Okay.