
>>: So the next speaker, Rob Gillen, needs no introduction, because he was just
introduced in the previous thing in the panel.
But he's from Oak Ridge National Laboratory, and he's been doing some interesting things with
Azure, looking at a lot of problems from the national laboratories' point of
view.
>> Rob Gillen: The original e-mail I got said this was a 30-minute talk. Then I
got here on Thursday, and the slide said talks are limited to 20 minutes. I
cringed, and then I realized that the panel preceded this one. I'm going to sort
of count that as my introduction, so we're just going to roll right in here.
I want to talk about two main topics, or two topics sort of under the umbrella of
data and the cloud. To give a little bit of context, we've been working on a
project specifically focused on this. If you don't know much about
Oak Ridge National Laboratory, we do a lot of computation, like a lot, meaning we have
both the first- and third-ranked supercomputers on the Top500 list sitting on the
floor, and we're currently putting in our third petascale machine. We do lots of computation.
All of those bad things we heard about in the first opening keynote, where we
heard about MPI and Fortran and C++, that's all we do. So I'm probably
the fish out of water in that group, but we do a lot of work with lots of tightly
coupled simulation, though we also do lots of other areas of science. In other
audiences I tend to say we cover all areas of science, and I've been told that's
technically incorrect and I get in trouble for saying it. We do lots of science, from
biology to neutron scattering, take your pick. We do lots of different things.
Our specific interest in cloud computing is how it fits into our computational
profile, specifically in what we define as mid-range computing. And because
our definition may be different from yours, we define that as
computing where the problems need between 256 and 1,000 nodes. So that's
sort of our middle bound for computational profiles.
And it's a multi-year project. The first year we're focusing specifically on data.
These are some thoughts and comments that come out of that. There are two main
topics: movement and then data services. For movement, we're focusing on three
very simple paradigms. One is where we have computation hosted locally at our
facility and we're accessing data that's hosted in the cloud, the second being the
reverse of that.
We actually have some massive data sets, sort of petascale size, and we're
looking at the case of what happens if we want to make those
available to the cloud, for maybe third parties or other people who are doing
computation there. The third case, which is the easiest of the three, is
scenarios where both computation and data live in the same cloud.
While there are multiple cloud providers out there, we've been focusing our
efforts on the two that are represented here, Amazon and Microsoft, and
playing with different keys there. Probably unlike most, I'm more of a Microsoft
coder by definition, so our samples will lean that way, though we've been working
with both.
We came across something I know is going to amaze you all: the Internet is
slow. To put this in context, Oak Ridge has
uniquely fortuitous network connectivity. There are a number of national
academic networks that make a loop around the country, and then there's
this big fat pipe that goes from Atlanta to Chicago, and it bends directly through
our building. So we have multiple ten-gig connections, a couple of 40-gig
connections. We're in the process of putting in a 100-gig connection to the
Internet right through our building. So we have significant bandwidth.
And so in that context, as we're studying things, we're trying to keep things in
perspective, right. So things that may work for us may not work for other people.
Things that work at our lab may not work for that typical scientist we were talking
about earlier in the panel or at a research institute or collegiate institution or
things like that.
So we're looking at different scenarios. And one of the biggest things, you know,
we started off with some very simple baseline parameter sweeps. We started off
by generating a bunch of data in different file sizes and doing parameter
sweeps of those file sizes, both up and down, going from our place to Amazon's
cloud and to Azure's cloud, doing downloads and inter-cloud
communications.
And probably one of the most significant things that came up is more
educational, not so much from my standpoint, coming from industry and the
Internet, but for our researchers. What caught them off guard a little bit, having
spent most of their time working in local networks, was the variance that the
Internet introduces. So the next two slides sort of hint at this.
This is duration by file size, showing one standard deviation. This is comparing
the same profile from our labs: Microsoft against Amazon. The key here is not to
say, well, Amazon's better or Microsoft's better
because it has less variance. That's sort of irrelevant to the point. The point is
there's a lot of variance. And there are a number of factors that play into it.
This next slide actually illustrates it much more. This is a very non-intuitive
variance pattern, and there's not a big fancy scientific reason for why this
happens. It's called network congestion, right. It has nothing to do with the
science, nothing to do with the bits. It has everything to do with the fact that we
ran our tests sequentially. We worked from one file size all the way up, and when
we hit this file size, it happened to hit a blip. There happened to be other traffic
on the network that interacted with it and caused things to slow down. That's the
reason.
It's a very key and important thing for researchers doing this sort of
detached or segregated computation, where you've got computation disjoint from
your data, say in the cloud, to take into account in your computation and in your
patterns. So again, probably the obvious takeaway from this is you want to
put your data as close to your compute as you can. That's sort of an obvious,
known statement.
But there are still cases where we think it may be too expensive
to move data. As you can see, we did parameter sweeps from 2 KB up to a gig.
This is not just moving massive files. This is moving small files simultaneously
and in parallel.
So we spent some time looking at this and came up with an again somewhat
obvious but effectual statement: multithreaded downloads or file
transfers make significant impacts, right. So again, something that may not be
as intuitive to classical HPC people is that moving data across the Internet with
multi-threaded transfers can provide some interesting improvements.
So essentially what we did is we adjusted the libraries and test
harnesses that we had earlier to say, rather than just opening up a socket and
pulling a file down, first of all ascertain what the file size is,
split it into N parts, open up that number of threads and pull it down.
And we did parameter sweeps for all of the file sizes as well as the number of
concurrent threads. For downloads that's very easy, because both providers, and
most others, provide HTTP interfaces that support byte ranges, so you can simply
say, give me this byte range, with different threads working on different byte
ranges. That was very easy.
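A minimal sketch of that byte-range approach, assuming a server that honors HTTP Range requests (as both providers' storage services do); the URL, thread count, and helper names here are placeholders rather than the actual test harness:

```python
# Ascertain the file size, split it into N parts, and pull each part on its
# own thread via an HTTP Range header, reassembling the parts in a buffer.
import concurrent.futures
import urllib.request

def fetch_range(url, start, end):
    # Request one slice of the file.
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

def parallel_download(url, num_threads=8):
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        size = int(resp.headers["Content-Length"])
    chunk = size // num_threads
    ranges = [(i * chunk,
               size - 1 if i == num_threads - 1 else (i + 1) * chunk - 1)
              for i in range(num_threads)]
    buf = bytearray(size)
    with concurrent.futures.ThreadPoolExecutor(num_threads) as pool:
        for start, data in pool.map(lambda r: fetch_range(url, *r), ranges):
            buf[start:start + len(data)] = data
    return bytes(buf)
```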
The upload was actually much harder. And so what we did is we
looked at the different API styles and the things supported by the different
providers, and we sort of stole an implementation idea from Microsoft. With
Amazon's S3, when you want to put up a blob, you open up
a pipe, say here's my data, go, go, go, go, go, close it, and you finalize your
data. A very standard upload.
Microsoft does the exact same thing until you hit a limit of 64 megs. Anything
above 64 megs, they require you to break up into blocks; I think
each block has to be less than four megs. You issue a number of puts,
and then, like they demonstrated the other day, once you get them all up there,
you do a commit, say here's my list, and it assembles that into a file.
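A sketch of that block-upload pattern using today's azure-storage-blob SDK (this talk predates it, and the size limits have since changed); the connection string, container, and file names are placeholders:

```python
# Stage each block with its own Put Block call, then commit the block list so
# the service assembles the blob. Because blocks are independent, different
# threads can stage different blocks of the same file in parallel.
import base64
from azure.storage.blob import BlobBlock, BlobClient

def upload_in_blocks(path, conn_str, container, blob_name,
                     block_size=4 * 1024 * 1024):
    blob = BlobClient.from_connection_string(conn_str, container, blob_name)
    block_ids = []
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(block_size)
            if not data:
                break
            # Block IDs must be base64 strings of equal length within a blob.
            block_id = base64.b64encode(f"{index:08d}".encode()).decode()
            blob.stage_block(block_id=block_id, data=data)
            block_ids.append(block_id)
            index += 1
    # The commit is what turns the staged blocks into a single file.
    blob.commit_block_list([BlobBlock(block_id=b) for b in block_ids])
```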
The beauty of the Microsoft approach is it allows us to do sub-file parallelization.
So we can transfer one file and do parallel
threads. What we did then, so we could compare things similarly, is we
established a data proxy: we basically built our own data service and
deployed it both to Azure and to EC2 in this case.
The point of that data service is to behave very much like
Microsoft's blob storage does, in that it allows block transfer. It does a couple of
additional things, in that it allows us to do CRC checks on each block as well as
adaptive compression on each block. As we're doing our transfers, we
do selective compression on each block and then do the reassembly on
the server side, and then optionally we
would transfer that file from the compute location over to the storage location.
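A sketch of that per-block processing, checksumming every block and compressing only the blocks where compression actually helps; the wire format here (id, flag, CRC, length, payload) is invented for illustration:

```python
# Encode one block for transfer: CRC the original bytes, then keep whichever
# representation (raw or zlib-compressed) is smaller. Decode verifies the CRC.
import struct
import zlib

HEADER = ">IBII"  # block id, compressed flag, crc32, payload length (13 bytes)

def encode_block(block_id, data):
    compressed = zlib.compress(data)
    use_compressed = len(compressed) < len(data)   # adaptive compression
    payload = compressed if use_compressed else data
    crc = zlib.crc32(data) & 0xFFFFFFFF
    return struct.pack(HEADER, block_id, use_compressed, crc, len(payload)) + payload

def decode_block(buf):
    block_id, flag, crc, length = struct.unpack(HEADER, buf[:13])
    payload = buf[13:13 + length]
    data = zlib.decompress(payload) if flag else payload
    if zlib.crc32(data) & 0xFFFFFFFF != crc:
        raise ValueError(f"CRC mismatch on block {block_id}")
    return block_id, data
```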
And what we saw was, well, sort of what you'd expect. We saw some pretty
amazing performance increases. This is actually the download direction. So
even for a second thread, we saw improvements easily pushing double-digit
percentages, all the way up to over 800% performance improvements for transferring
stuff. Again, that's because we're taking advantage of, if you will, weaknesses in
the Internet and the way things work in opening up pipes.
So one of the topics we want to throw out is: this
all works and this is great, but if I'm a scientist, or if I'm trying to consume one of
these cloud services, we view this as an area that needs significant work. In
neither of these cases were the local disks or the target disks the gating factor. It
wasn't local network. It wasn't local memory or CPU, nor was it remote CPU or
remote memory. It was simply the interconnects between them.
So we're focusing our own efforts internally, and also making a call
to the community: how do we solve this problem? If I've got a gig connection
at my facility, and Amazon or Microsoft has a gig connection at their place,
and neither is full, why can't I use as much as is available, right?
Why can't I get closer to saturating that network pipe? So that's an area where
we're looking to see significant improvement, and we would call upon
the community to work on it and focus on it.
The second topic is services, and this follows on to one
of the points we made earlier. I think we need to step back and look,
when we're talking about cloud computing, at what it's good for and
how it augments our computational platform, and one of those areas that's sort of
natural is that of data distribution and data sharing.
And there are a lot of reasons and discussions as to why that should be. Our facility
does a lot of data distribution. We host a bunch of the IPCC data
for the climate stuff, and I did a little digging as to how that works, and
there's actually this big massive tape library. And there's a web page you go to,
and you can go through the catalogue and select things. You check a box, it
submits a job, it goes to the tape library, moves it over to a computer. You get an
e-mail with a download link that's good for an hour or something. You pull it
down and so forth. It works, but it feels very arcane, very difficult. And
what I'm getting in that case is a bunch of NetCDF files, which, if
you're a domain scientist or you work in that space, you know what that is and
you know how to work with it, and that's fine.
If you're not, you look at the file and you say, what's this? And if all you wanted to
know was, for a given point in time, what was the temperature for this
lat/long combination, you've done an awful lot of work and you still have an awful
long way to go to get that answer, right. You have to find libraries to
interact with it, which certainly are there, but then you've downloaded this gig
file for roughly a K's worth of information that you need.
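For instance, pulling that one value with the netCDF4 Python library looks roughly like this; the file name and the variable names (lat, lon, tas) are hypothetical and vary between data products:

```python
# Look up the temperature nearest one lat/long at one time step. The whole
# file still had to be downloaded to answer this single-point question.
import numpy as np
from netCDF4 import Dataset

ds = Dataset("climate.nc")
lat_idx = int(np.abs(ds.variables["lat"][:] - 35.9).argmin())    # nearest lat
lon_idx = int(np.abs(ds.variables["lon"][:] - (-84.3)).argmin())  # nearest lon
temperature = ds.variables["tas"][0, lat_idx, lon_idx]  # first time step
print(temperature)
```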
So we saw earlier a conversation about Dallas, and there are some really interesting
things there, not necessarily from a product standpoint but in what they're
trying to do conceptually. They're taking a data service and saying one of their key
selling points, if you will, is, we're going to provide a commonality across all of
these data services. We're going to provide a similar interface, whatever it may
be. In their case, it's OData, which gives you JSON, optionally JSONP, and
AtomPub.
As we looked at that and were playing with it, we actually did some work with, I
don't know if it's technically accurate to say a predecessor to Dallas, but it's
called OGDI, the Open Government Data Initiative, which is an SDK that was put
together by Microsoft's DPE group out of Microsoft Federal in DC. It's basically the
same notion as Dallas, just less polished, and it didn't have all of the magic
corporate sponsorship behind it.
Same ideas, though. We did some kicking around with it, and, again stating a
brilliant observation, science-friendly formats often don't equal
Internet-friendly formats. When we think about Internet-friendly formats, we're
thinking about AtomPub, XML, JSON, things that are easily consumed, things that
people can, while they probably wouldn't want to make a steady diet of it,
open up in a text browser, see, consume, and sort of grok what the data is.
On the other hand, there are the scientific formats, FASTA maybe being an exception
to this, but even that's a little weird to look at. NetCDF in particular is this
big hierarchy of binary data. Without the proper tools, there's no
way to dig into it.
So we did some experiments. We took a couple of NetCDF files and tried to
publish them to the Internet, saying okay, what if we wanted to make this
available as a service, ignoring by choice the whole OPeNDAP approach and
what's accepted there, because again that's domain specific within a realm.
We're trying to say, what if you take a completely Internet-friendly approach
all the way down the road.
Again, we saw some amazingly, I guess, blatant or expected results. Well, stepping
back, we took a slice of data, which was all of the temperatures for
a given point in time across the globe, based on their grid, which I
think is a five-degree grid. That ended up being somewhere in the order of 8,200
data points, each data point consisting of a time stamp, lat/long and a temperature
value. So not much data.
As aggregated within the larger NetCDF file, it was a K or a couple of Ks
worth of data. When you flatten that to a CSV file, if you will, and stream it across,
you're talking about a couple hundred K worth of data. When we moved it and
exposed it as an AtomPub service, it got really big. That same 8,200
records, if you look particularly at the first and third lines, became almost
nine megs of XML coming down the line.
There's an obvious problem here in that the data services in Azure at that time
didn't, and I don't think they do yet, support compression
automatically, so you don't get HTTP compression on your transfers. I
understand there's work coming there. Right now, it just gives you raw XML,
which, as we all know, is very bloated.
JSON is better, but it's still pretty significant compared to the original,
which I actually don't have on the chart. So the payload got very, very big.
So I guess, again stating future work here as well as a call to the
community: as we, as an industry or as a research community, adopt
cloud computing and adopt cloud computing strategies for data distribution
and so forth, I'd like to see us move towards consistency across the protocols
that we use. There's certainly a justification and a
valid reason for binary or, if you will, native formats, but I think it's
incumbent on us, at least in my semi-perfect view of the world, for the
notion of broad distribution of data, that if you're going to distribute data in the
native formats, it should be associated with more Internet-friendly formats, so I
can get it both ways or either way I'd like. Ideally, expose it as
a service that will adapt based on my call. So I use the same URI
end points and basically pass flags that say maybe I want it as JSON or AtomPub.
Maybe I'm playing with a small data set and have a very easy way to
interact with it. And as I'm moving down the chain and I figure out what I want, I
pass a different flag and get the same results, larger results, in a more
friendly format for that particular thing.
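A toy sketch of that same-URI, format-flag idea, loosely modeled on OData's $format option; the route, field names, and in-memory records are invented, and a real service would stream from storage:

```python
# One endpoint, two representations: ?format=csv for the compact form,
# JSON (the default here) for easy programmatic consumption.
from flask import Flask, Response, jsonify, request

app = Flask(__name__)
RECORDS = [
    {"time": "2010-01-01T00:00Z", "lat": 35.0, "lon": -85.0, "temp": 276.4},
]

@app.route("/temperatures")
def temperatures():
    fmt = request.args.get("format", "json")
    if fmt == "csv":
        rows = ["time,lat,lon,temp"] + [
            f'{r["time"]},{r["lat"]},{r["lon"]},{r["temp"]}' for r in RECORDS]
        return Response("\n".join(rows), mimetype="text/csv")
    return jsonify(RECORDS)
```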
I think there needs to be work in the area of binary XML and some standards
along that line, wherein we can express formats that have some of the richness,
if you will, of AtomPub, because AtomPub is great for a lot of things in that it's
self-describing and richly and easily consumed by a number of clients. We
need to shoot for formats that have those qualities without having the
encumbrance of being overly bloated.
So that's the future work. The last statement here sort of goes
to the overall theme: I think it's incumbent on us as researchers, as we're using
the cloud and looking at how it changes what we're doing, to be a little
more open with the data that we're developing or publishing. Working
through some papers that I was
editing for somebody else, they made some very audacious claims with very little
data to back them up.
One of the things the Internet provides us is the ability to link large data sets with
those results and to publish them concurrently, allowing people to do that
validation. We sort of hinted at that in some of the other conversations. And I
guess, I think we can do better as a community. I think we can provide that
proof, provide that data, such that a broad community of people can consume and
use it. That's it.
[applause].
>>: Questions?
>>: I think you sort of hit on it. One of the reasons why binary formats sort
of split [inaudible] richer.
>> Rob Gillen: Yes.
>>: I think it's somewhat fundamental because the atomic unit of work decided to
do this [indiscernible] time stamp [indiscernible] can be made, right, so if you
actually want to draw that locally, because that's what you're doing, is maybe
[indiscernible] splitting up a gig because it's XML. So I guess I was wondering, is
there a way to make science-friendly formats Internet friendly, rather than
take Internet-friendly formats and use them more for science?
>> Rob Gillen: I guess that's the --
>>: [indiscernible].
>> Rob Gillen: If every scientific domain used NetCDF, sure. That's the point,
right. So the point I'm making is, much the same way that
AtomPub has become very widely adopted, you know, if you look
at protocols like OData and GData and these things, that approach is very
broadly used, right. And it's sort of being used in science
slightly backhanded, coming from industry and from the larger group. I'd like
to see -- right. The point is I'd like to see us take that same notion, whether it's
AtomPub or some variant, but provide it in a binary format, so it's domain
and discipline agnostic.
>>: I guess -- I'm not trying to defend the CS database version or whatever, but
they were saying that they are [inaudible] just because there's not [indiscernible]
or something.
>> Rob Gillen: Right.
>>: Like they're just arrays.
>> Rob Gillen: Yes, but if you're not a scientist, do you have any idea what an
[indiscernible] is?
Not even a domain scientist.
>>: If you're not a scientist do you know [indiscernible] that's what they would
say that their data [indiscernible]. Anyway --
>> Rob Gillen: Whether it's the development of a
new format, an adaptation of another one, or simply the, let's see, the
popularization of an existing format, I think there's work that could be done in a
number of different ways. Because frankly, I don't care what the format is, so long
as it's broadly available to a large populace, not just that 1% of people
who know what MPI is. Yes?
>>: Did you guys take a look at any sort of WAN optimization technique,
scientists or things like [indiscernible]?
>> Rob Gillen: We did not. We sort of acknowledged that they were there and
[inaudible]. Again, we were thinking more of the generic consumer: both assessing
what's currently out there, and also looking at what that generic consumer would
have exposure to or access to.
>>: [Indiscernible].
>> Rob Gillen: Yeah, so, yeah, I didn't go into that very well. I really wanted
people not to look at the fact that one was better than the other, because there
are about 300 different reasons for that. We did -- I took the results, ran up to
our network guys and said, look at this, why?
And we're connected both to Internet2 and ESnet, and we actually saw that,
because of the way Microsoft uses Akamai, Southern Crossing, which is one of
the big network providers, does a lot of sort of chatty advertising on
the routers, everything Microsoft this way, which is not actually the best route for
us to get to Azure, right. So our traffic to Azure was actually taking a
performance hit because of sort of other things on the net. That's one of the key
things that we took away from that. It's not just -- there's a plethora of things you
have to work through to get sort of that optimal network transfer that most people
who traditionally live inside a given data center wouldn't think of. The point of
that was to show the variance and that there were sort of odd things happening.
I've been very careful not to say this is better, that is better. It's the deltas
between the two --
>>: Actually, we've had some experience where research [indiscernible] and it's
totally different than how the stuff in the middle [indiscernible].
>> Rob Gillen: It does.
>>: It's interesting, because I instantly thought of BitTorrent when you were
talking about the multithreading. That was supposed to solve the bottleneck. But
what you're saying is maybe the bottleneck has been in the network the whole
time. If you could --
>> Rob Gillen: So Amazon actually has support for BitTorrent coming down, so if
you're using -- which is novel in some ways. If, say, we've just done a big
run of climate data and I'm tasked with distributing it across the world, and I've
got lots of people pulling the same data at the same time, then that model is
really great, because in theory BitTorrent only works well if there are lots of people
seeding. If you're the only person pulling the data, it doesn't matter, all right?
So it's interesting in that, with what BitTorrent really does, presumably some of the
people who are seeding are closer to you than the actual source of the data,
right. And if you actually are closer to the actual source, you're not going to see
the benefit from the seeds coming from other people.
>>: One of the significant things was your use of the parallel streams. The
routing makes a huge difference.
>> Rob Gillen: It does.
>>: Is that piece of software available?
>> Rob Gillen: It will be. The point there, though, is yes, we came up
with a way to solve it, but I would like, I guess in my naive, idealistic view
of the world, I shouldn't have had to have done that, right. And
I guess that in and of itself is posing a problem to be solved,
right. It's not so much a knock on any cloud provider as it is on the network in
between, right. In theory, I should be able to open a pipe and
say here's all my data, and go.
>>: [Inaudible] been trying to do that for [inaudible] never would drive. Liability
was an issue. It's very [inaudible] and it is the way [inaudible]. I think at this
point, we can thank Rob.
>> Jie Li: Hello, my name is Jie Li, and today I'll talk about some of our
experiences using Windows Azure to process MODIS satellite data. This project
has been a collaboration between the University of Virginia eScience Group,
the University of California, Berkeley, Lawrence Berkeley National Lab and
Microsoft Research.
First I will give you some background information. eScience today is
becoming more and more data centric and data intensive, which brings great
opportunities to accelerate scientific discoveries. On one hand, the
increasing data availability from both large scientific instruments and
large-scale inexpensive ground-based sensors has created an invaluable data
repository to enable new science explorations. And on the other
hand, various computational models with increasing complexity and precision
are being used today to produce better scientific results.
So with these great opportunities, a natural question to ask is, how can scientists
easily access these tremendous volumes of raw data and apply complex
computational models on this data to produce meaningful scientific results? And
more specifically, do scientists have sufficient access to computational resources
and enough application and tool support to enable them to easily manage this
large-scale data and computation? If the answer is no, then what can we
do as computer scientists to help resolve this problem?
So in this project, we have encountered a concrete example of this problem, and
our goal is to build an application upon a scalable infrastructure to help
environmental scientists easily access, manage and analyze the large-scale
remote sensing data from the MODIS satellites.
To give you some background, MODIS is short for Moderate Resolution Imaging
Spectroradiometer, and there are currently two MODIS satellites which
are viewing the entire Earth's surface every one to two days and acquiring data in
36 spectral bands. This data is separated into multiple data products according
to the surface types, such as atmosphere, land and ocean. In general, the
MODIS data are very important for understanding the global environment and
various Earth system models.
However, scientists currently face a number of barriers to using MODIS data in
their research. The first barrier is data collection. The MODIS source
data are currently published and maintained on multiple FTP sites. Though this
data is publicly accessible, it's not easy to query or get a subset of the source
data for a specific area or geographic tile, because the metadata for these sources
are maintained separately and there is no useful interface to support such
queries.
The second barrier is data heterogeneity. The different data
products have different time granularities and imaging resolutions, and
before scientists can apply their computational models and do analysis work on
this data, the heterogeneous data must be transformed into a
uniform format for the different data fields. Even worse, the different products
use two different projection types, called swath and sinusoidal, and these two
projection types use two different geographic coordinate systems
for mapping the entire Earth.
Also, it is an extremely expensive computing process to reproject one projection
type to the other. But for many scientists who want to work on all of these
data products, this is a required step. So we cannot avoid the computation here.
And finally, the data management and computational resource requirements are
overwhelming. For example, we currently have a use case: there is a
graduate environmental science student from UC Berkeley, his name is Yung Will,
and he's using a scientific model to compute a science variable from a total of
ten years of data covering the whole U.S. continent, and he needs those
results in order to finish his Ph.D. dissertation before this July.
However, the data management requirements are really huge. It involves a total of
five terabytes of source data, which is about 600,000 files, and after we do the
reprojection, in order to generate the harmonized data, there will be an extra two
terabytes of data, and the whole process will take about 50,000 CPU hours of
parallel computation. The fact that this computation can be largely
parallelized is important, because we need to scale out the computation and
use as many computational resources as we can. But still, it will be a long
process to finish the computation.
So we built AzureMODIS, which is a client-plus-cloud solution; this is a
MODIS data processing framework we built on the Microsoft Windows Azure cloud
computing platform. First, we leverage the high flexibility and scalability of
the cloud infrastructure and the services provided by Windows Azure.
Second, we leverage the unique dynamic, on-demand resource provisioning
capability of the cloud infrastructure in order to finish the computation in a timely
manner. In this data processing framework, we completely automate the
data processing tasks, which would otherwise be done manually by
scientists, to eliminate the barriers and complexities. And finally, we provide a
generic reduction service in which scientists can run their own arbitrary analysis
executables.
Okay, next I'll give an overview of the AzureMODIS framework.
I guess all of you have seen this slide many times, so I will just skip the basics of
Windows Azure here.
Okay. So this graph shows the high-level overview of our data processing
service. There are mainly two parts to it. The left side of the graph is the front-end
data service web portal, which scientists, users and developers can
use to interact with the system. The right side of the graph is the background
computing system, which includes three main stages: the data collection stage,
the reprojection stage, and finally the scientific analysis and reduction
stage. Actually, we currently have two sub-stages in the analysis step, so there
is actually a four-stage pipeline in the background system.
A typical computation run includes a series of steps. First, the scientist
submits a request, in which he specifies the requirements for his computation,
through the web portal.
Second, the request is sent to the job request queue in the Windows Azure
system, and then further processed and parsed by the service monitor, which
will then dispatch a large number of parallel tasks into the task queues.
And there are a number of service workers working in the back end, and each
worker will keep pulling tasks from the queue. After they get a specific
task, they will query the metadata in the Azure tables in order to locate the
specific source data files on external FTP sites and then download them to local
storage.
The specified source data are then uploaded to Azure blob storage
in order to cache the data for future use, which we'll talk about later. And then
the heterogeneous source data will be reprojected into a uniform format before
scientists can run analytical work on them.
After the reprojection, since the scientist, when submitting his request, will
also have uploaded his own executables to work on the data, in this reduction
stage the worker will invoke those executables to do the analysis. Finally, when
the computation is finished, the system will send a single download link to the
results produced by these executables.
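A sketch of that generic-worker loop using today's azure-storage-queue SDK (the original used the 2010-era Azure APIs); the queue name, connection string, and process_task function are placeholders:

```python
# Each worker polls the task queue, processes one task at a time, and deletes
# the message only after success, so an instance failure just makes the task
# reappear for another worker once its visibility timeout expires.
import time
from azure.storage.queue import QueueClient

def process_task(content):
    ...  # download source data, reproject, or run the reduction executable

def worker_loop(conn_str):
    queue = QueueClient.from_connection_string(conn_str, "modis-tasks")
    while True:
        got_one = False
        for msg in queue.receive_messages(visibility_timeout=600):
            got_one = True
            process_task(msg.content)
            queue.delete_message(msg)
        if not got_one:
            time.sleep(5)  # back off while the queue is empty
```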
Next I want to show you a short demo which shows our system live. This is
the Azure web portal for our deployment. As you can see, there are a number of
different roles. There's a web role hosting the web portal, called MODIS
data service, with one instance. There's also a single-instance service
monitor, which is the master of the computing system. Finally, we can see we
currently have one service worker, because we currently have
no computation ongoing, so we keep the service at minimum size to just
maintain availability.
And now, I will go to the web portal and submit a test request. So we can go to
the data reduction service and specify the years and days. I will just choose
two days. Also the satellites, each of them, and also the tiles, which are the strings
for the geographical areas. Here, we'll choose all the U.S. tiles. And also the
request e-mail I will use, which is a test account.
Here is the step to upload the reduction executable, and we do have a test
executable we got from our members. And open it. Actually, there is an optional
second-stage reduction, but here we will just enable a single-stage reduction.
Then I will submit the request. Okay, it shows it has been sent successfully,
and then we return to the main page, and hopefully, in a few seconds, I
should receive a confirmation e-mail from the service. Okay, so I
already got the notification e-mail saying the
computation has been started by the background system.
So now, we just wait for the service to be done and then go back to the slides.
And after we submit our request, what actually happens behind the
scenes? Our data processing framework is totally built on the three
scalable storage services of Windows Azure. As we send a job request to
the queue, there is a single service monitor which will pull the job, parse
it into a large number of parallel single tasks, and then dispatch them to the
specific task queue.
And then we have a number of generic worker roles running, and they keep
pulling the single tasks from the queue. When they get a specific task,
they stage in the necessary source data from blob storage, work
on it, produce the scientific results, and send them back to blob
storage. And finally, when the whole service is done, there will be a single
download link sent back to the user.
Also, we can see that we have persisted the status information of each
request, job, and single task into the Azure tables, and this is
helpful both for logging our history and for diagnosing our
computations, which I will show you later.
Data reuse is a very common scenario in the processing of MODIS
data. So currently, in our system, we have implemented two-level data
caching. We use a unique global namespace, and each data file in blob
storage has a globally unique identifier.
So we can either pre-download or download all the source files from
external FTP sites and then cache them in blob storage for all future
computations to enable reuse. And also, we can compute or pre-compute all the
reprojection results for future reuse across different runs of computations.
On the local machine level, since each small-size instance has around 250
gigabytes of local storage, we currently choose to cache the large data
files for reuse across tasks. Why don't we cache all the data? Because
the 250 gigabytes is not sufficient to cache everything.
And there are some cost-related trade-offs to be made, but given the
time limit, I will not go through them here.
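A sketch of the blob-level cache lookup: each source file maps to a globally unique blob name, and a worker checks blob storage before going back to the external FTP site. The container, naming scheme, and ftp_fetch helper are hypothetical:

```python
# Check the cache first; on a miss, pull from FTP and populate the cache so
# every future computation over the same tile and date reuses the blob.
from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import BlobClient

def get_source_file(conn_str, product, tile, date, ftp_fetch):
    blob_name = f"{product}/{tile}/{date}.hdf"  # globally unique identifier
    blob = BlobClient.from_connection_string(conn_str, "modis-cache", blob_name)
    try:
        return blob.download_blob().readall()   # cache hit
    except ResourceNotFoundError:
        data = ftp_fetch(product, tile, date)   # cache miss
        blob.upload_blob(data)
        return data
```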
Okay. As for the reduction service, as we saw, scientists can upload their
analysis binaries upon request, and there are two main benefits. First,
they can easily debug and refine their scientific models in their code. And
secondly, we can cleanly separate the system code debugging from the science
code debugging. And we have an optional second-stage reduction.
This graph shows the scalability of our service; we've done a series of
experiments. For this specific experiment we show performance using different
numbers of instances working on 1,500 tasks in total. We also ran the same
set of tasks on a single desktop machine. And although the desktop
machine's capacity is roughly double that of a single Azure instance, we can still
get roughly 90X speed-up using 150 instances. And second, we got almost
linear scalability.
Next, I will talk about two important features that we think are key capabilities
provided by cloud infrastructure. First is dynamic scalability. We currently
use the Azure management API to dynamically scale instances up and down
according to workload, so as to improve cost effectiveness and achieve better
utilization.
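A sketch of the scaling decision only; the actual instance-count change goes through the Azure management API, which is stubbed out here, and the per-task estimate, deadline, and bounds are illustrative numbers:

```python
# Pick enough workers to drain the queued tasks before the deadline, within
# a configured floor and ceiling, then apply the change only if it differs.
import math

def target_instances(queued_tasks, minutes_per_task, deadline_minutes,
                     lo=1, hi=150):
    needed = math.ceil(queued_tasks * minutes_per_task / deadline_minutes)
    return max(lo, min(needed, hi))

def set_instance_count(n):
    raise NotImplementedError("call the Azure management API here")

def rescale(current_count, queued_tasks):
    target = target_instances(queued_tasks, minutes_per_task=6,
                              deadline_minutes=120)
    if target != current_count:
        set_instance_count(target)
```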
If we go back to the web portal, I think something interesting
should be happening there, because as we just submitted a new request, the
service monitor will have the information about the computational requirements,
okay. So as we can see, it's updating the deployment. This is because the
service monitor keeps monitoring the current workload; it will
estimate the total computational requirements and then start up new instances in
order to finish the whole computation within the appropriate deadline.
There can be some problems, though. The most severe problem we observed is
instance shutdown. Why is it a problem? Because currently, only the Azure
system can decide which instance to shut down. That means instances may
be chosen randomly by the Azure system, and they
may be shut down during task execution. So we should either provide
some failure recovery mechanisms or just wait for all instances to finish their jobs
before we invoke the instance shutdown API. So there can be some tricky
issues here.
Also, currently, computing instance usage is charged by the hour,
which means the CPU minutes will be rounded up to an integral number of hours.
So it's definitely not very cost effective to start a new instance, run it for ten
minutes and then shut it down.
Very recently, we ran four test cases to test the performance of increasing
instances dynamically. For each test case, we increased the number of
generic worker roles from one to a larger number: 1 to 13, 1 to 25, 1 to 50,
and 1 to 98. And there are two interesting points. First, it generally takes
longer to start more instances dynamically. And secondly, in all four cases,
all instances start up at almost the same time. This is different
from what we previously observed, which was an incremental startup
pattern. So that reminds us we should not only focus on the development of the
application itself, but we should also keep an eye on the underlying cloud
infrastructure.
By the way, in contrast, the shutdown time for the instances is relatively small,
which is usually within three minutes.
Okay. Another important feature is fault tolerance. Tasks can fail for many
reasons. For example, there can be broken or missing source data files. Also,
the reduction tool may crash due to code bugs. And finally, failures
caused by system instability can be very common when computation goes to
a larger scale. The first two are unrecoverable failures, and the third one is
actually recoverable.
So we have implemented a customized task retry policy. In short, tasks with
timeout failures will be retried, and tasks with exceptions caught during
execution will be immediately retried too. And the task will be
cancelled after a total of two retries, because if it's unrecoverable, it's not
helpful to retry it forever. So a question may be raised: why don't we just
use the queue message visibility settings for failure recovery? Given the time
limit, I cannot explain fully here. In short, that can only handle timeout
failures, but not the other types of failures.
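A sketch of that retry policy; the task runner and queue plumbing are placeholders:

```python
# Retry on timeouts and on caught exceptions, but cap at two retries: an
# unrecoverable task (broken data, code bug) will never succeed, so retrying
# it forever just wastes instance hours.
MAX_RETRIES = 2

def run_with_retries(task, execute):
    for attempt in range(MAX_RETRIES + 1):
        try:
            return ("succeeded", execute(task))
        except TimeoutError:
            continue   # recoverable: likely transient system instability
        except Exception:
            continue   # possibly unrecoverable: bad source data or a bug
    return ("cancelled", None)
```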
Okay. So to give some conclusions: we believe cloud
computing provides new capabilities and opportunities for data-intensive
eScience research. Also, dynamic scalability is a very powerful
mechanism, but instance startup overhead is still not trivial currently.
And finally, the built-in fault tolerance and diagnostic features are very important
in the face of common failures in large-scale cloud applications and systems.
And this is our list of future work. Do we still have more time? Okay. Thank you.
So we plan to scale up the computation from the U.S. continent to the global scale,
because with our current system, our UC Berkeley graduate student can not
only finish the ten years of U.S. data processing, but he can also go beyond that
before this July.
Also, we plan to develop and evaluate a generic dynamic scaling mechanism
with AzureMODIS. Finally, and particularly interesting, we want to evaluate the
similarities and differences between our framework and generic parallel computing
frameworks such as MapReduce. Thank you. Questions?
>>: I have two questions. One is short and the other might be long. The first
one is, I work with scientists that use MODIS to do analysis. So can I use your
application, but with other kinds of analysis functions? This is the first
question. And the second question is that they like to look at intermediate
results. How would you go about doing that in this environment?
>> Jie Li: Thank you for asking those two questions. I didn't have enough time,
but now I can explain more. So yes, for the first question, I think the arbitrary
reduction executable upload is really powerful, because with that, scientists can
upload their own code written in MATLAB and other [indiscernible] code, et
cetera. So they can specify their own computational models and algorithms to
work on the data. And the back-end system is capable of providing uniform,
reprojected data, so they already get the same format for those data points. So
yes, we can support any other customized code to do the
analysis work.
And for the second question, I just skipped the slide here, which is a very
important part of our system, and I can show you now. So the scientist can send
his request to the system, right, and in the scenario of a cloud computing
infrastructure, the user usually doesn't know how many instances are currently
working on his computation, so he must be curious what the progress is during
a long run.
So actually, we've implemented a separate component for status
monitoring and diagnosis. We can just use this request as an example:
we got a unique computation ID, and we can use this ID to monitor the progress in
realtime, as shown here.
So the left bar shows the total tasks inside the
computation we just submitted; four have succeeded and 26 are
processing. So this shows progressively, in realtime, how much of the
computation has been finished. We can even click into the details and
see the output log and error log; this computation looks fine, right. But if
we go to some really large-scale computations, failures can happen.
I've switched to a previous request, a computation which includes almost
3,000 tasks, and we got a total of 37 failures, which are unrecoverable. So we
finished this computation with these failures.
>>: [Inaudible].
>> Jie Li: Yeah, there's also an interesting feature. We've estimated the total
cost of the computation during run time, and this specific computation
takes 121 hours. The cost is currently based on the CPU hours only; it
doesn't take into account network bandwidth usage and transactions, because
those are an almost negligible part of the cost. CPU is the main component.
>>: Have you done the due diligence to compare with the billing? And is
it sufficiently close that it's round-off [inaudible]?
>>: So this was paid on Mario's credit card, which is why he was so concerned
in the last panel session about costs.
>>: No it's my credit card.
>>: Oh, your credit card.
>>: That's my AmEx card.
>>: So related to this, here you're counting the CPU kind of time. You cache all
the original FTP files and the reprojected images in your blob storage. So how
big is that amount of data, in terms of gigabytes, that you continue to store even
when nothing is running in this scenario?
>> Jie Li: So in total, for our use case, ten years of data covering the U.S.
continent is five terabytes of source data from the FTP sites and two
terabytes of reprojected data. So that's seven terabytes in total currently stored
in blob storage, which costs around a thousand dollars per month. And
actually, our estimation is that for this use case there
will be a time span of around six months, so we've estimated that the
storage cost will be around $6,000. And the computation cost is
around the same. Yeah. But why do we choose data caching as opposed to
recomputing or redownloading the data? Because network bandwidth also costs
money, and since reuse is really common in our computation pattern and the
reprojection process is extremely compute intensive, it's not cost effective to
reproduce the data on the fly every time.
>>: So the current system, for example, does not look at removing
files from Azure storage that have not been used for a long time and then
opportunistically re-fetching them, like in three months when you need them
again, or --
>> Jie Li: We think, given the project time span, right, it's six months. So we
think it's not -- and our scientists are frequently using almost the whole scope
of the data now. So it's not, yeah, very economic.
>>: At this point, I need to move on to the next.
>> Jie Li: Thank you.
>>: We can take some questions offline.
>> Jie Li: Yeah, sure, we can talk.
>>: So the last speaker, and we'll be finishing up a little bit early because there's
a short break we can have before the final round. So Wen-Chih Peng from
National Chiao Tung University will be giving the talk on monitoring and mining
sensor data in cloud computing environments. And it's all there.
>> Wen-Chih Peng: Okay, yes. My name is Wen-Chih Peng, and today I'll
present our work, monitoring and mining sensor data in cloud computing
environments. This is joint work with my colleague, Professor
Yu-Chee Tseng, and we're from National Chiao Tung University, Taiwan.
Here is my outline. First of all, I will tell you how we developed our projects; I
will tell you about two platforms, the Tai-Chi platform and the cloud platform. And
then I will tell you why we chose to put these applications into cloud computing,
our implementations, and some observations and issues that arose
through our implementations. And then finally, I will conclude the talk.
So from the keynote and from the panel discussions, we find there are lots of
sensors available. But we don't have a lot of money to buy those expensive
sensors. We are focused on sensors like this one, much smaller, which can be
deployed to monitor environments and traffic, okay. So we can see sensors can
collect large amounts of data.
What we are doing is we use these sensors to sense the physical world and collect
the data into our servers. Remember that I say servers, because in our
[indiscernible] work, we only collected this data into one single server, but we try
to use [indiscernible] servers in [indiscernible] cloud computing and try to monitor
and mine knowledge from this sensor data.
Okay. So in our school, we have one project called ABC. It means
always-best-connected cloud service and access platforms. There are
some sub-projects. One sub-project, of course, is the service
platforms. These service platforms can be multiple cloud computing platforms.
For example, we use the open source Hadoop version. Of course, we
have Microsoft solutions, provided by Microsoft Research. And in that
project, we study the cloud devices, and some professors study the
low-power issues and connectivity issues of the cloud devices.
And as you can see in these figures, of course, we have multiple wireless access
interfaces. So some professors study how to provide the best-connected
wireless communications between the mobile clients and the service platforms. And,
of course, we developed some wireless sensing applications, and I
will tell you about two of them in the next slides.
So here are some wireless sensing applications we developed. One is what
we call carbon dioxide monitoring around
National Chiao Tung University; these are the lines, small lines, and the lines will
track your breathing getting heavier and change. And then I will tell you about
Tai-Chi. Tai-Chi is the slow-motion Chinese kung fu, and it will keep you healthy
and keep your mind at peace. So if you go to Taipei, you can find
all the people in the mornings playing Tai-Chi in the park, okay.
But how about young people, young people like this one? He will use the
mouse and the keyboard to click and play a virtual game, okay. So we try to
combine the physical world and the cyber world into one platform, what we call a
cyber-physical platform. In this example, you can see there is a
Tai-Chi master and some students; they can be around the world, and they
carry body sensor networks, and their physical information will be shown on
websites, for example, Facebook.
Okay. So this architecture shows you the Tai-Chi platform. As I told you,
we have the body sensor networks. In total, there are nine sensors around
the body, and they try to detect the movements of the user. Through the
sink, the sinks will collect the sensor data and transfer it to what
we call the Tai-Chi engines here. And the Tai-Chi engines will map the motions of
the user, and the user will see the other users via the Facebook client applications.
So here are the detailed components of each [indiscernible] we developed. I
want to skip this slide due to the time limits. Okay. These are the sensor
nodes. We use this sensor, and we also have a sink, and the sink has a
wireless communications interface. Maybe I should give you the video clip to
show you how Tai-Chi works.
>>: Three users are at different locations.
>>: I want to play Tai-Chi. What are Ju and Hong doing now? Ju is in China
and Hong is in America. Let me check whether they are online on Facebook.
>>: I have not exercised for a long time. I want to play Tai-Chi. Let me check
whether I can take a Tai-Chi course on Facebook.
>>: I'm bored now without anything to do. What are they doing now? Let me
invite them to play Tai-Chi on Facebook. They are online. User C creates a
Tai-Chi room on Facebook.
>> Wen-Chih Peng: [indiscernible]. It's a bit slow.
>>: User A joins this room. User B also joins this room. Three users do Tai-Chi
exercise at different locations.
>> Wen-Chih Peng: There will be one master, and all the students will follow the
master. They can find this on Facebook.
>>: They can share their emotions with each other anywhere.
>>: Let me check whether they are online on Facebook.
>> Wen-Chih Peng: So there are only nine sensors, and [indiscernible]. If you
like, you can download the application and run it on your
Facebook. So this is the Tai-Chi platform. And we are now [indiscernible] our
work. We are thinking about more interesting topics to let more people join the
Tai-Chi communities. For example, we try to measure the similarities of different
users. This one is four sensors for one person. And the other persons will also
have these features, and we want to formulate a similarity measurement
between users from their sensor readings.
And if you have the similarity measurement, then you can do a lot of
recommendation. For example, we can recommend that the user
should follow one master whose Tai-Chi behaviors are very similar.
Another way, we can recommend that you should join one Tai-Chi
community which favors your style, okay. So this is our future work.
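The talk leaves the similarity measure as future work; one plausible choice is cosine similarity over concatenated per-sensor feature vectors, sketched here with hypothetical inputs:

```python
# Compare two users by concatenating each user's per-sensor feature vectors
# (in a fixed sensor order) and taking the cosine of the angle between them.
import numpy as np

def user_similarity(readings_a, readings_b):
    # readings_*: dict mapping sensor id -> 1-D feature vector for that sensor
    a = np.concatenate([readings_a[s] for s in sorted(readings_a)])
    b = np.concatenate([readings_b[s] for s in sorted(readings_b)])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A student could then be matched to the master with the highest similarity.
```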
Also, we want to use cloud computing for Tai-Chi. As you can see, the
renderings of these motions are similar, so we try to use cloud
computing to speed up the renderings.
Another issue is that the similarity computations amongst users are very
computation intensive, so we want to use cloud computing to help us
achieve these goals. But this is our future work.
Now, I will tell you about another wireless sensor network application, an
environmental monitoring service over cloud computing. As you can see in these
figures, we have some vehicles with carbon dioxide sensors and other sensors.
These cars have 3G wireless networks, and they can upload, or what we call
publish, these sensor readings to the servers. And users at desktops or other
vehicular users can subscribe to the service, and they can see, for example in
this one, the carbon dioxide map around National Chiao Tung
University.
So we proposed this concept, and we implemented what we call the CarWeb
platform. In the CarWeb platform, there are some vehicles, and they can use their
smart phones or GPS loggers to log the trajectories of the user and upload them to
our server. The server will then have the data points and road segment points,
and then we can use this sensor data to estimate or monitor the traffic status
along the roads.
We have implemented CarWeb for about two years, okay. This is the first version,
which used Windows Mobile. This is a very ugly smart phone, and
this is Windows Mobile 6. We also implemented an Android version, like this one.
This is the client version, and you can download the
application and run it on your smart phones. As you can
see in the two figures, this one shows you your location, and the user can upload
their speed readings to the server. And this one shows you the
traffic status around the nearby road segments, okay. The red line
means a traffic jam is happening.
Okay. Then think about having the GPS data points and trajectories of
users, including historical GPS data points and realtime GPS data points. The
problem we want to deal with is estimating the traffic along the roads,
okay.
So here is the problem formulation, and before the formulation,
we give you some assumptions. The first assumption is that we have a
road network like this one. In our CarWeb platform, we have the road network
of Taiwan, and also we have the GPS data, like this one. In this one, you can
see one car with its location, speed and upload time.
The problem is like this: the input is the traffic database and a
query. The query is one road segment with a time, and the output is the
speed on the queried road segment. For example, in this one, a user wants to
know the traffic status of road segment E at time T4. This is our
output, 50. 50 kilometers per hour, not miles.
And what's the challenge behind our problem? The most challenging point is
that little realtime traffic information is available, because users do not always
want to share their GPS data points. So our prior work proposed a spatiotemporal
weighted approach to estimate the traffic; the papers were published at SSTD 2009
and MDM 2009.
We observed two important factors. One factor is the temporal factor. This
means the traffic always has [indiscernible].
The other factor is the spatial factor. This means that if the sensors are nearby,
their readings are almost the same. For example, you can see these two sensors,
and their speed readings are almost the same, similar. And compared to this
sensor, their readings, their colors, are very different.
So if you have the temporal factor and the spatial factor, we can retrieve more
GPS data points from the database and use these GPS data points
to estimate the traffic.
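A sketch of that spatiotemporal weighting idea, not the exact published algorithm: readings close to the query in time and space contribute more to the estimate, and the decay constants here are illustrative:

```python
# Estimate the speed at a query time and location as a weighted average of
# GPS readings, with weights decaying exponentially in time gap and distance.
import math

def estimate_speed(query_time, query_loc, readings,
                   time_scale=600.0, dist_scale=500.0):
    # readings: list of (timestamp_sec, (x_m, y_m), speed_kmh)
    num = den = 0.0
    for t, loc, speed in readings:
        w = (math.exp(-abs(query_time - t) / time_scale) *
             math.exp(-math.dist(query_loc, loc) / dist_scale))
        num += w * speed
        den += w
    return num / den if den else None  # None when no usable readings exist
```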
I will also give you one short demo, a video demo.
>>: After logging into the CarWeb website, users can browse and manage their
own trajectories collected by the CarWeb client on their smart phone or PDA. Users
may click on any of them to show the whole trip on a map. A query can ask about
a specific GPS point, and the corresponding information will be provided. We can
also choose road segments from a map, and our system will report estimated
driving velocity. This is the entry point of our CarWeb client. Users may first log
into the server, and the map will show up centered at the current location based
on the GPS signal. By default, the CarWeb client will continuously track the user's
trajectory, one point every five seconds, and upload trajectories to the server
periodically.
Apart from the data collection, we also provide a traffic status estimation service
based on the collected history data. When the system uploads a user trajectory, it
actually invokes the query service simultaneously. Later on, according to the
query results, it draws different colors on nearby road segments, indicating
different levels of expected driving velocity. You may have noticed from this
demo that the estimation query result doesn't show up instantly as we drive. The
reason is our server calculation is not fast enough.
>> Wen-Chih Peng: So my student told you the problem: you can drive the
car, but the query results for the road segments are delayed, okay?
So our problem is: given a circular query range, how to efficiently
estimate the traffic status of all road segments within the specified range. This
is computation intensive, okay?
So we tried to use cloud computing to solve the problem. Similar to the
Google or Microsoft map services, the whole space is divided into several grids.
For each grid, we use MapReduce to estimate -- for each grid, we use
one virtual machine and try to find the traffic status in the grid.
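In MapReduce terms, the partitioning looks roughly like this sketch: the map phase keys each GPS point by its grid cell, and each reducer (one virtual machine per grid in the talk) estimates traffic for its cell. The cell size and the per-cell average are illustrative:

```python
# Map: emit (cell, point) pairs by bucketing coordinates into grid cells.
# Reduce: aggregate the points that landed in each cell.
from collections import defaultdict

CELL = 1000.0  # grid cell size in meters (illustrative)

def map_phase(points):
    # points: iterable of (x_m, y_m, speed_kmh)
    for x, y, speed in points:
        yield (int(x // CELL), int(y // CELL)), (x, y, speed)

def reduce_phase(pairs):
    cells = defaultdict(list)
    for key, point in pairs:
        cells[key].append(point)
    # One reducer per cell; here we just average the speeds in the cell.
    return {key: sum(p[2] for p in pts) / len(pts)
            for key, pts in cells.items()}
```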
This was our first implementation: we used HDFS and
MapReduce with ten virtual machines, and the result was very bad, because it
needed 20 minutes and 11 seconds. So I wondered why, and it's because there is
no index structure in HDFS. So another approach is that we
construct index structures for the GPS data points and road segments, and also
we increase the number of virtual machines.
Okay. So we evaluated three measures. One measure is that we use HBase; HBase
is the open source database on the Hadoop platform. So this is one. Another
approach is that we use HBase plus five virtual machines. And finally, we do not
use HBase; we only use HDFS, but with some grid index data structures, also with
five virtual machines.
And this figure shows you that HBase is not good; you can see that. And if
you use HDFS plus five virtual machines, the response times, the execution
times, are much shorter. And this also shows you the same, similar result. And in
this figure, we show you that with different numbers of virtual machines,
if you have more virtual machines, of course, the execution times will be smaller.
Also, the different colors show you the different query ranges. If you have a
larger query range, the execution times, of course, will be larger.
So here are some possible issues we found. When you use MapReduce, you need
to divide the file into multiple files, but the problem is that the trajectory data
points are not uniformly [indiscernible] distributed in the [indiscernible] space, so
we think we should devise good measures to divide the data file into smaller data
files according to the data distribution.
Okay. So if you use MapReduce, there are nine maps here, okay. But
sometimes you will need to wait for those virtual machines with heavy load.
Another possible issue we found is that index structures should be developed,
okay. As we know, there are R-trees in traditional databases. But
how about R-trees in cloud computing environments?
And you heard in the keynote about handling realtime sensor data. This is
why we also think of this as a challenging problem.
So to conclude this talk: in our projects, we propose a new
paradigm for cloud computing, what we call the sensor cloud, and we use the
sensors to collect information about physical things and put all the sensor data
into cloud computing environments.
And we have some preliminary implementations for these wireless sensor networks,
and we found some possible issues: for example, how to develop efficient
data storage and retrieval measures, how to propose a partition
scheme for MapReduce, and also how to deal with realtime sensor data.
Of course, after this cloud futures workshop, I will let my students try Windows
Azure. Yeah.
>>: Thank you. I actually want to comment. I really like that, the use of the
sensors, and I thought of taking Tai-Chi with Facebook. Questions?
>>: I'm sure it was in the talk and I just missed it. I apologize. How big was the
data for the second half of the talk, the spatial data?
>> Wen-Chih Peng: GPS data?
>>: Yes. How big?
>> Wen-Chih Peng: Five gigabytes, yeah.
>>: So did you try just [indiscernible]?
>> Wen-Chih Peng: Yeah, but the same problem: if you use a conventional
database, the response times are very slow, yeah.
>>: Not anywhere near 12 minutes, I'm sure?
>> Wen-Chih Peng: No, yeah.
>>: So you tested and it came out?
>> Wen-Chih Peng: Yeah. In the video clip, you
can see that. That used the traditional MySQL; we used the MySQL
database.
>>: Did you try a real database?
[laughter]
>>: A Microsoft database.
>>: A Microsoft database is not MySQL.
>> Wen-Chih Peng: Maybe I can try it.
>>: Five-gigabyte scale, using MapReduce --
>> Wen-Chih Peng: I think the computation is very intensive. For the renderings
of the map service on Google Maps or Microsoft Maps, you can find they use
cloud computing. But our problem is more difficult, because you have the
historical data points, and we need to estimate the traffic of road segments
within the query range. So the challenge is the computation.
>>: [inaudible].
>>: Do I have time for one? There are lots of people in the spatial database
community that keep coming out with new algorithms to speed up these nearest
neighbor queries and so on and so forth.
>> Wen-Chih Peng: Yeah.
>>: So do you believe that this will kind of disappear when you start running on
the [indiscernible], so we won't need to have so many very, very small
optimizations of algorithms and performance once you start running on the
cloud?
>> Wen-Chih Peng: I just want to say that, of course, there is a lot of research
work on traditional spatial queries in traditional databases, but
the cost model of cloud platforms is different from
traditional databases. So these traditional spatial queries should be redesigned,
yeah. So this is just a first try, and we will continue to work on it, and I hope
that a cloud DB will be developed. Yeah. Just a try. Yeah.
>>: Thank you very much.
>> Wen-Chih Peng: Okay.