>> Dennis Gannon: So we actually heard that the book seller across the lake
from us actually knew a thing or two about cloud computing, so -- no, really,
seriously, we all owe Amazon a great debt of gratitude for actually
pioneering cloud computing for the last two or three years. And we have much to
learn from them.
And I'm pleased to introduce Deepak Singh who will be telling us about Amazon
and what they're doing in research and science applications.
>> Deepak Singh: Thank you. Can everybody hear me? Sounds like it. So
actually Ed just gave my talk, so you can all leave and get another one and a half
hours of break now. So, yeah, thank you, Roger, for asking me to speak.
Usually I speak really quick, really fast. But we have a little more time than I'm
used to, so I think we're going to relax a little bit, go slower.
During the course of the presentation, if there's any questions or clarifications,
please do ask. I like it being interactive. So just holler.
Some quick history. So I've been at Amazon now for -- well, it will be two
years in June -- and all at Amazon Web Services. By day I manage EC2 from
the business development side; that's what I do. But I spent my
entire life before that in the life science industry.
Mostly my background historically is in large scale molecular simulations; I
started off as a quantum chemist. But before I came to Amazon I spent some
time at another company out here in Seattle called Rosetta Biosoftware, which is
now part of Microsoft, working as a strategist. So I learned more about genomic
large scale data management, gene expression, genetics and genotyping.
So because of that, my talk tends to be very life science biased, so
I apologize to every other field of science for that. But that's the nature of the
game.
So let's get started. And as Ed laid it out quite nicely, we are here to
talk about data. Not my favorite android, but the kind of data one gets from large
projects, and historically has gotten from large space projects. We talked about how
astronomy has been instrumental not just in creating a lot of data but in
developing the science around it; they do not necessarily do a good job of
managing it, but they analyze a lot of it.
The high energy physics community, on the other hand, has had to do a lot of
work in trying to figure out how to manage very, very large scale data projects
across not just labs but across continents, because the LHC is not just a European
project; there are people from all over the world trying to work on it, even when
baguettes bring it down.
But near and dear to my heart are instruments like these. This is an older Illumina
genome analyzer. The root password and login information are on there. So if you
wanted to hack into the Sanger Centre and start doing some sequencing, you
could probably do that.
But the difference these things make compared to the space and LHC type
projects that we just talked about is the sheer throughput at which they generate
data. Space projects and high energy physics projects take 10 years. People collect
data for a long period of time and analyze data for a long period of time. This
instrument is a few years old, and it's obsolete.
The amount -- the data volume that's being pushed by these, and the changes in
the kind of data and just the technology, is pretty intense. And this is happening
in the life sciences and other fields as well, pretty much as we speak.
Another area that's interesting, and we've heard quite a bit about this, is
sensor based projects, things like the Ocean Observatories Initiative, where
sensors and ambient systems all over the world collect data which people are
either analyzing after the fact or in realtime, depending on their needs. I forget -- I
was talking to somebody from Newcastle about ambient neurological monitoring
projects where they're trying to analyze data from people's homes and trying to
predict whether they're having some kind of neurological symptoms. And of course
computers themselves generate a lot of data.
You could be a Web company; just clickstream data, as probably people who
worked on the Live side of Microsoft know, generates terabytes and terabytes of
data. There are people collecting data and trying to analyze it. People taking data
from these instruments and creating second rounds of data, which can often
exceed the initial data.
So it's a very interesting time. It's a time where a single genome makes no
sense. There's no such thing as the human genome anymore. Instead what we
have is everyone's genome. We have genomes of cancers, we have the genome of
every individual -- we could have genomes of everybody in this building,
everybody on this campus. That's quite a lot of people. So what does that
mean?
Let's leave the space and LHC types aside for a second. Scientists who have
typically been used to working in the gigabytes, the kinds of people who used to
work with Excel and with joins across spreadsheets, are kind of getting into
this whole terabyte scale thing. And they're just starting to get used to it, but it
might be too late. By the time they get used to terabytes, the petabyte might be
the new terabyte.
And that's kind of the joke we have at Amazon: the petabyte is the new
terabyte, you know. That's what we see. And if you believe some people, we're
actually going to be at the exascale pretty soon. I think it's going to take most
scientific projects, except the really, really large ones, a little
bit of time to get to the exascale, but I've been wrong before.
And the big difference, as I just sort of alluded to, is we're collecting this data
really fast. It's not being collected over five years, it's being collected over weeks,
and you have to analyze it over weeks. And the cost has gone
down enough that it's no longer confined to the big centers. This is a picture of the
Broad Institute in Cambridge, Massachusetts, which is a big genome center. They
have 100 Illumina sequencers. But as Ed again said, you have folks like Ginger
Armbrust who can get a sequencer, collect a lot of data, generate a lot of data, and
have to figure out how to manage it.
So I completely agree. I did not put this in after Ed started speaking; I had this in
earlier. It's actually a great book, if anybody hasn't read it; it's available
somewhere on the website. It's about how scientific discovery is changing. I come
from a simulation background where data sizes were in the kilobytes, trying to
generate molecular trajectories and rotation functions and [inaudible] energy
surfaces.
But more and more, science is getting into a state where you're collecting data and
you have to try to analyze it to figure out what's happening, which means that
scientists are going through a lot of change. And it's a very rapid change. And it's a
change in scale. It's a scale that a lot of scientists, especially the kinds that I've
historically worked with, aren't used to dealing with, which means they have to
completely rethink the way they handle their daily chores. They have to start
thinking about things like data management.
I like saying that data management is not data storage. I don't know how often
I've heard from people, oh, I can go to Fry's and buy a terabyte -- you know,
multiple terabytes of hard disk. Yeah, but that doesn't quite work. You have to try to
make this data available to other people, share it with other people, go back to it
and try and analyze it.
They have to rethink how they process that data. A lot of the tightly coupled
codes written in Fortran -- I've written a few of those myself, and they are a piece
of junk -- don't quite work. And you have to start thinking about new ways to scale
out your data.
You have to realize that at scale a lot of the assumptions you make about
infrastructure change. One of the last projects I was involved with when I
was at Rosetta was a project with the FDA where we didn't have that much data,
just short of a terabyte. There were ten institutions -- companies, academic groups,
the FDA -- involved. We were collecting this data and
everybody had a different slice of the pie that they were trying to solve.
But the project took a while to start because we had to ship the data from each
group to each group on a disk, and it took about 10 weeks to get it FedExed to
everybody and make sure everyone had it.
That's not how you can share data anymore. In the past, if you went to a data
repository like NCBI, you could just FTP it down, you could bring your data to
your desktop, to your workstation, and analyze it. But when the 1000 Genomes
Project is multi-terabytes and you want it on your desktop, one, you have to find a
hard drive big enough, which is not easily doable. You have to go to a shared system
or something like that. Or you have to start thinking about things like the cloud,
where you get a common area where you can start sharing that data.
And it's a constant problem. And data volumes are changing rapidly enough that
a lot of scientists haven't had time to figure out how to build these systems, how
to think about it. And what they end up doing is getting in situations like this one.
I don't know which lab this is. I won't call out names. This is about 188 terabytes
of sequence data on somebody's lab bench. You can't really process it, because
it's not connected to anything. You can't really share it because you need all that
data, and trying to move it around and shipping it to your collaborator is not going
to be trivial.
And what happens if one of those drives fails? It's on a lab bench. There's water
around, solvents. What happens if something falls on it? You're in deep trouble.
This is a picture I borrowed from Chris Dagdigian of the BioTeam. And he talks
about the fact that 2009 was the first time that, A, he mounted a single
petabyte volume on to a compute cluster. It was also the first time that he saw
somebody being fired for a petabyte scale data loss.
So science is getting compute and storage limited today. I know a lot of projects
which don't start because they don't have the storage or they don't have the
compute, or they don't have easily accessible storage or compute, and they have
to scrounge around, try and find resources, go to their friends at a supercomputer
center, which most of us have done at some point in time, and ask for time.
But as you need more and more resources, getting a friend to be
nice to you becomes a little more difficult.
So I started thinking about some of these problems a few years ago, and that's
kind of how I ended up at Amazon. So for the next -- we'll see -- few minutes, I'd
like to spend some time talking about Amazon Web Services.
We hear various definitions of the cloud. Different people have different opinions
of the cloud. For the purposes of this conversation, and for what AWS does, we
are infrastructure as a service. Essentially what we provide is a toolkit, a toolkit
that allows people to do stuff, for what it's worth.
It uses some basic building blocks. So how many people here have used any of
AWS's services? Fair number. Bill you're not allowed to raise your hand. I know
you have. So EC2 is where I spend most of my time.
For those who don't know, EC2 is essentially getting a virtualized server in an
Amazon data center -- or getting 10, or getting a hundred, or getting 200. With
simple Web services APIs you can provision lots of server
compute resources. You can run Linux, you can run Windows, you can put your
own apps on it. You essentially have control; once you get the server, you
control it. And people can do all kinds of things with it.
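To make "provisioning servers with simple Web services APIs" concrete, here is a minimal sketch using the Python boto3 SDK, which postdates this talk; the AMI ID, key pair name and instance type are placeholders rather than anything referenced in the talk.

```python
# A hedged sketch of launching and later terminating EC2 servers with boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Ask for 10 identical virtual servers in one API call.
response = ec2.run_instances(
    ImageId="ami-12345678",     # hypothetical machine image
    InstanceType="t3.micro",    # placeholder instance type
    KeyName="my-keypair",       # hypothetical SSH key pair
    MinCount=10,
    MaxCount=10,
)

instance_ids = [i["InstanceId"] for i in response["Instances"]]
print("Launched:", instance_ids)

# When the work is done, release the capacity and stop paying for it.
ec2.terminate_instances(InstanceIds=instance_ids)
```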
Historically, most of our customers came, because of Amazon's history, from
the Web community: people doing a lot of e-commerce, people building new
software-as-a-service stacks, and of course it evolved from there to people running
SharePoint on AWS.
On top of that, over time we've added things like load balancers and the ability to
scale based on some metrics, based on some triggers. The other key component
to what I call core infrastructure is storage. And the first service that
we launched at AWS was Amazon S3. It's the service that a lot of
people tend to get involved with first.
If anybody uses Twitter, that's where your thumbnail is, just as an
example. S3 is a massively distributed object store. It's not a file system. Every
piece of data you push to it gets replicated across multiple data centers. The
idea and design point is that it should be highly available and highly durable, and
that you should be able to scale up as your needs grow. And I'll get back to that
a bit later.
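As a rough illustration, and assuming hypothetical bucket and key names, this is roughly what storing and retrieving an object in S3 looks like from Python with boto3; durability and replication are handled by the service, not by the caller.

```python
# A minimal sketch of using S3 as an object store.
import boto3

s3 = boto3.client("s3")

# Upload a local result file as an object under a bucket/key pair.
s3.upload_file("results.csv", "my-lab-bucket", "run-42/results.csv")

# Later (or from a collaborator's account, given permission), read it back.
obj = s3.get_object(Bucket="my-lab-bucket", Key="run-42/results.csv")
data = obj["Body"].read()
print(len(data), "bytes retrieved")
```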
One thing that we've added to it, and I'll speak to it again later, is the ability to
import and export data from S3 using physical disks, using FedEx, because the
network's not always the easiest way of getting data in and out, not for everyone.
And at AWS we also consider databases a core piece of infrastructure, because
the belief we have is that you shouldn't need to manage your own database,
whether it be a relational database. So we have a service called RDS, which is
essentially managed MySQL. What you do as a user is just provision a
MySQL instance. It's MySQL as you know it.
But you don't have to be the DBA. We've got people running it in the back doing
that for you. You can scale up and down. You can scale up your server with an
API call. You can add new resources, you can grow your disk. Again, it's all
APIs, or going through a simple interface that does it. You don't have to worry
about that. We try and work on the optimization and the tuning. It's still a
relatively young service, so it's going to evolve.
One of the things that we announced when we launched RDS was the fact that
most people don't know how to run MySQL in a high availability
configuration, so one of the things we're working on is the ability for you to fail over
into a different data center without actually having to go through the pains of doing
it yourself. We'll do it automatically for you.
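A hedged sketch of what "provision a MySQL instance with an API call" can look like with boto3; the identifier, credentials and instance class are placeholders, and MultiAZ stands in for the automatic failover just described.

```python
# A minimal sketch of provisioning and later resizing a managed MySQL instance.
import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="lab-mysql",       # hypothetical name
    Engine="mysql",
    DBInstanceClass="db.t3.micro",          # placeholder instance class
    AllocatedStorage=20,                    # GB
    MasterUsername="labadmin",
    MasterUserPassword="change-me-please",  # placeholder credential
    MultiAZ=True,                           # automatic failover to another AZ
)

# Grow the disk later without being the DBA yourself.
rds.modify_db_instance(DBInstanceIdentifier="lab-mysql", AllocatedStorage=100)
```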
And SimpleDB is a simple -- it's a key value column store, is one way of looking at
it. If I could use a buzz word, it's a NoSQL, schemaless database, which a
lot of people use for storing metadata. Netflix, for example, uses it to store
their movie queues.
On top of that, we've added over time a bunch of other things. One of my
favorites is a parallel processing framework which encapsulates Hadoop, called
Amazon Elastic MapReduce. A big chunk of our
users run Hadoop for doing things like log analysis, if nothing else. And they
were struggling at that time, especially when Hadoop was at 0.15, 0.16, just with
keeping things up, keeping things stable. Hadoop's not exactly the most stable
thing in a shared environment. So we decided to build a system that helps them
stay away from managing the Hadoop side and just gives them an API to it.
When S3 launched, a lot of people were obviously storing content in it and
distributing content. I for years have had a video sharing site and a podcast
where all the content goes into S3 and people download it. But obviously there are
folks in other countries and far away, and people were essentially using S3
as a content delivery network.
So we launched CloudFront, which is a CDN that sits on top of S3 and a bunch
of Web servers that deliver the content to various locations.
Messaging is a core part of our distributed infrastructure. If you're building sort of
asynchronous applications and you want good fault tolerance and good, sort of
loosely coupled things, using messaging is usually a good way of doing
it. We started off with SQS, which is actually our oldest service -- I think it launched
before S3 -- and which allows you to just run a simple pull based messaging
infrastructure in the cloud. And just two days ago we announced a service called
Simple Notification Service, which is a realtime Pub/Sub service. It's a sister
service to SQS where essentially you publish to a topic, people can subscribe to it,
and it gets delivered to them over HTTP or over SMTP in realtime. I think at
some point of time we'll add SMS to it as well.
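A minimal sketch of the two messaging primitives just described, using boto3 with hypothetical queue, topic and address names: SQS as a pull based work queue, SNS for publish and subscribe notification.

```python
# SQS for loosely coupled work queues, SNS for pub/sub notifications.
import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

# Pull-based queue: producers enqueue work, workers poll for it.
queue_url = sqs.create_queue(QueueName="analysis-jobs")["QueueUrl"]
sqs.send_message(QueueUrl=queue_url, MessageBody='{"sample": "run-42"}')
msgs = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)

# Pub/sub topic: subscribers get each message pushed to them (HTTP, email, ...).
topic_arn = sns.create_topic(Name="analysis-finished")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="me@example.org")
sns.publish(TopicArn=topic_arn, Message="Run 42 finished")
```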
And of course, being Amazon, we have a good payments platform. And for
those of you who like getting things looked at by humans, we have a service
called Amazon Mechanical Turk. It's actually quite unique. If anybody has
questions, catch me afterwards. I won't spend much time on it here.
And at the end, we've added a bunch of other sort of infrastructure
management services. Amazon CloudWatch is a monitoring service that takes
care of the infrastructure. It gives you a whole bunch of metrics. Based on those
metrics, you can auto scale automatically. The two things that auto
scaling does: one is, let's say you're servicing a lot of load, you can add new
servers just to handle the load and take them away when the load goes down.
The other thing you can do is, let's say you want an
infrastructure of 20 servers across two data centers and you want to keep them
at 20 servers across two data centers. You can use the auto scaling functionality
and these metrics to make sure that's happening. It's a good way to build nice
fault tolerant architectures.
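As a sketch of that "keep 20 servers across two data centers" pattern, assuming a hypothetical machine image and placeholder zone names, an Auto Scaling group with equal minimum, maximum and desired sizes will replace failed instances on its own.

```python
# A hedged sketch of a self-healing fixed-size fleet with Auto Scaling.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-tier",
    ImageId="ami-12345678",        # hypothetical machine image
    InstanceType="t3.micro",       # placeholder instance type
)

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-tier-group",
    LaunchConfigurationName="web-tier",
    MinSize=20,                    # never fewer than 20 servers...
    MaxSize=20,                    # ...and never more: failed ones get replaced
    DesiredCapacity=20,
    AvailabilityZones=["us-east-1a", "us-east-1b"],  # two independent AZs
)
```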
We have a management console. We have toolkits for Eclipse and .NET. And I'll
get back to this later: the ability to deploy virtually isolated networks, what we call
the Virtual Private Cloud.
And with all of that, what you can do is build your applications and services. But
what's really cool about all of this is the fact that you can do it either using a
simple UI, or, the way I like doing it, sitting at a command line and orchestrating
an entire infrastructure. You can manage the servers, you can launch new servers,
you can launch new services, you can launch Elastic Map/Reduce jobs, all from
the command line tools. Or if you're somebody who likes a UI, you can do it that
way. Or if you like mobile devices, you can use your iPhone to manage
infrastructure.
So one of the things that everybody recognizes and associates Amazon Web
Services with is elasticity. As an example, here is one of my favorite use cases.
This is a hedge fund based on Wall Street. This is a typical usage scenario
for them. They run about 3,000 servers at night before the markets open,
running a bunch of risk analysis models, and during the day they go back to
sort of a lower baseline.
On weekends they're back at that lower baseline. But again, for about six to
eight hours every night they spin up. It works very nicely. This is the kind of
usage graph that you'll see across many, many, many customers, especially
people in the data analytics or business intelligence space who have customer
analytics or financial modeling to do.
The other thing that folks associate with AWS is scalability. An example of that on
the storage side is SmugMug. It's a photo sharing site for what I call the prosumer
crowd, because they don't have a free tier. Everything is paid for. And they
support raw formats from SLR cameras and so on and so forth. SmugMug has
been on AWS for a long time. This is an old number, from like a year and a half
ago, when they had over a petabyte of data in S3, and they have grown since then.
Ed's already shown you this one, which is the classic Animoto story, where they
had to go from a baseline of 50 servers to 3,000 servers because they launched a
Facebook app. And that happens a lot. I mean, there's companies -- I know Ed
says don't play FarmVille, but please play FarmVille; that way my revenue targets
get easier to hit, because it runs on AWS.
They would not have been able to scale as fast as they did on a non cloud
infrastructure.
But the thing that people tend to forget, and this is what I like
reminding a lot of scientists about, is that the people building
these cloud infrastructures come from a world where highly available
infrastructure is necessary. And most scientific infrastructures, like this one, are
not highly available.
As Werner Vogels likes to say, everything fails, all the time.
Jeff Dean from Google gave a really nice talk at Cornell last year, and I'll
steal some data from it, where he says things crash. Deal with it. I think
he had a number in there where he said you build an X-number-of-node cluster,
watch it for a day and see servers fall over. It happens.
And when you are at scale, you have to deal with things like that. If you're
running a workstation [inaudible] desk like many of us did in graduate school, or
a few nodes somewhere, you don't have to deal with it that much.
Two to four percent of servers will die annually. It doesn't sound like much, but
when you're crunching on a lot of data or a lot of servers, that becomes very
visible. So you have to prepare for failure. You can't just restart a calculation four
days in when one node suddenly goes away and you have to start from scratch.
I've had customers do that because they were using a very tightly coupled
clustering environment.
One to five percent of disk drives will die every year. One of the things that S3
does -- I'll get back to it actually.
But perhaps the most common error, especially in environments where your
infrastructure is managed by graduate students who aren't really good at
managing infrastructure, is going to be systems
administration errors: somebody doing a bad config, somebody setting up SNMP
wrong, somebody misconfiguring a cluster. And when
availability of infrastructure and reliability of infrastructure become important,
you really have to start thinking about who's keeping it up, how they're keeping it
up, how you're doing it.
So if you want your infrastructure to be scalable and available, you have to
assume hardware and software failure. So you're going to design your apps to
be resilient -- and these are not just the apps that are running on the infrastructure,
but also the apps and software that are managing that infrastructure. You have to
start thinking about things like automation and alarming.
Now, again, in a research environment, in the scientific environment, that's not
usually what most people think about, especially in a day when a small lab is
able to generate a lot of information. At the University of Washington last
summer -- my wife works in the genome sciences
building -- every Sunday for a while, because it was a heatwave, their cooling
system used to go down and you couldn't access the cluster. So for a day or
two, you couldn't do any work.
Maybe it's okay. I don't think it is. So that's the kind of thing that Amazon Web
Services, Microsoft, Google think about; that's kind of the environment we come from.
For example, with EC2, everyone who launches servers in our east region gets
four different availability zones. And an availability zone, for all practical purposes,
is a failure-mode-independent data center. You can pick and choose which one
you want to run in. You could choose to run across all of them, or you could choose
to run in one. You have the option of running across AZs and building failure
tolerance into your applications.
S3 goes one step better. It automatically replicates data across physical data
centers. It's constantly checking for bit rot. The goal is you shouldn't lose your
data. You should have a highly reliable and durable infrastructure.
So as an academic or as a researcher, whether you're in a commercial entity or
in an academic entity, you shouldn't really need to think about: okay,
how am I going to build a really reliable infrastructure? If a data center is a single
point of failure, I need two. Where do I find the second one? How do I get
distributed architectures? It's not something I've done before. Well, somebody
has done it for you. You can think about using it.
The other thing that happens is there's a whole bunch of other services that
we've had to build over the years, either internally or for the people using us, that
help people become scalable and fault tolerant: things like Elastic Load
Balancing, which balances load across data centers. I've already talked about
auto scaling. Elastic IPs are the ability to take an IP address and move it from
server to server, pretty much. The elasticity relies on the fact that you can take
an IP and move it from one server to another server. Elastic Block Store is again
one of my favorite services. You can essentially provision block devices with an
API call. You can resize them. You can move them from one instance to another
instance. SQS and SNS I've talked about. And CloudWatch is our monitoring
service.
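For example, a rough sketch of provisioning a block device with an API call and attaching it to a server, with a placeholder instance ID and volume size:

```python
# A minimal sketch of creating an EBS volume and attaching it to an instance.
import boto3

ec2 = boto3.client("ec2")

vol = ec2.create_volume(Size=500, AvailabilityZone="us-east-1a")  # 500 GB

# Attach it to a running instance as a raw block device; it can later be
# detached and re-attached to a different instance if the first one dies.
ec2.attach_volume(
    VolumeId=vol["VolumeId"],
    InstanceId="i-0123456789abcdef0",   # hypothetical instance ID
    Device="/dev/sdf",
)
```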
And the other part is that it needs to be cost effective. If
James Hamilton were here giving this talk, he would spend half an hour talking
about economies of scale for large scale computing. But I won't talk necessarily
about that.
But the other core part of AWS is pay as you go. It's that you should be paying
for what you're using. As an example, I'll talk about EC2. So EC2 has three
different ways of paying for it. The one that most people are familiar with is the
one on top over there, called on-demand instances. You run a Linux server for
an hour, you pay eight and a half cents. It was easier to do this math
in my head when it was 10 cents an hour.
So essentially what it means is if you're running 10 servers for an hour or one
server for 10 hours, it's going to cost you roughly the same amount of money,
because you're paying for the time the servers are on.
Now, that pricing assumes a certain utilization. There are some people who are
running applications all the time. They want those servers to be available all the
time. And they don't mind absorbing some of the risk. So what they do is they
pay up front. This is a pricing model we introduced about a year ago. They pay
$200 or so up front for a small instance, and then instead of eight and a half cents
an hour they pay an operational cost of three
cents an hour. So that's a model that works really well for people running
databases, highly utilized infrastructures, which is not that many people.
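A quick back of the envelope check, using the numbers quoted above (actual prices vary), shows where the reserved model pays off:

```python
# Break-even arithmetic for reserved versus on-demand pricing as quoted.
upfront = 200.00          # one-time reserved instance fee, USD
on_demand = 0.085         # USD per hour, on-demand
reserved_hourly = 0.03    # USD per hour after paying the upfront fee

# Hours of usage at which the reservation pays for itself:
break_even_hours = upfront / (on_demand - reserved_hourly)
print(round(break_even_hours))   # ~3636 hours, roughly 5 months if always on

# Cost of one year of continuous usage under each model:
hours_per_year = 24 * 365
print(on_demand * hours_per_year)                   # ~$745 on-demand
print(upfront + reserved_hourly * hours_per_year)   # ~$463 reserved
```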
And last Decemberish, we came up with my favorite pricing model, which is spot
instances. At any point in time we carry some overhead; there are pieces of the
infrastructure not being used. So what we created was a market. Essentially, as
a user you can bid a price -- for example, for an eight and a half cent
instance -- for doing some job that you may never do otherwise, because you don't
have the compute time, or something that you don't believe is worth paying eight
and a half cents for. An example I use is in chemistry: there are 2D to 3D
transformations that you have to do for a chemical library. It's just grunt
work. You have to do it. But I want it to happen and not worry about it.
Spot instances are great for things like that. They're actually great for things like
what David Baker has developed with Rosetta@Home, because screen saver
projects assume the screen savers get shut off. So they're meant to be run
piecemeal. So what you can do is say, I won't pay more than three cents an hour
for this server. And underneath that we have a dynamic market that changes based
on supply and demand. So if the price falls below three cents an hour, your
instances start. Let's say the price is at two cents an hour; you essentially pay
two cents an hour. The price goes up to 2.2, you pay 2.2. The caveat is if it
crosses three, what you had bid, your servers get shut down.
So if you are used to checkpointing, if you're used to
assuming that things go away, this is a great infrastructure. So what do people
use this pricing model for? Scientific computing is actually a great use
case. Last week I was talking to somebody at a financial services company who
was running MATLAB workers exclusively on spot instances.
Each job that he has is a few minutes. And if he loses one or two, he can always
restart.
Web crawling is another one that's very common; you're indexing stuff, and you
don't mind losing a bit of it. And of course a lot of these screen saver type projects
work really well.
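A hedged sketch of placing a spot bid with boto3's request_spot_instances call (AWS has since added newer mechanisms, and the image and prices here are placeholders); the point is that the workload has to tolerate instances being reclaimed when the market price crosses the bid.

```python
# A minimal sketch of bidding for interruptible spot capacity.
import boto3

ec2 = boto3.client("ec2")

ec2.request_spot_instances(
    SpotPrice="0.03",                    # never pay more than 3 cents/hour
    InstanceCount=50,
    LaunchSpecification={
        "ImageId": "ami-12345678",       # hypothetical worker image
        "InstanceType": "t3.micro",      # placeholder instance type
    },
)
# If the market price rises above $0.03/hour, these instances are shut down,
# so jobs must checkpoint or be small enough to simply rerun.
```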
So the other core tenet of the infrastructure is security. You know, you have
cameras trained on you. When you are building a large scale infrastructure,
especially coming from a retail background, you have to be very particular about
security. I won't go into the details of our security model. If people want to talk, I'm
happy to do that after the fact.
But one of the things that we do, especially at the instance level, is
move a lot of the software that would normally sit in your
router to just below our virtualization stack. So essentially it
acts as a traffic cop and inspects every packet coming into every instance. And it
decides who gets access to it, what should be done with it, and so on and so
forth.
But people still like more control, so for those people we developed something
called the Virtual Private Cloud which our folks in Virginia who developed it called
a data center on a stick.
Let's say you have a hardware gateway device in your data center. You have a
corresponding gateway inside an Amazon data center. And what you essentially
get is a virtually isolated network, which right now is so isolated you can't
access anything else with it. If you want to access even a file on the Web, you
have to go through the gateway.
So what this means is you can bring your own address range into AWS, you can
use your standard network security policies, and your network engineers and the
guys managing your security love it because they don't need to change anything.
You can create your own subnets, build your own DMZs and so on.
Over time, we'll open up the kimono to other AWS services and allow you
Internet access. But we started off by going the usual default route, which is
basically shut everybody out and assume everybody is malicious. Which works.
So that was the sort of what-is-AWS spiel. Now let's get to the fun part and talk
about why I believe science and Amazon Web Services are a good match -- and
science and large scale distributed cloud infrastructures in general.
We talked about the Sloan Digital Sky Survey in the last talk. There's a project
called Galaxy Zoo. How many people have heard about it? Cool. It's one of the
few talks where more than one hand went up when I asked about it. So Galaxy
Zoo 1 ran on a server at Johns Hopkins. For those who don't know, Galaxy
Zoo is a citizen science project where they take data from the Sloan Digital Sky
Survey and they allow people who are not necessarily astronomers to classify
galaxies based on certain criteria.
They never expected it to be this popular. Now, this could be an urban legend,
but apparently the first time -- the first Galaxy Zoo -- they got more traffic than they
expected and the server at JHU caught fire. Which is a problem. So when Galaxy
Zoo 2 started, they decided to go for a much more scalable infrastructure and
build it on AWS with a nice sort of loosely coupled architecture. They
actually used a tool called vehicle-assembly, which you can get
from GitHub, which was written to deploy genome sequencing algorithms. So the
guy who built this used to work at a genome center, had moved to Oxford, and this
is now running at Oxford in the UK.
These are just some of the early stats that they saw. In the first three days
after Galaxy Zoo 2 was announced they saw close to 4 million classifications, 15
million in the first month or so. And at one point in time they were tracking
how many clicks they got over a hundred hour period, and they had over two and
a half million clicks. So if the first server caught fire, if they were still running on
it, I don't know what would have happened. But this is a fun project, and it
becomes possible. And I've seen other examples of people running large
scale Web infrastructures where demand has exceeded expectations.
One of the things you often see in the protein structure prediction community is
you find prediction servers, but when you go to them and you submit a simple
sequence, you basically have to wait for hours while it sits in a queue. But if you
want people to be able to access things like classifying galaxies, make it
available to more than just the experts, things like Foldit, you need to
start thinking about good Web infrastructures. And I think this is one way of
getting that.
But the reason we are here is we have lots and lots and lots and lots of data.
And lots and lots of data has challenges with scalability and availability, as I hope
both Ed and I have convinced you. But then you have scientists who are data
geeks. Vaughan Bell has written a book called Mind Hacks; if anybody has not
read it, it's a great book. And Duncan Hull, who doesn't quite look like that
anymore, works at the EBI I think these days.
They're basically informaticians, you know; they like mining and analyzing
data. What they don't necessarily like doing is worrying about the infrastructure
that it's running on. So data management is a huge challenge. Data
management -- as Ed said in his last talk, people often deal with data by putting
it in flat files. I remember my first startup: we took data from the PDB. For every
piece of data that came in, we looked at the four letter code of the PDB file, took
the middle two letters, made it a folder, and everything came in underneath that,
so we had hundreds of folders in a hierarchy that went below that.
In those days it was okay because the PDB was small. If you did it today, you'd
have a pretty rough tree to have to traverse for every single job.
And as scientists, what you want to do is answer
questions. Your question might be: tell me everything I know about Shaquille
O'Neal. From the gene expression data, from the proteomics data, from the
GWAS data, you know, from a pathway database. And collect all this
data and be able to analyze it. And for that you need good data management
systems. You need systems that allow people that you collaborate with, or other
scientists, to be able to go in after the fact, maybe even after the primary analysis,
and try and figure out: okay, here's the new data I have, here's all the correlated
information. What's Shaquille O'Neal's biology?
Why is he 7 foot 2 inches tall and why doesn't he rebound well? Maybe not that
question.
And it's not just about flat files; it's not just about databases, either. One of the
things that AWS gives you is choice. You can use a managed database like
SimpleDB or RDS. You could use a massively distributed object store like
S3. Or you could build stuff on your own using EC2 and EBS.
For example, there's a company called Recombinant. This is something that
they built for a big pharma company: a biomarker warehouse using
Oracle. It's about a 10 terabyte warehouse over three years, is what they
[inaudible]. They could have built it themselves. This is, again, as I said, a top 10
pharma company. But they actually found it to be more cost effective, given the
utilization rates, to build it on AWS. And it's a proper biomarker warehouse where
they can go and do all the stuff that you could actually do with a system like
Rosetta Resolver, which came out of Rosetta, where I was before.
It also changes the way you approach data processing. Sure, today on AWS
you can run things like UniCloud. UniCloud is by a company called Univa UD.
It runs on top of Sun Grid Engine, but they provide basically policy
management and user resource management. Sun Grid Engine now natively
supports EC2, so if you don't want the sort of business process management that
Univa provides, you can go straight to Oracle, I guess now, and get Sun Grid
Engine. I don't think they've scaled it yet.
Or we talked about Condor. Cycle Computing is sort of in the middle. They
take technology like Condor, but they abstract it away from the user and
provide a RESTful API. So as an end user you talk to the API rather than to the
Condor pools themselves. And they manage the back end and the load balancing
and the scaling.
So, you know, they have an application called CycleBLAST, for example, which
allows you to do that. But the real fun comes when people start building their own
cool tools. So StarCluster comes out of MIT. It's essentially a cluster
management system used for all kinds of fun projects. Again, the website is
over there if you want to have a look at it. Anyone can download this
[inaudible] and run a StarCluster installation.
The folks at the New York Times -- a bunch of them got together
and set up a non-profit called DocumentCloud. And part of DocumentCloud's goal
is to make the infrastructure that they sort of built for things like the New York
Times, and the things they do within that, available as open source.
This is something called CloudCrowd. CloudCrowd essentially uses SQS
queues as its job state. A lot of schedulers are very heavy -- even Sun Grid Engine
is fairly heavy in how it manages resources. This is a very loosely coupled way of
having one server spin up workers when it needs to and chunk up jobs as required
based on a certain set of metrics. The source data is always in S3. Again, if you
are a Ruby user, that's how you install CloudCrowd. It's just a RubyGem.
RightScale has a very nice, again very nicely loosely coupled, dynamic
clustering infrastructure called RightGrid. Again, it uses SQS, our queuing system,
as the job state. You can essentially set up error queues, audit queues, output
queues, and have a bunch of messages sitting in the input queues, on which you
make decisions.
Part of the decision is that elasticity function up in the corner over there. Your
elasticity function can be the number of jobs. So based on the number of jobs it
will scale up and down, and on how many jobs per server.
The other elasticity function they have is the money you want to spend. And it
basically scales up your cluster based on how much your budget is. It's
pretty cool and it's pretty neat and it's very dynamic. If you want to move
nodes in and out of clusters, you can do that. It's non-trivial with a traditional
cluster setup.
And again, you can do it either through the UI or through the RubyGem. So
there are multiple ways of doing this. And using this kind of stuff, people have
started doing pretty interesting things.
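This is not RightGrid's or CloudCrowd's actual code, just a minimal Python sketch of the loosely coupled pattern they both describe: a queue holds the job state, each worker pulls a message, fetches its input from S3, does the work and writes the result back. Queue and bucket names are hypothetical.

```python
# A minimal worker loop for a queue-driven, loosely coupled cluster.
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
queue_url = sqs.get_queue_url(QueueName="input-jobs")["QueueUrl"]

while True:
    resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])             # e.g. {"bucket": ..., "key": ...}
        obj = s3.get_object(Bucket=job["bucket"], Key=job["key"])
        result = obj["Body"].read().upper()       # stand-in for the real work
        s3.put_object(Bucket=job["bucket"], Key=job["key"] + ".out", Body=result)
        # Delete only after the work succeeded; if this worker dies first,
        # the message reappears and another worker picks it up.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```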
As Ed already talked about, there's John Rehr's work on FEFF. You know, part of
the interest in FEFF was: can we take old Fortran code designed to run on tightly
coupled clusters? I believe when I last talked to John, he said -- or maybe it was
the guy who funded him at the NSF -- that there's a typical loop in it, and if all you
had to do was make that loop embarrassingly parallel, then you'd have solved half
the problem.
And I think they've continued to work on that. The folks from the BioTeam used
the RightGrid architecture -- this is again for a pharma company -- running David
Baker's Rosetta application, which actually put my first
company out of business. And it basically sits on the queue. You're basically
looking at how much data do I have, how many jobs do I want to run. And
based on that, a bunch of JSON messages, you spin up the number of
workers that you need, you pull data from S3. And part of the reason they built
it was they got tired of waiting for IT to provide them the infrastructure that they
needed.
This is a group in a pharma company that's not at headquarters; it's on the other
side of the country. They don't necessarily always get the love that they think they
need. And this is an infrastructure that worked really well. Scientists can run it.
Somebody built it for them. They don't need IT support. It works quite well. IT
found out eventually and they started managing it.
My friend Matt Wood, who was then at the Sanger, one day decided, hey, let me
try assembling a genome on AWS. This is a couple of years ago, before I was
at Amazon, I think. He took about 140 million reads from a 454 sequencer and
wrote two open source applications called mission-control and launch-pad.
Launch-pad, and vehicle-assembly which I already talked about, deployed his
application on AWS. And mission-control is a Ruby on Rails application which
he uses to monitor his jobs. And he was very quickly up and
running, in basically the time it took him to write the code and
deploy it.
It makes it very easy for people who are interested in writing algorithms, writing
deployment systems, writing infrastructure as developers, to experiment, to do
that stuff and then make it available to everybody else. Because,
you know, it's just a Linux server. If you run an Amazon machine image you can
do pretty interesting things.
I talked about CloudCrowd. That was written originally at the New York Times to
take images and transform them into another kind -- you know, transforming
image formats. This is a group at Penn or Penn State, I think Penn, which
basically took Blat, which is a genome alignment sort of algorithm, and these tend
to be somewhat memory heavy. They used CloudCrowd -- and this is how you
actually run it; again, CloudCrowd is a Ruby app -- wrote up a quick script,
pointed it to the input files, and voila, 32 hours and 200 bucks later they were
done. And they've continued to do that. That group actually does a
lot of proteomics on AWS as well. It's basically -- you have all these tools floating
around built by people to do one thing, and somebody else finds them and does
fun stuff with them.
The other interesting area is heavy-ion collisions. This is from a project
called the STAR project, I think, which uses the ion collider at Brookhaven, if I
remember correctly. They had a conference to get to very quickly. They had to
get data for the conference and the internal resources were kind of not available.
And very often I find that the first time an
academic or researcher will use AWS is when they have a conference to go to, a
paper to submit, or the boss is asking for something over the weekend.
They used something called Nimbus, which is a context broker developed at
Argonne National Labs. They have their own sort of cloud-like environment at
Argonne. But one of the things that they designed it for was that using Nimbus you
can easily go on to EC2, because their environment is still limited; if
people want more resources, they can just overflow on to EC2. So they used the
Nimbus environment, provisioned a bunch of nodes and got the simulations
done. I'm not much of a particle physicist, so I'm not 100 percent sure what
exactly they found from it.
Tom Fifield [phonetic], whose name's missing from the slide, at the University of
Melbourne, has built an infrastructure which he plans to share with his collaborators
to do Monte Carlo simulations for the Belle experiment, which is part of the Large
Hadron Collider. It's just a very nice, again sort of loosely coupled, messaging-based
infrastructure for running a bunch of workers to analyze the
data they get from the experiment. The experiment, I don't think, has quite
kicked into full gear. But at some point, hopefully, it will. And there's a bunch of
people collaborating on it. Tom happens to be at Melbourne. And they do that
quite a bit.
But everything I've talked about up to now has still been mostly chunking up jobs
across worker nodes, which is something that we've done for years. Yes, having
dynamic computing as a commodity, having fungible servers as I
like to say, makes it a lot easier and makes it a lot more flexible. But what
happens when you have tons of data? You can't really chunk it up that easily,
you can't move it around servers that easily. Your disk reads and writes are slow
and expensive. Data processing, on the other hand, is fast and cheap.
So one solution to that is to distribute the data and parallelize your reads. And
that's something that Hadoop does really well. For those of you who may not
know what Hadoop is, it's not a cloud environment. You can
run Hadoop anywhere, but it sort of fits the cloud paradigm really well because
it's designed to be fault tolerant, it's designed to run on commodity machines. It
assumes that hardware is going to fail and takes that into account.
So what does it have? It has two components. One is a distributed file system
called HDFS. And the other part is a Map/Reduce implementation. Hadoop was
developed by Doug Cutting as an extension to the Lucene project, after he read
the Map/Reduce paper from Google; after that he was hired by
Yahoo!, where he did a lot of his Map/Reduce work. And now he works at a
company called Cloudera. Ed showed you a picture of Christophe Bisciglia.
Christophe is one of the founders of Cloudera. And they're sort of the Red Hat of
the Hadoop world, so to speak.
So how does Map/Reduce work? Well, it's pretty simple in some ways. As a
developer, you write a map function, which takes a bunch of keys and values and
creates a list of keys and values. And in the reduce
phase, you essentially aggregate them. It's more complicated than it looks over
here, but it works very well, especially for large data sets: for things like
aggregation, for things like analyzing log files, and a bunch of other use cases
that the people who developed and pioneered this, especially in the Web
companies, didn't really think about.
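A minimal sketch of that map, shuffle, reduce idea in plain Python, with no Hadoop involved, just to show the shape of the two functions a developer writes:

```python
# Word count as map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_fn(line):
    # Emit a (word, 1) pair for every word on the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Aggregate all values that share the same key.
    return key, sum(values)

lines = ["the cloud is elastic", "the cloud is cheap"]

# "Map" phase.
pairs = [pair for line in lines for pair in map_fn(line)]

# "Shuffle" phase: group values by key (Hadoop does this for you, at scale).
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# "Reduce" phase.
print([reduce_fn(k, v) for k, v in groups.items()])
# [('the', 2), ('cloud', 2), ('is', 2), ('elastic', 1), ('cheap', 1)]
```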
But you still have to write functional programs. And not everybody in science
especially, or outside science, really knows much about functional programming.
So there are some very nice frameworks that have been built on top of Hadoop.
One of the nice things about being an Apache project is people can do a lot of
interesting things on top of you.
So Cascading is a framework that I really like. It's essentially a dataflow system,
very similar I think to Dryad in some ways, where as a user you define a dataflow,
and Cascading takes on the whole job of writing the Map/Reduce functions
underneath it. You are just writing a dataflow.
Now, Cascading also works with any language on the JVM. So you can write
Cascading in Java, you can write it in JRuby, you can write it in Clojure -- pick a
JVM language, Scala, you know, things like that.
The folks at Yahoo! developed something called Pig, which is almost like a
scripting language. I like to think of it as Perl for Hadoop, because you're
basically writing a script, and that takes care of your Hadoop jobs.
And the folks at Facebook developed something called Hive. A lot of their
product managers and analysts are used to writing SQL queries, so they wrote a
SQL-like language that again sits on top of Hadoop. Facebook has
terabytes and terabytes of data, and they need to be able to do ad hoc
analyses, and Hive's been really key to helping them get there.
So there are all these higher level languages and higher level systems that have
come on top of Hadoop, which has made it a very dynamic and growing system.
And it's one of the biggest use cases on top of AWS, especially in the Web world.
You have folks like Pete Skomoroch, who now works as a research scientist at
LinkedIn. You can actually get all of this: he wrote a sort of
reference architecture for a data mining system, where he took data from
Wikipedia, combined it with Google News data and used Hive to develop an
application which he called Trending Topics, which looks at topics on Wikipedia
over a period of time. He released the data as an Amazon public data set -- I'll
talk about those in a bit as well -- and he's made the whole source available.
This is a very quick idea of the architecture. It uses a Rails-based application. It
uses MySQL. It uses Hive. It uses Hadoop. And it uses our Elastic Block Store,
which is where the data sets reside.
The part that I really like is the work -- and this is, I would say, pioneering
work -- that's been done by Mike Schatz at the University of Maryland. The life
science community historically has an "I can get the biggest box I can get and use
up all the memory" attitude. Most of the programmers are folks like me who know
how to write algorithms but don't know how to write good code. And it can be a
mess. And when things start scaling up, and as the nature of your
sequence data changed and you started getting shorter reads, you had a
problem. So what Mike decided to do a few years ago was try a simple problem.
He figured that these k-mers, these short read fragments, fit the Map/Reduce
paradigm very well. And he decided to see if it worked.
If it didn't work, it was not going to make any sense. So this is a simple
application called CloudBurst that he developed a few years ago and open
sourced. I think that's where it's been published. It essentially does a simple
read alignment against a reference genome. You have the map phase, which is
where you catalog and essentially emit the k-mers onto your reference.
You collect all the seeds, because it uses a seed-and-extend paradigm, at the
shuffle phase, which is actually the magic part of Hadoop, in the middle.
Map-shuffle-reduce is what it should be called. And at the end, in the reduce
phase, it does the alignment. And it worked really well.
I don't have the exact numbers for his scaleup, but on a 96
node EC2 cluster, he got about a 100-times speedup over the traditional
sequential algorithm that's sort of the gold standard in this community.
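Purely as an illustration, and deliberately simplified rather than CloudBurst's actual implementation, here is the seed idea expressed in map, shuffle, reduce terms: the map phase emits k-mer seeds from both the reference and the reads, the shuffle groups identical seeds, and the reduce phase turns shared seeds into candidate alignment positions.

```python
# A toy seed-finding step in map/shuffle/reduce form (not CloudBurst itself).
from collections import defaultdict

K = 4
reference = "ACGTACGTTAGC"
reads = {"read1": "TACGTT", "read2": "GTTAGC"}

def emit_kmers(name, seq):
    # Map: one (k-mer, (source, offset)) pair per position.
    return [(seq[i:i + K], (name, i)) for i in range(len(seq) - K + 1)]

pairs = emit_kmers("ref", reference)
for name, seq in reads.items():
    pairs += emit_kmers(name, seq)

# Shuffle: group all occurrences of the same k-mer.
groups = defaultdict(list)
for kmer, origin in pairs:
    groups[kmer].append(origin)

# Reduce: any k-mer shared by the reference and a read is a seed to extend.
for kmer, origins in groups.items():
    ref_hits = [o for o in origins if o[0] == "ref"]
    read_hits = [o for o in origins if o[0] != "ref"]
    for _, ref_pos in ref_hits:
        for read_name, read_pos in read_hits:
            print(f"{read_name} seeds at reference position {ref_pos - read_pos} via {kmer}")
```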
And this is all just a start. I think where things got really interesting is when he
collaborated with Ben Langmead. Both of them work with
Mihai Pop and Steve Salzberg at Maryland.
Ben's the developer of a program called Bowtie, which is one of the better known
aligners in the genomics space. What they did was they used a feature of Hadoop
called Hadoop streaming, where you don't need to write the functional
program; you can actually stream in existing code. They had to modify it a little bit.
But what they did was, for the map phase, they used Bowtie, which does very
fast alignment. For the middle phase, the shuffle phase, they essentially binned,
partitioned and clustered all the intermediate data. And they took another open
source algorithm, SOAPsnp, that detects mutations -- SNPs -- on those alignments.
And they now have a pipeline as a Map/Reduce function that anyone can get and
run. And this paper -- I forget when they released it in Genome Biology --
was for quite a while the most read paper at Genome Biology.
And actually a lot of interest in cloud computing has come from that, because
their paper was titled something like SNP discovery with cloud computing, because
all of this work was done on EC2. Here are some numbers that they got.
Now, the cost doesn't really mean anything, because there are other costs
associated with it; this is just the raw compute cost. But it was solving a real
problem, essentially because a lot of these aligners were very memory
intensive.
The work that Mike's doing right now is actually probably even more interesting,
because assembly is a significantly tougher and more complex process. It takes
big machines -- there are even codes out there that use close to a terabyte of
memory. And he's trying to implement this in a Hadoop-type environment, a
Map/Reduce environment. There's a paper that they're working on, on assembly of
large genomes in cloud computing.
Bacterial genomes are easy; human genomes are hard. This is still work
ongoing. I won't show the data because of that.
As I alluded to earlier, and I'm showing you the whole stack that we have for AWS,
part of the problem is, A, Hadoop can be a little temperamental. As Mihai has
found out, their Hadoop cluster at Maryland keeps crashing. Michael actually lost
a whole bunch of data because of that recently. And he's within two
weeks of defending, so he's not a happy camper.
So we decided to develop something called Amazon Elastic Map/Reduce.
Elastic Map/Reduce is not just a wrapper around Hadoop. It's actually a
job flow engine where you essentially define a job or a series of jobs and then
you pass the definition over to Elastic Map/Reduce and it takes over from there.
It uses S3 as its source data store, because hopefully you have your data in a
nicely redundant environment, and then it does all the fun stuff in the middle.
It takes care of things like node failure. And until recently
Hadoop didn't have a good debugger, so we actually created a debugger, which
is available as part of Elastic Map/Reduce.
Yesterday we launched something called bootstrap actions, which allows you to
put arbitrary applications and things like that on your Hadoop clusters, because
we don't want people necessarily logging in and doing that; it's designed
to be an abstraction.
And one of the first things we did when we launched Elastic Map/Reduce was
take this CloudBurst algorithm that Mike had developed and put it on there as a
sample application. All a user does -- and literally this is how you use it from
the UI -- is tell Elastic Map/Reduce where your source data is in S3; you point
it to the S3 bucket, you put in a jar file or a streaming script, you tell it where
the output is going to go and what size of cluster you want, and you hit go. And
that's it.
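For the API route rather than the UI, a hedged sketch of the same flow with boto3's run_job_flow call; the bucket names, jar path, arguments, instance types and IAM role names are placeholders.

```python
# A minimal sketch of submitting a job flow to Elastic MapReduce.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="cloudburst-demo",
    LogUri="s3://my-lab-bucket/emr-logs/",            # hypothetical bucket
    ReleaseLabel="emr-5.36.0",                        # cluster software version
    JobFlowRole="EMR_EC2_DefaultRole",                # default EMR IAM roles
    ServiceRole="EMR_DefaultRole",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,                          # size of the cluster
        "KeepJobFlowAliveWhenNoSteps": False,         # shut down when done
    },
    Steps=[{
        "Name": "read-alignment",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "s3://my-lab-bucket/jars/cloudburst.jar",  # hypothetical jar
            "Args": ["s3://my-lab-bucket/input/",
                     "s3://my-lab-bucket/output/"],
        },
    }],
)
```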
So as an end user, if you have a series of applications available to you, you don't
really need to learn how to write Hadoop jobs.
The reason I like Mike's work so much is that it's actually inspired a lot of people to
start thinking about algorithm development, rethinking how their code works, and
then deploying it either as Amazon machine images, as virtual machines, or as
part of Elastic Map/Reduce. And we continue to work very closely with Mike and
Ben, and have funded them through our education program.
The other thing that makes the cloud very interesting -- and this is where, as I said,
you can't just FTP data or send it over to each other -- is data storage
and distribution, both public and private. On the public side we have our
public data sets project, where you can get everything from Jay
Flatley's genome -- he's the CEO of Illumina -- to the first African
genomes that were sequenced, from a particular tribe in Nigeria, to
things like Ensembl over there, which is pretty cool. Data from the Sloan Digital
Sky Survey. Global weather measurements, mapping data. All kinds of fun stuff.
There are other people who are distributing data privately. So say I'm a company;
I have customers who are running jobs on EC2. I want to
make my data available to them inside the same environment. So that's also
happening today.
I alluded to earlier that we have an import/export service. If you don't have a big
bandwidth connection, you put your data on to disks, you
electronically sign a manifest, you submit a job, you ship your disks, they come in
to us -- we're good at moving stuff around and dealing with UPS and FedEx.
The data shows up in your S3 bucket and you get an e-mail saying we're
ready for you. That's really good for people moving data in. On the disaster
recovery side, and this is not a scientific problem necessarily, people use the
export part of it. So they keep a lot of data inside AWS knowing that they can
export it out if they have some kind of DR scenario.
This also enables things like sharing and collaboration. I don't have any
pretty slides to show for it, but I've talked about it -- I mean, I think Ed also talked
about it, he quoted Bill -- the cloud makes a great
collaboration environment. In that project that I did when I was still at Rosetta, we
wouldn't have had to ship the disks to 10 people; we would have just shipped them
into Amazon and all of us would have had access to the same common global
namespace, the same data space, and we could have all worked on that space.
You don't have to have 20 copies.
One of the reasons that we started the public data set project was we found that
different people were all getting the same data set; there were essentially 20
copies of the same publicly available data set inside AWS, which earns money but
is inefficient.
It makes a great software distribution system. An example would be this
Large Hadron Collider project that comes from the Max Planck Institut. It's
the ATLAS project, where Stefan Kluth had started doing some prototyping
work and has had people who wanted to try it go check it out.
A very nice project -- if you can see anything on this slide -- is the Cloud Bio-Linux
project. This comes from the folks at the J. Craig Venter Institute. What they did
was look at the Bio-Linux distribution, which I think used to be Slackware or
something like that, maybe a boot [inaudible] -- a bootable CD with a bunch of
standard bioinformatics analysis tools on it. They essentially created a
cloud-based version of it, which you can get as an Amazon machine image with
all the tools that the average non-expert user would need.
Another really cool project -- and I've been a long-time fan of Galaxy since before
I was at AWS -- is the Galaxy project, which comes from Anton Nekrutenko at
Penn State and James Taylor at Emory. It's been available for a long time; you
could download it and run it, and they now have a Web-based application version
of it. Last year they applied for an AWS grant, and what they ended up
developing -- and it's now publicly available -- is Galaxy on the Cloud, where
anyone can instantiate their own Galaxy instance. Galaxy started with
metagenomics, but now it's a general sequence analysis application with a
standard bunch of tools and all the viewers you need. So it's a one-stop pipeline
for managing and looking at genomic data. And anyone can run it. All they need
is an EC2 account and a credit card.
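Since these community images are just AMIs, launching one is a one-liner once you have an EC2 account. A sketch with boto; the AMI ID, key pair, and instance type are placeholders, not the actual Galaxy or Cloud Bio-Linux image IDs.

    # Sketch: launch a community machine image (e.g. a bioinformatics AMI).
    # The AMI ID and key name are placeholders.
    from boto.ec2.connection import EC2Connection

    conn = EC2Connection()
    reservation = conn.run_instances(
        'ami-00000000',            # hypothetical community AMI ID
        min_count=1, max_count=1,
        instance_type='m1.large',
        key_name='my-keypair')     # an SSH key pair you have registered
    instance = reservation.instances[0]
    print(instance.id)             # wait for it to reach 'running', then ssh in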
The other thing that I think is a lot of fun, and where some of the best innovation
is going on, is application platforms. These aren't necessarily things specifically
to do with science. Heroku, for example, is one of my favorite platforms. It's a
Ruby on Rails platform where they take care of all the scaling and the
management; all you have to do to push code to it, literally, if you're using git, is
do a git push to Heroku, and your application is running on Heroku after that,
pretty much.
An example of that is Chempedia, which comes from Rich Apodaca down in San
Diego. It's sort of a social Wikipedia of chemical entities, and it's running on
Heroku. People can go in and edit stuff if there are mistakes, add metadata to it.
It's a fun project which he did over a weekend. As a developer, all he has to
worry about is his code, not about managing it and deploying it.
The other area which is a lot of fun, actually, is the geospatial space. And I'm
actually really jealous of these guys, because I think the geo guys have decided
that APIs are great, and they create great APIs. So there are companies like
SimpleGeo, which worked with folks like Skyhook Wireless to start mapping out
cell densities, and you'll see, maybe at some kind of event where everybody is
watching the World Series or something like that, a whole big blob wherever the
finals are happening. Not a baseball guy.
But this is South by Southwest, so you can look at the density of, probably,
iPhone usage over there. And that's Manhattan, which is a little more diffuse.
That bright spot is probably where the convention center was at South by
Southwest. And besides SimpleGeo, there are companies like Twilio, which does
telephony apps, and Exari [phonetic], which has stuff going on in the geospace as
well. These are platforms on which other people can build applications. With
SimpleGeo, as a developer I could take the core SimpleGeo APIs, mash them up
with other things, and do a lot of interesting stuff.
And that hasn't started happening in the sciences -- I wish it would -- where
people would take tools like Heroku, take the APIs that AWS or somebody else
provides, and start building these application platforms that other people can
build on top of. Hopefully some day that will start happening.
The other interesting area is business models, where you have a new way of
getting new customers for more traditional companies like Wolfram Research
and the MathWorks. You can run MATLAB or Mathematica on AWS today, with
the grid back ends. It's just another way for customers: they don't have to spend
money on hardware, they spend it on software and whatever it costs them to run
on AWS.
Or there are companies like DNAnexus, which are completely in the cloud.
DNAnexus actually came up at the meeting Roger and I were at last week; they
essentially have a sequence data management and analysis system. They come
out of Stanford -- another Stanford, VC-funded company. But it's a complete
SaaS platform. Historically you would have probably bought this as a [inaudible]
package, and it's happening a lot more these days. I'm probably still going to
finish way ahead of time, so hopefully you'll get more time for Q and A.
So, to conclude, what would I like to say as a summary? I spoke to you a little bit
about AWS, what AWS does, and how it's used. For the scientific folks in this
audience, you know, infrastructure clouds are designed for scale. That's what
they were built for; that's where their origins are. They're also built for availability,
so you can always get to them any time you want to. You don't have to wait in a
queue to get to a supercomputing system. You don't have to wait for servers to
arrive just because you got more data than you were used to or ran out of disk.
You have the ability to have shared dataspaces and global namespaces. As an
example, in S3, if you create a bucket, it's in a global namespace. If you make it
public, anyone in the world can have access to that one global namespace. It's
kind of cool. You can name a bucket "mydata", throw in whatever you want, and
everybody can get at it -- "mydata" is probably taken, but still.
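For example, here is a rough sketch of that with the boto Python library; the bucket name is hypothetical (and, as he says, the good ones are probably taken).

    # Sketch: create a bucket in S3's global namespace and make an object public.
    # The bucket name is hypothetical; bucket names are globally unique.
    from boto.s3.connection import S3Connection
    from boto.s3.key import Key

    conn = S3Connection()
    bucket = conn.create_bucket('my-shared-dataspace-2010')

    key = Key(bucket)
    key.key = 'results/summary.txt'
    key.set_contents_from_string('hello, collaborators')
    key.set_acl('public-read')  # now anyone in the world can fetch this object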
The other part that's interesting -- and I didn't talk about this too much -- is
task-based resources. Typically in a shared system what you have is a shared
file system and a cluster that is shared with about 20 other people, or you're
running 20 different tasks on it. It's still a shared system. With AWS, every EC2
cluster is a discrete entity. You're not sharing resources with anybody else;
you're not competing for resources with a second cluster. You have your own
cluster.
One of the reasons EC2 is very popular for educational courses is that, for
example at MIT, their IT won't let students get root access on their clusters, but
with EC2 they can, so their entire CS 50 course is now taught on EC2; if students
blow something up, it doesn't matter.
So you can set up resources that are for a dedicated task, especially if you have
this shared common dataspace that you can access. You can have people trying
out new software architectures -- Mike's a great example of that. There are other
examples of people doing some very innovative stuff where they're taking
advantage of loosely coupled systems and systems that are massively
distributed.
You can try out new computing platforms. You can deploy Sun Grid Engine, and
I suspect you can deploy things like LSF on AWS as well, which is all well and
good. But you can also try these very dynamic systems like CloudCrowd, or the
STAR system from MIT, or the RightGrid system from RightScale. Or just build
your own. It's not that difficult.
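"Build your own" can be as simple as a queue of tasks that any number of worker instances pull from. A minimal sketch with boto and SQS; the queue name, payloads, and handler are made up for illustration.

    # Sketch: a loosely coupled work queue -- the kind of "build your own"
    # system mentioned above. Queue name and payloads are hypothetical.
    from boto.sqs.connection import SQSConnection
    from boto.sqs.message import Message

    def process(task):
        # placeholder for your actual analysis of one task
        print('working on %s' % task)

    conn = SQSConnection()
    queue = conn.create_queue('analysis-tasks')

    # A submitter pushes task descriptions...
    for chunk in ['chunk-001', 'chunk-002', 'chunk-003']:
        m = Message()
        m.set_body(chunk)
        queue.write(m)

    # ...and any number of worker instances pull and process them.
    msg = queue.read(visibility_timeout=300)
    if msg is not None:
        process(msg.get_body())
        queue.delete_message(msg)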
And the best part is you can do it right now. You don't have to wait around to
start running -- just start hammering at it. You have heard enough examples
today. And one last plug: we have an education program. Basically what we do
is -- I think we have three or four submission deadlines every year -- you submit
a little abstract of what you want to do, and you get credits, compute and storage
credits, to use.
Here are some examples of the kind of stuff that we have funded using that. And
with that, thank you very much. I'd like to definitely thank James, who I always
learn a lot from, and Matt Wood, who used to be at the Sanger and now works for
a company called Mekentosj, which makes a program called Papers -- if you're a
Mac user and you like managing your PDF library or your papers, it's really,
really good. And I love [inaudible] presentation slides. So thank you very much.
[applause].
>> Dennis Gannon: Thank you, Deepak. We have ample time for questions.
You just raise your hand, I'll bring a microphone to you so you can ask our
speaker. Okay.
>>: Hello. Thanks for your talk. [inaudible] from the University of Antwerp. I
have a question on the new -- well, new, the spot market that you introduced.
Currently it's fairly -- let's say not really transparent how this market operates. Is
this completely confidential, or could you hint at how this market forms its prices,
how it works?
>> Deepak Singh: Yes. So it's our own algorithm right now. What we do
provide is a history of pricing. So with the API you can get the latest price or you
can get historical prices. In fact, there's a website called cloudexchange.com
which tracks all our pricing history. Most of the time the prices stay at roughly
about a third of the on-demand price, and every now and then, once in a while, it
will go poof, because we don't have a ceiling on how high you can submit your
bids. It's based on supply and demand. It's still early days. We decided to keep
things simple because we want to see how people use it, what kind of usage
patterns there are. And it will evolve.
So at some point we might talk about how exactly it works. But I think how it
works is going to evolve over time, too, as people use it in different ways. Right
now what we expose is the pricing history.
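For reference, a rough sketch of pulling that pricing history, and of placing a bid, with the boto Python library; the instance type, bid price, and AMI ID here are hypothetical.

    # Sketch: inspect spot price history and submit a bid.
    # Bid price and AMI ID are placeholders.
    from boto.ec2.connection import EC2Connection

    conn = EC2Connection()

    history = conn.get_spot_price_history(instance_type='m1.large',
                                          product_description='Linux/UNIX')
    for point in history[:5]:
        print(point.timestamp, point.price)

    # Bid whatever you are willing to pay; there is no ceiling on bids.
    requests = conn.request_spot_instances(price='0.10',
                                           image_id='ami-00000000',
                                           count=1,
                                           instance_type='m1.large')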
>>: And are you experimenting internally with that? I mean, looking at new
mechanism designs to --
>> Deepak Singh: We continue to look at what else makes sense. So if you
have any ideas on what you would like to see in that kind of dynamic capacity
system, yeah, we're open to ideas. And as I said, one of the things we like doing
is start simple, see what people do, and then try and adjust to that.
>>: That sounds like a good approach. Thanks.
>> Deepak Singh: Yeah.
>>: [inaudible]. So for using EC2 for experimenting -- computer science
experiments rather than the science -- we need [inaudible] like 600, 800, 1,000
instances simultaneously, because we are going to run a cloud computing
platform on top of EC2. Is this feasible, or --
>> Deepak Singh: Yeah, it's definitely feasible. Now, if you're requesting that many
resources, obviously if you suddenly say I want a thousand, you'll probably get a
call from me or somebody else to try to figure out what -- who you are, whether
you're legitimate, whether you're trying to be -- you know, just try and understand
what you're doing.
We have single, individual customers who use much more than that every day.
So it's definitely feasible. We just have to make sure that, you know, it's
legitimate use -- which in your case it would be -- and try to figure out what's the
best way to make that available to you.
>>: But in our case what we do is, of course, experiments [inaudible]; the use is
not long-term [inaudible] our customers --
>> Deepak Singh: That's fine. A lot of our usage -- you know, I won't mention
names, but we have big companies with their own big infrastructures that do a
lot of development work on AWS, right, where they're testing out new algorithms,
and always at scale. So that's a very common use case for us, load testing.
I think one of the classic use cases we've seen is a company called SOASTA
that ran a million-user load test for MySpace, where they spun up 800 nodes for
three or four hours. And they do it once a month for different customers -- 800 is
rare; normally they do three or four hundred. So 800 nodes is, what, 1,600 cores.
So it's definitely doable.
Your default limit with EC2 is 20 instances -- that's where you start. So you have
to submit a request for more, and at that point you'll probably get a call from the
appropriate contact for your location trying to figure out, you know, what the
usage pattern is going to be, because that also helps us manage capacity.
If we know you're going to be on all the time, every day, we can adjust to that. If
we know you're going to come in and out, we can adjust accordingly. Yeah.
>>: Hi. Chris Menzel, Moore Foundation. I'm just wondering if you could talk a
little about the criteria you use for deciding what public data sets you provide.
>> Deepak Singh: So I think the core criterion -- because we are storing them
as snapshots, EBS snapshots, today -- is that it should be something that's useful
to a community. If it's, you know, all the pictures you've taken in your life and
you're the only person who would ever be interested in them, that's not of much
interest to us. But if it's a data set of a reasonable size -- we kind of do it on a
case-by-case basis, and what we ask the person submitting the data set is, who
would use this? Especially if it's something we have not heard of.
Or in some cases we go to the data producer like Ensembl and say, hey, people
are asking for it, can you start moving Ensembl to us.
Our approach has been: we'll see what the usage patterns are, and we might go
back to somebody and say nobody is using it -- that hasn't happened yet. And
again, it's still early days; that will also definitely evolve over time. Right now, as
long as we are sure that you have the rights to put the data up there and you can
convince us that there are people who will actually use it, we are fine with it.
>>: Alex [inaudible], from the Netherlands. Over the past two years we've been
doing experiments with various cloud infrastructures, infrastructure-as-a-service
clouds, trying in particular to run scientific applications, and we've been surprised
by the very poor performance that we've seen there. I have a question that is a
bit longer. First, do you really plan to support scientific applications such as my
colleague suggested before, those 600 or 200 or whatever parallel applications,
in particular? And do you really, as you mentioned on a slide in your
presentation, do no kind of resource sharing, in particular on the network?
>> Deepak Singh: So when I was talking about resource sharing -- just going
back to how EC2 works -- the two things on EC2 that are hard-scheduled are
CPU and memory. Those are hard-scheduled; that's the performance you're
going to get.
I/O as a whole is a shared resource. When I said you're not sharing resources, I
meant you don't have one cluster that everybody is hitting on. If you create a
cluster and you're using the right instance type, you'll pretty much be the only
person on those nodes.
Now, what you don't have control of is node placement, right? And that's a
reason why a lot of traditional scientific codes run into trouble: they expect a
certain rack locality. They make assumptions. If you're running MPI code, they
assume they're talking to the same switch. That's not our kind of environment
today, and that's why you'll see poor performance.
We can give you guidance on it. For example, if you use the large [inaudible]
that we have, you're essentially getting the full network, right, for that box. But
you don't know where the other instance that it's talking to is. If you're lucky, it
could be close by. But the more nodes you provision, the further away -- the
more spread out -- they're likely to be.
>>: [inaudible].
>> Deepak Singh: Yeah. And because our goal is -- availability almost becomes
the first priority, the first thing you design for -- we try to make sure that if a rack
goes away, everything you have doesn't go away. So we actually spread you
out. Which works very well for embarrassingly parallel stuff, because you don't
care. But for tightly coupled jobs, yes. So when anybody comes to me and says
my thing doesn't scale beyond four nodes and I'm running MPICH2 and doing
molecular dynamics, I'm like, yes, not surprising at all.
We're still trying to figure out what we will do for that use case. We don't have a
good answer -- I was talking to you about this earlier. You know, we want to try
and do it in a way that makes sense, if it makes sense at all; we don't even know
that yet. But if you're doing sort of distributed computing -- embarrassingly
parallel simulations, or any computation where I/O is not going to be most of your
workload -- you'll be in good shape. The moment I/O becomes the dominant
thing, if you can't put it into a Map/Reduce-type environment, then you are going
to run into these kinds of performance issues.
>>: So, a related question. On your private cloud, though, aren't you actually
pushing the workloads onto a very isolated piece of hardware, network,
everything?
>> Deepak Singh: It's virtually isolated.
>>: For -- okay. Very good.
>> Deepak Singh: Yeah. We're not giving you a dedicated set of boxes. So if
you shut it down and come back again, you're not going to get the same
machines. Right?
>>: Okay.
>> Dennis Gannon: Other questions, please.
>>: Hi, I'm Lucas Kendall from Czech Technical University in Prague. You
mentioned that there is interaction now with suites like MATLAB. Can you say a
few more words about how that works and who's actually doing the
parallelization? Is it you or the MathWorks?
>> Deepak Singh: MathWorks. The philosophy we take is that MathWorks
knows how their software is built. So if you go to MATLAB and their grid product,
you can choose EC2 as the deployment end point and they will basically launch
their grid back end, a bunch of workers, on EC2. The MPI part, I think, works, but
it's not going to be that performant. It's more the grid back end that they really
encourage people to use and that people are using.
Same with Mathematica. Essentially you can start from your notebook and just
launch a grid back end, and they built it; we didn't even do anything at all. We
just know that they're doing it.
>>: So do you know what sort of license you need from them to do that?
>> Deepak Singh: Yeah. Different companies do it different ways. I think
Mathematica -- Wolfram -- went with a per-use kind of licensing. I'm not that
familiar with people using Mathematica. On the MATLAB side, they still use
good old FlexLM: you need to have a license manager running somewhere else,
and you check out your licenses. They do license borrowing, so it's floating
licenses or whatever the standard FlexLM policies they have. That's not
uncommon, actually.
>> Dennis Gannon: Any additional questions? Okay. Great.
>>: [inaudible] center of computer science. I'd like to ask you, with the current
trends of incoming users and all the data, do you expect in three to five years to
be still scalable and be able to provide the same sort of quality?
>> Deepak Singh: Yeah. I mean, that's what we do, right? Today we have a
whole group whose entire job is building our infrastructure and making sure that it
keeps scaling. The proof is in the pudding: so far we've been around, what, three
years, we've done pretty fine, and we keep growing. And the kind of users and
the kind of load on the systems keep growing. Obviously you learn along the way
about what different kinds of workloads come on, and you make changes to
adjust the systems to that. The good news is that that's a core competency;
that's what we're thinking about all the time.
That's why we have folks like James Hamilton -- I'll pick a name -- for whom
that's kind of what he thinks and breathes all the time. So absolutely. This is a
separate business unit, you know, a core business unit for Amazon, so it's as
serious as retail. We spend as much on it as we would on that.
>>: Steven Wong, Rice University. Have you thought about problems where you
have highly interactive applications, such as online games and those sorts of
things, where you've got lots of things going on and big scale requirements, but
they're highly interactive, with network speed requirements?
>> Deepak Singh: So I think eight of the top ten Facebook games run on
Amazon. So that's happening today. FarmVille runs on Amazon. Everything
that PlayStation does runs on Amazon. There are games from big gaming
companies -- I can't mention which names they are -- console games that are
running on Amazon today. So that's already happening.
People build their applications to address this sort of distributed infrastructure.
They might build peering arrangements with us. You know, there are ways to
make sure that performance is good. There might be companies that spend the
money on the network -- making sure they have good networking to us -- and not
on servers.
What won't work is ultra-low-latency systems where you have -- and I'll go off
gaming to high frequency trading, right -- where you have to do very, very
sub-millisecond trades.
There are a bunch of HFT companies running on AWS today. What they do is
build their models on AWS, then deploy them to the exchanges where they have
cabinets and do the high frequency stuff there. Every hour or two, as the markets
change because they've traded so often, they come back, rebuild their models,
and then deploy the new ones.
So you have to decide: if you're very sensitive to jitter, for example, where even a
little bit of jitter in the system is a problem -- and in a shared I/O resource, which
we talked about, that's going to be a problem. If you sort of understand that and
adjust for it, you can build pretty performant interactive systems as well. People
are doing that right now.
>>: A lot of those games are just sort of single user -- all you need to do is
update -- they're not necessarily ones where you've got these multi-user --
>> Deepak Singh: There are some. I can't mention them by name, which ones,
but there are those running today -- the big MMOs, or whatever they call them,
console games. I'm not much of a gamer. Yeah.
>> Dennis Gannon: Any other questions? All right. Let's thank Deepak one last
time for a great talk.
[applause]