>> Dennis Gannon: So we actually heard that the bookseller across the lake from us actually knew a thing or two about cloud computing, so -- no, really, seriously, we all owe Amazon a great deal of debt and gratitude for actually pioneering cloud computing for the last two or three years. And we have much to learn from them. And I'm pleased to introduce Deepak Singh, who will be telling us about Amazon and what they're doing in research and science applications. >> Deepak Singh: Thank you. Can everybody hear me? Sounds like it. So actually Ed just gave my talk, so you can all leave and get another one and a half hours of a break now. So, yeah, thank you, Roger, for asking me to speak. Usually I speak really quick, really fast. But we have a little more time than I'm used to, so I think we're going to relax a little bit, go slower. During the course of the presentation, if there's any questions or clarifications, please do ask. I like it being interactive. So just holler. Some quick history. So I've been at Amazon now for about -- well, it will be two years in June -- and all at Amazon Web Services. By day I manage EC2 from the business development side; that's what I do. But I spent my entire life before that in the life science industry. My background historically is mostly in large scale molecular simulations, and I started off as a quantum chemist. But before I came to Amazon I spent some time at another company out here in Seattle called Rosetta Biosoftware, which is now part of Microsoft, working as a strategist. So I learned more about genomics and large scale data management, gene expression, genetics and genotyping. So because of that my talk tends to be very life science biased, so I apologize to every other field of science for that. But that's the nature of the game. So let's get started. And the reason -- Ed laid it out quite nicely -- is we are here to talk about data. Not my favorite android, but the kind of data one gets from large projects, and historically has gotten from large space projects. We talked about how astronomy has been instrumental not just in creating a lot of data but in developing the sciences around it; they do not necessarily do a good job of managing it, but they analyze a lot of it. The high energy physics community, on the other hand, has had to do a lot of work in trying to figure out how to manage very, very large scale data projects across not just labs but across continents, because the LHC is not just a European project; there's people from all over the world trying to work on it, even when baguettes bring it down. But near and dear to my heart are instruments like these. This is an older Illumina Genome Analyzer. The root password and login information are on there. So if you wanted to hack into the Sanger Centre and start doing some sequencing, you could probably do that. But the difference that these things make compared to the space and LHC-type projects that we just talked about is the sheer throughput at which they generate data. Space projects and high energy physics projects take 10 years; people collect data for a long period of time and analyze data for a long period of time. This instrument is a few years old, and it's obsolete. The data volume that's being pushed by these, and the changes in the kind of data and just the technology, is pretty intense. And this is happening in the life sciences and other fields as well, pretty much as we speak.
Another area that's interesting, and we've heard quite a bit about this, is sensor-based projects, things like the Ocean Observatories Initiative, where sensors, ambient systems all over the world, are collecting data which people are either analyzing after the fact or in realtime, depending on their needs. I forget -- I was talking to somebody from Newcastle who was talking about ambient neurological projects where they're trying to analyze data from people's homes and trying to predict whether they're having some kind of neurological symptoms. And of course computers themselves generate a lot of data. If you're a Web company, just clickstream data, as probably people who worked on the Live side of Microsoft know, generates terabytes and terabytes of data. There's people collecting data and trying to analyze it. People taking data from these instruments and creating second rounds of data, which can often exceed the initial data. So it's a very interesting time. It's a time where a single genome makes no sense. There's no such thing as the human genome anymore. Instead what we have is everyone's genome. We have genomes of cancers, we have genomes of every individual -- we could have genomes of everybody in this building, everybody on this campus. That's quite a lot of people. So what does that mean? Let's leave the space and LHC types aside for a second. Scientists who have typically been used to working in the gigabytes, the kinds of people who used to work with Excel and with joins across spreadsheets, are kind of getting into this whole terabyte scale thing. And they're just starting to get used to it, but it might be too late. By the time they get used to terabytes, the petabyte might be the old terabyte. That's kind of the joke we have at Amazon, that the petabyte is the new terabyte, you know. That's what we see. And if you believe some people, we're actually going to be at the exascale pretty soon. I think it's going to take most scientific projects, except the really, really large ones, a little bit of time to get to the exascale, but I've been wrong before. And the big difference, as I just sort of alluded to, is we're collecting this data really fast. It's not being collected over five years, it's being collected over weeks, and you have to analyze it over weeks. And the cost has gone down enough that it's no longer -- this is a picture of the Broad Institute in Cambridge, Massachusetts, which is a big genome center. They have 100 Illumina sequencers. But as Ed again said, you have folks like Ginger Armbrust who can get a sequencer and collect a lot of data and generate a lot of data and have to figure out how to manage it. So I completely agree. I did not put this in after Ed started speaking; I had this in earlier. It's actually a great book if anybody hasn't read it; it's available somewhere on the website. It's about how scientific discovery is changing -- I come from a simulation background where data sizes were in the kilobytes, trying to generate molecular trajectories and rotation functions and [inaudible] energy surfaces. But more and more, science is going into a state where you're collecting data and you have to try and analyze it to try to figure out what's happening, which means that scientists are going through a lot of change. And it's a very rapid change. And it's a change in scale.
It's a scale that a lot of scientists, especially the kinds that I've historically worked with, aren't used to dealing with, which means they have to completely rethink the way they handle their daily chores. They have to start thinking about things like data management. I like saying that data management is not data storage. I don't know how often I've heard from people, oh, I can go to Fry's and buy a terabyte -- you know, multiple terabytes of hard disk. Yeah, but that doesn't quite work. You have to try to make this data available to other people, share it with other people, go back to it and try and analyze it. They have to rethink how they process that data. A lot of the tightly coupled codes written in Fortran -- I've written a few of those myself, and they are a piece of junk -- don't quite work. And you have to start thinking about new ways to scale out your data. You have to realize that at scale a lot of the assumptions you make about infrastructure change. One of the last projects I was involved with when I was at Rosetta was a project with the FDA where we didn't have that much data, just short of a terabyte. There were ten institutions -- companies, academic groups, the FDA -- involved. And what we had to do was we were collecting this data and everybody had a different slice of the pie that they were trying to solve. But the project took a while to start because we had to ship the data from each group to each group on a disk, and it took about 10 weeks to get it FedExed to everybody and make sure everyone had it. That's not how you can share data anymore. In the past if you went to a data repository like NCBI, you could just FTP it down; you could bring your data to your desktop, to your workstation, and analyze it. But when the 1000 Genomes Project is multi-terabytes and you want it on your desktop, one, you have to find a hard drive big enough, which is not easily doable. You have to go to a shared system or something like that. Or you have to start thinking about things like the cloud, where you get a common area where you can start sharing that data. And it's a constant problem. And data volumes are changing rapidly enough that a lot of scientists haven't had time to figure out how to build these systems, how to think about it. And what they end up doing is getting in situations like this one. I don't know which lab this is. I won't call out names. This is about 188 terabytes of sequence data on somebody's lab bench. You can't really process it, because it's not connected to anything. You can't really share it, because you need all that data, and trying to move it around and shipping it to your collaborator is not going to be trivial. And what happens if one of those -- it's on a lab bench. There's water around this, solvents. What happens if something falls on it? You're in deep trouble. This is a picture I borrowed from Chris Dagdigian of the BioTeam. And he talks about the fact that 2009 was the first time that, A, he mounted a single petabyte volume onto a compute cluster. It was also the first time that he saw somebody being fired for a petabyte scale data loss. So science is getting compute and storage limited today. I know a lot of projects which don't start because they don't have the storage or they don't have the compute, or they don't have easily accessible storage or compute, and they have to scrounge around, try and find resources, go to their friends at a supercomputer center, which most of us have done at some point in time, and ask for time.
But as you need more and more resources, getting a friend to be nice to you becomes a little more difficult. So I started thinking about some of these problems a few years ago, and that's kind of how I ended up at Amazon. So for the next -- we'll see -- few minutes, I'd like to spend some time talking about Amazon Web Services. We hear various definitions of the cloud. Different people have different opinions of the cloud. For the purposes of this conversation and for what AWS does, we are infrastructure as a service. Essentially what we provide is a toolkit, a toolkit that allows people to do stuff, for what it's worth. It uses some basic building blocks. So how many people here have used any of AWS's services? Fair number. Bill, you're not allowed to raise your hand. I know you have. So EC2 is where I spend most of my time. For those who don't know, EC2 is essentially getting a virtualized server in an Amazon data center, or getting 10, or getting a hundred, or getting 200. Essentially, with simple Web services APIs you can provision lots of server compute resources. You can run Linux, you can run Windows, you can put your own apps on it. You essentially have control. Once you get the server, you control it from there. And people can do all kinds of things with it. Historically most of our customers came, because of Amazon's history, from the Web community: people doing a lot of e-commerce, people building new software-as-a-service stacks, and of course it evolved from there to people running SharePoint on AWS. On top of that, over time we've added things like load balancers and the ability to scale based on some metrics, based on some triggers. The other key component to this, what I call core infrastructure, is storage. And one of the first services that we actually launched at AWS was Amazon S3. It's the service that a lot of people tend to get involved with first. If anybody uses Twitter, that's where your thumbnail is, you know, just as an example. S3 is a massively distributed object store. It's not a file system. Every piece of data you push to it gets replicated across multiple data centers. The idea and design point is that it should be highly available and highly durable and that you should be able to scale up as your needs grow. And I'll get back to that a bit later. Things that we've added to it -- I'll speak to this again later -- include the ability to import and export your data from S3 using physical disks, using FedEx, because the network's not always the easiest way of getting data in and out, not for everyone. And at AWS we also consider databases a core piece of infrastructure, because the belief we have is that you shouldn't need to manage your own database, whether it be a relational database or not. So we have a service called RDS, which is essentially managed MySQL. What you do as a user is just provision a MySQL instance. It's MySQL as you know it. But you don't have to be the DBA. We've got people running it in the back, doing that for you. You can scale up and down. You can scale up your server with an API call. You can add new resources, you can grow your disk. Again, it's all APIs, or going through a simple interface that does it. You don't have to worry about that. We try and work on the optimization and the tuning. It's still a relatively young service, so it's going to evolve.
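As a rough illustration of what "provisioning with an API call" looks like -- this is not from the talk, just a minimal sketch using the boto Python library, with a made-up machine image ID, bucket and file names:

```python
# A minimal sketch, assuming boto and made-up names: start one virtual
# server on EC2 and push a data file into S3.
import boto

# Credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
ec2 = boto.connect_ec2()
reservation = ec2.run_instances('ami-12345678',           # hypothetical machine image
                                instance_type='m1.small',
                                min_count=1, max_count=1)
server = reservation.instances[0]                          # the new virtualized server

s3 = boto.connect_s3()
bucket = s3.create_bucket('my-lab-sequence-data')          # bucket names are global
key = bucket.new_key('run-001/reads.fastq.gz')
key.set_contents_from_filename('reads.fastq.gz')           # replicated across data centers
```

The same calls scale to ten or two hundred servers just by changing the counts, which is the point of the API-driven model.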
One of the things that we announced when we launched RDS was the fact that most people don't know how to run MySQL in a high availability setup, so one of the things we're working on is the ability for you to fail over into a different data center without actually having to go through the pains of doing it yourself. We'll do it automatically for you. And SimpleDB is -- a key-value column store is one way of looking at it. It's, if I could use a buzzword, a NoSQL, schemaless database, which a lot of people use for storing metadata; Netflix, for example, uses it to store their queues. On top of that, we've added over time a bunch of other things. One of my favorites is a parallel processing framework which encapsulates Hadoop, called Amazon Elastic MapReduce. You know, a big chunk of our users run Hadoop for doing things like log analysis, if nothing else. And they were struggling at that time, especially when Hadoop was at 0.15, 0.16, just with keeping things up, keeping things stable. Hadoop's not exactly the most stable thing in a shared environment. So we decided to build a system that helps them stay away from managing the Hadoop side and just gives them an API to that. When S3 launched, a lot of people were obviously storing content in it and distributing content. I for years have had a video sharing site and a podcast where all the content goes into S3 and people download it, but obviously there's folks in other countries and far away, and people were essentially using S3 as a content delivery network. So we launched CloudFront, which is a CDN that sits on top of S3 and a bunch of Web servers that deliver the content to various locations. Messaging is a core part of our distributed infrastructure. If you're building sort of asynchronous applications and you want good fault tolerance and things that are loosely coupled, using messaging is usually a good way of doing it. We started off with SQS -- and actually SQS is our oldest service, I think it launched after S3, though -- which allows you to just run a simple pull-based messaging infrastructure in the cloud. And just two days ago we announced a service called the Simple Notification Service, which is a realtime pub/sub service. It's a sister service to SQS where essentially you publish to a topic, people can subscribe to it, and it gets delivered to them over HTTP or over SMTP in realtime. I think at some point we'll add SMS to it as well. And of course, we're Amazon; we have a good payments platform. And for those of you who like getting things looked at by humans, we have a service called Amazon Mechanical Turk. It's actually quite unique. If anybody has questions, catch me afterwards; I won't spend much time on it here. And at the end, we've added a bunch of other infrastructure management services. Amazon CloudWatch is a monitoring service that takes care of the infrastructure. It gives you a whole bunch of metrics. Based on those metrics, you can auto scale automatically. The two things that auto scaling does: one, let's say you're servicing a lot of load, you can add new servers just to handle the load and take them away when the load goes down. The other thing you can do is, let's say you want an infrastructure of 20 servers across two data centers and you want to keep it at 20 servers across two data centers, you can use the auto scaling functionality and these metrics to make sure that's happening.
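As a sketch of that "keep 20 servers across two data centers" idea -- not from the talk, just an illustration against boto's Auto Scaling module, with a hypothetical AMI ID and names; exact parameter names vary a bit across boto versions:

```python
# A sketch, assuming boto's Auto Scaling support; names and the AMI ID are made up.
import boto
from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup

conn = boto.connect_autoscale()

# What every replacement server should look like
lc = LaunchConfiguration(name='web-config',
                         image_id='ami-12345678',
                         instance_type='m1.small')
conn.create_launch_configuration(lc)

# Keep the fleet at 20 servers spread across two availability zones;
# if an instance dies, Auto Scaling launches a replacement automatically.
group = AutoScalingGroup(group_name='web-fleet',
                         availability_zones=['us-east-1a', 'us-east-1b'],
                         launch_config=lc,
                         min_size=20, max_size=20)
conn.create_auto_scaling_group(group)
```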
It's a good way to build nice fault-tolerant architectures. We have a management console. We have toolkits for Eclipse and .NET. And, I'll get back to this later, the ability to deploy virtually isolated networks, what we call the Virtual Private Cloud. And with all of that, what you can do is build your applications and services. But what's really cool about all of this is the fact that you can do it either using a simple UI or, the way I like doing it, sitting at a command line and orchestrating an entire infrastructure. You can stop servers, you can launch new servers, you can launch new services, you can launch Elastic MapReduce jobs, all from the command line tools; or if you're somebody who likes a UI, you can do it that way. Or if you like mobile devices, you can use your iPhone to manage infrastructure. So one of the things that everybody recognizes and associates Amazon Web Services with is elasticity. As an example, here is one of my favorite use cases. This is a hedge fund based out on Wall Street. This is a typical usage scenario for them. They run about 3,000 servers at night before the markets open, running a bunch of risk analysis models; during the day they sort of keep at a lower baseline. On weekends they're back at that lower baseline. But again, for about six to eight hours every night they spin up. It works very nicely. This is the kind of usage graph that you'll see across many, many, many customers, especially people in the data analytics and business intelligence space who have customers or financial modeling to do. The other thing that folks associate with AWS is scalability. An example of that on the storage side is SmugMug. It's a photo sharing site for what I call the prosumer crowd, because they don't have a free tier. Everything is paid for. And they support raw formats from SLR cameras and so on and so forth. SmugMug has been on AWS for a long time. This is an old number, from like a year and a half ago, when they had over a petabyte of data in S3, and they have grown since then. Ed's already shown you this one, which is the classic Animoto story where they had to go from a baseline of 50 servers to 3,000 servers because they launched a Facebook app. And that happens a lot. I mean, there's companies -- I know Ed says don't play FarmVille, but please play FarmVille; that way my revenue targets get easier to hit, because it runs on AWS. They would not have been able to scale as fast as they did on a non-cloud infrastructure. But the thing that people tend to forget, and I think this is what I like reminding a lot of scientists of, is that the people building these cloud infrastructures come from a world where highly available infrastructure is necessary. And most scientific infrastructures, like this one, are not highly available. As Werner Vogels likes to say, everything fails, all the time. Jeff Dean from Google gave a really nice talk at Cornell last year, and I'll steal some of his data from there, where he says things crash; deal with it. I think he had a number in there where he said you build an X-node cluster, watch it for a day and see servers fail. It happens. And when you are at scale, you have to deal with things like that. If you're running a workstation under your desk like many of us did in graduate school, or a few nodes somewhere, you don't have to deal with it that much. Two to four percent of servers will die annually.
It doesn't sound like much, but when you're crunching on a lot of data or a lot of servers, that becomes noticeable. So you have to prepare for failure. You can't just restart a calculation four days in when one node suddenly goes away and you have to start from scratch. I've had customers have to do that because of the clustering environment they were using. One to five percent of disk drives will die every year. One of the things that S3 does -- I'll get back to it, actually. But perhaps the most common error, especially in environments where your infrastructure's managed by graduate students who aren't really good at managing infrastructure, is going to be systems administration errors: somebody doing a bad config, somebody setting up something like SNMP wrong, somebody misconfiguring a cluster. And when availability of infrastructure and reliability of infrastructure become important, you really have to start thinking about who's keeping it up, how they're keeping it up, how you're doing it. So if you want your infrastructure to be scalable and available, you have to assume hardware and software failure. So you're going to design your apps to be resilient -- and these are not just the apps that are running on the infrastructure, but also the apps and software that are managing that infrastructure. You have to start thinking about things like automation and alarming. Now, again, in a research environment, in the scientific environment, that's not usually what most people think about, especially in a day when a small lab is able to generate a lot of information. At the University of Washington last summer -- my wife works in the genome sciences building -- every Sunday for a while, because it was a heat wave, their cooling system used to go down and you couldn't access the cluster. So for a day or two, you couldn't do any work. Maybe it's okay. I don't think it is. So that's the kind of environment that Amazon Web Services, Microsoft, Google -- that's the kind of environment we come from. For example, with EC2, everyone who launches servers in our east region gets four different availability zones. And an availability zone for all practical purposes is a failure-mode-independent data center. You can pick and choose which one you want to run in. You could choose to run across all of them, or you could choose to run in one. You have the option of running across AZs and building failure tolerance into your applications. S3 goes one step better. It automatically replicates data across physical data centers. It's constantly checking for bit rot. The goal is you shouldn't lose your data. You should have a highly reliable and durable infrastructure. So as an academic or as a researcher, whether you're in a commercial entity or in an academic entity, you shouldn't really need to think about, okay, how am I going to build a really reliable infrastructure? If a data center is a single point of failure, I need two. Where do I find the second one? How do I get distributed architectures? It's not something I've done before. Well, somebody has done it for you. You can think about using it. The other thing that happens is there's a whole bunch of other services that we've had to build over the years, either internally or for the people using us, that help people become scalable and fault tolerant; things like elastic load balancing, which balances load across data centers. I've already talked about auto scaling.
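As a rough sketch of that elastic load balancing piece -- not from the talk, just an illustration with boto and made-up instance IDs:

```python
# A sketch, assuming boto: a load balancer spanning two availability zones,
# with a couple of (hypothetical) back-end instances registered behind it.
import boto

elb = boto.connect_elb()
lb = elb.create_load_balancer('my-frontend',
                              zones=['us-east-1a', 'us-east-1b'],
                              listeners=[(80, 80, 'HTTP')])   # port 80 in, port 80 out
lb.register_instances(['i-11111111', 'i-22222222'])           # hypothetical instance IDs
# Clients hit lb.dns_name; traffic is spread across instances in both zones,
# so losing one availability zone does not take the application down.
```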
Elastic IPs are the ability to take an IP address and move it from server to server, pretty much. The elasticity relies on the fact that you can take an IP and move it from one server to another server. Elastic Block Store is again one of my favorite services. You can essentially provision block devices with an API call. You can resize them. You can move them from one server to another. SQS and SNS I've talked about. And CloudWatch is our monitoring service. And the other part is that they need to be cost effective. If James Hamilton were here giving this talk, he would spend half an hour talking about economies of scale for large scale computing, but I won't necessarily talk about that. The other core part of AWS is pay as you go: you should be paying for what you're using. As an example, I'll talk about EC2. So EC2 has three different ways of paying for it. The one that most people are familiar with is the one on top over there, called on-demand instances. You run a Linux server for an hour, you pay eight and a half cents. It was easier to do this math in my head when it was 10 cents an hour. So essentially what it means is if you're running 10 servers for an hour or one server for 10 hours, it's going to cost you roughly the same amount of money, because you're paying for the time the servers are on. Now, that pricing assumes a certain utilization. There are some people who are running applications all the time. They want those servers to be available all the time. And they don't mind absorbing some of the risk. So what they do is they pay up front. This is a pricing model we introduced about a year ago. They pay $200 or so up front for a small instance, and instead of an operational cost of eight and a half cents an hour, they now pay three cents an hour. So that's a model that works really well for people running databases, highly utilized infrastructures, which is not that many people. And last Decemberish, we came up with my favorite pricing model, which is spot instances. At any point in time we carry some overhead; there's pieces of the infrastructure not being used. So what we created was a market. Essentially as a user you can bid a price on it -- for example, on an eight-and-a-half-cent instance -- for doing some job that you may never do otherwise because you don't have the compute time, or something that you don't believe is worth paying eight and a half cents for. An example I use is in chemistry: there's 2D-to-3D transformations that you have to do for a chemical library. It's just grunt work. You have to do it. But I want it to happen and not have to worry about it. Spot instances are great for things like that. It's actually great for things like what David Baker has developed with Rosetta@home, because screen saver projects assume the screen savers get shut off; they're meant to be run piecemeal. So what you can do is say, I won't pay more than three cents an hour for this server. And underneath that we have a dynamic market that changes based on supply and demand. So if the price falls below your three cents an hour, your instances start. Let's say the price is at two cents an hour: you essentially pay two cents an hour. The price goes up to 2.2, you pay 2.2. The caveat is if it crosses three, what you had bid, your servers get shut down. So if you are used to checkpointing, if you're used to assuming that things go away, this is a great infrastructure.
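As a sketch of what bidding on that market looks like -- not from the talk, just boto with a made-up AMI ID:

```python
# A sketch, assuming boto: bid three cents an hour for ten small instances.
# While the spot price stays under the bid, the instances run and are billed
# at the market price; if it rises above the bid they get shut down, so the
# work has to checkpoint and tolerate interruption.
import boto

ec2 = boto.connect_ec2()
ec2.request_spot_instances(price='0.03',             # maximum bid, dollars per hour
                           image_id='ami-12345678',  # hypothetical machine image
                           count=10,
                           instance_type='m1.small')
```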
So what do people use this pricing model for? Scientific computing is actually a great use case. Last week I was talking to somebody at a financial services company who was running MATLAB workers exclusively on spot instances. Each job that he has is a few minutes, and if he loses one or two, he can always restart. Web crawling is another one that's very common: you're indexing stuff, and you don't mind if some of it has to be redone. And of course a lot of these screen saver type projects work really well. So the other core tenet of the infrastructure is security. You know, you have cameras trained on you. When you are building a large scale infrastructure, especially coming from a retail background, you have to be very particular about security. I won't go into the details of our security model; if people want to talk, I'm happy to do that after the fact. But one of the things that we do, especially at the instance level, is move a lot of the software that would normally sit in your router to just below our virtualization stack. So essentially it acts as a traffic cop and inspects every packet coming into every instance. And it decides who gets access to it, what should be done with it, and so on and so forth. But people still like more control, so for those people we developed something called the Virtual Private Cloud, which our folks in Virginia who developed it call a data center on a stick. Let's say you have a hardware device in your data center. You have a corresponding gateway inside an Amazon data center. And what you essentially get is a virtually isolated network, which right now is so isolated you can't access anything else from it. If you want to access even a file on the Web, you have to go through the gateway. So what this means is you can bring your own address range into AWS and you can use your standard network security policies. The network engineers and the guys managing security love it because they don't need to change anything. You can create your own subnets, build your own DMZs and so on. Over time we'll open up the kimono to other AWS services and allow you Internet access. But we started off by going the usual default route, which is basically shut everybody out and assume everybody is malicious. Which works. So that was the what-is-AWS spiel. Now let's get to the fun part and talk about why I believe science and Amazon Web Services are a good thing -- and sort of large scale distributed cloud infrastructures in general. We talked about the Sloan Digital Sky Survey in the last talk. There's a project called Galaxy Zoo. How many people have heard about it? Cool. It's one of the few talks where more than one hand went up when I talked about it. So Galaxy Zoo 1 ran on a server at Johns Hopkins. For those who don't know, Galaxy Zoo is a citizen science project where they take data from the Sloan Digital Sky Survey and allow people who are not necessarily astronomers to classify galaxies based on certain criteria. They never expected it to be this popular. Now, this could be an urban legend, but apparently the first time, with the first Galaxy Zoo, they got more traffic than they expected and the server at JHU caught fire. Which is a problem. So when Galaxy Zoo 2 started, they decided to go for a much more scalable infrastructure and build it on AWS with a nice, sort of loosely coupled infrastructure.
They actually used a tool called vehicle-assembly, which you can get from GitHub, and which was written to deploy genome sequencing algorithms. So the guy who built this used to work at a genome center, has moved to Oxford, and this is now running at Oxford in the UK. These are just some of the early stats that they saw. In the first three days after Galaxy Zoo 2 was announced they saw close to 4 million classifications, 15 million in the first month or so. And at one point they were tracking how many clicks they got over a 100-hour period, and they had over two and a half million clicks. So if the first server caught fire -- if they were still running on it, I don't know what would have happened, but this is a fun project. And it becomes possible. And I've seen other examples of people running large scale Web infrastructures which have outgrown their expectations. One of the things you often see in the protein structure prediction community is you find prediction servers, but when you go to them and you submit a simple sequence, you basically have to wait for hours while you're in the queue. But if you want people to be able to access things like classifying galaxies, make it available to not just the experts -- things like Foldit -- you need to start thinking about good Web infrastructure. And I think this is one way of getting that. But the reason we are here is we have lots and lots and lots and lots of data. And lots and lots of data has challenges with scalability and availability, as I hope both Ed and I have convinced you. But then you have scientists who are data geeks. Vaughan Bell has written a book called Mind Hacks -- if anybody has not read it, it's a great book. And Duncan Hull, who doesn't quite look like that anymore, works at the EBI these days, I think. They're basically informaticians; you know, they like mining and analyzing data. What they don't necessarily like doing is worrying about the infrastructure that it's running on. So data management is a huge challenge. Data management -- as Ed told you in his last talk, people often manage data by putting it in flat files. I remember at my first startup, we took data from the PDB. For every piece of data that came in, we looked at the four letter code of the PDB file, took the middle two letters, made them a folder, and everything came in underneath that, so we had hundreds of folders in a hierarchy that went below that. In those days it was okay because the PDB was small. If you did it today, you'd have a pretty rough tree to have to traverse for every single job. And as scientists, what you want to do is answer questions. Your question might be: tell me everything I know about Shaquille O'Neal -- from the gene expression data, from the proteomics data, from the GWAS data, you know, from a pathway database -- and collect all this data and be able to analyze it. And for that you need good data management systems. You need systems that allow people that you collaborate with, or other scientists, to be able to go in after the fact, maybe even after the primary analysis, and try and figure out: okay, here's the new data I have, here's all the correlated information. What's Shaquille O'Neal's biology? Why is he 7 foot 2 inches tall, and why doesn't he rebound well? Maybe not that question. And it's not just about flat files; it's not just about databases, either. One of the things that AWS gives you is choice.
You can use a managed database like SimpleDB or RDS. You could use a massively distributed object store like S3. Or you could build stuff on your own using EC2 and EBS. For example, there's a company called Recombinant. This is something that they built for a big pharma company: a biomarker warehouse using Oracle. It's about a 10 terabyte warehouse over three years, is what they [inaudible]. They could have built it themselves. This is, as I said, a top 10 pharma company. But they actually found it to be more cost effective, given the utilization rates, to build it on AWS. And it's a proper biomarker warehouse where they can go and do all the stuff that you could actually do with a system like Rosetta Resolver, which came out of Rosetta, where I was before. It also changes the way you approach data processing. Sure, today on AWS you can run things like UniCloud. UniCloud is by a company called Univa UD. They run on top of Sun Grid Engine, but they provide basically policy management and user resource management. Sun Grid Engine now natively supports EC2, so if you don't want the sort of business process management that Univa provides, you can go straight to Oracle, I guess, now, and get Sun Grid Engine. I don't think they've scaled it yet. Or, we talked about Condor. Cycle Computing is sort of in the middle. They take technology like Condor, but they sort of abstract it away from the user and provide a RESTful API. So as an end user you talk to the API rather than to the Condor pools themselves. And they manage the back end, the load balancing and the scaling. So, you know, they have an application called CycleBLAST, for example, which allows you to do that. But the real fun comes when people start building their own cool tools. So StarCluster comes out of MIT. It's essentially a cluster management system used for all kinds of fun projects. Again, the website is over there if you want to take a look at it. Anyone can download this [inaudible] and run a StarCluster installation. The folks at the New York Times -- a bunch of them got together and set up a non-profit called DocumentCloud. And part of DocumentCloud's goal is to make the infrastructure they've built for things like the New York Times, and the things they do within that, available as open source. This is something called CloudCrowd. CloudCrowd essentially uses SQS queues for its job state. Even Sun Grid Engine is fairly heavy in how it manages resources; this is a very loosely coupled way of having one server spin up workers when it needs to and chunk up jobs as required, based on a certain set of metrics. The source data is always in S3. Again, if you are a Ruby user, that's how you install CloudCrowd; it's just a RubyGem. RightScale has a very nice, again very loosely coupled, dynamic clustering infrastructure called RightGrid. Again, it uses SQS, our queuing system, as the job state. You can essentially set up error queues, audit queues, output queues, and have a bunch of messages sitting in the input queues from which it makes decisions. Part of the decision is that elasticity function up in the corner over there. Your elasticity function can be the number of jobs -- so based on the number of jobs, and how many jobs per server, it will scale up and down. The other elasticity function they have is the money you want to spend, and it basically scales up your cluster based on how much your budget is.
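As a rough sketch of that queue-as-job-state pattern -- this is not CloudCrowd's or RightGrid's actual code, just an illustration with boto; the queue name, S3 paths and the jobs-per-server number are made up:

```python
# A sketch of the loosely coupled work-queue pattern, assuming boto.
import boto
from boto.sqs.message import Message

sqs = boto.connect_sqs()
jobs = sqs.create_queue('analysis-jobs')

# Producer: each message is just a pointer to a chunk of input data in S3
for path in ('s3://my-bucket/input/chunk-001', 's3://my-bucket/input/chunk-002'):
    msg = Message()
    msg.set_body(path)
    jobs.write(msg)

# A toy elasticity function: one worker per 50 queued jobs
backlog = jobs.count()
desired_workers = max(1, backlog // 50)

# Worker loop: read a message, do the work, then delete it. If a worker
# dies mid-job, the message becomes visible again after the timeout and
# another worker picks it up -- that's where the fault tolerance comes from.
m = jobs.read(visibility_timeout=600)
if m is not None:
    chunk = m.get_body()      # fetch and process the named S3 chunk here
    jobs.delete_message(m)
```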
RightGrid is pretty cool and pretty neat, and it's very dynamic: if you want to move clusters or nodes in and out, you can do that. That's non-trivial with a traditional cluster setup. And again, you can do it either through the UI or through the RubyGem. So there's multiple ways of doing this. And using this kind of stuff, people have started doing pretty interesting things. As Ed already talked about, there's John Rehr's work on FEFF. You know, part of the interest in FEFF was: can we take this old Fortran code designed to run on tightly coupled clusters? I believe when I last talked to John, he said -- or maybe it was the guy who funded him at the NSF -- that there's a typical loop in it, and if all you had to do was make that loop embarrassingly parallel, then you'd have solved half the problem. And I think they've continued to work on that. The folks from the BioTeam used the RightGrid architecture. This is again for a pharma company, running David Baker's Rosetta application, which actually put my first company out of business. And it basically sits on the queue. You're basically looking at how much data do I have, how many jobs do I want to run, and based on that -- a bunch of JSON messages -- you spin up the number of workers that you need and you pull data from S3. And part of the reason they built it was they got tired of waiting for IT to provide them the infrastructure that they needed. This is a group in a pharma company that's not at headquarters; they're on the other side of the country. They don't necessarily always get the love that they think they need. And this is an infrastructure that worked really well. Scientists can run it. Somebody built it for them. They don't need IT support. It works quite well. IT found out eventually and they started managing it. My friend Matt Wood, who was then at the Sanger, one day decided, hey, let me try assembling a genome on AWS. This is a couple of years ago, before I was at Amazon, I think. He took about 140 million reads from a 454 sequencer. He wrote two open source applications called mission-control and launchpad. Launchpad and vehicle-assembly, which I already talked about, deployed his application on AWS. And mission-control is this Ruby on Rails application which he used to monitor his jobs. And he was up and running very quickly -- basically in the time it took him to write the code and deploy it. It makes it very easy for people who are interested in writing algorithms, writing deployment systems, writing infrastructure as developers to experiment, to do that stuff, and then make the stuff available to everybody else. Because, you know, it's just a Linux server. If you can run an Amazon Machine Image, you can do pretty interesting things. I talked about CloudCrowd. That was written originally at the New York Times to take images and transform them into another kind -- you know, transforming image formats. This is a group at Penn or Penn State -- I think Penn -- which basically took BLAT, which is a genome alignment sort of algorithm and tends to be somewhat memory heavy. They used CloudCrowd -- and this is how you actually run it; again, CloudCrowd is a Ruby app -- wrote up a quick script, pointed it to the input files, and voila, 32 hours and 200 bucks later they were done. And they've continued to do that. That group actually does a lot of proteomics on AWS as well.
It's basically -- you have all these tools floating around, built by people to do one thing, and somebody else finds them and does fun stuff with them. The other interesting area is heavy-ion collisions. This is from a project called the STAR project, I think, which uses the ion collider at Brookhaven, if I remember correctly. They had a conference to get to very quickly -- they had to get data for a conference and the internal resources were kind of not available. And very often I find that the first time an academic or researcher will use AWS is when they have a conference to go to, a paper to submit, or the boss is asking for something over the weekend. They used something called Nimbus, which is a context broker developed at Argonne National Lab. They have their own sort of cloud-like environment at Argonne. But one of the things they designed it for was that, using Nimbus, you can easily go on to EC2, because their environment is still limited; if people want to get more resources, they can just overflow on to EC2. So they used the Nimbus environment, provisioned a bunch of nodes and got the simulations done. I'm not much of a particle physicist, so I'm not 100 percent sure what exactly they found from it. Ted Lyfield [phonetic], whose name's missing, at the University of Melbourne, has built an infrastructure which he plans to share with his collaborators to do Monte Carlo simulations for the Belle experiment, which is part of the Large Hadron Collider. It's again a very nicely, loosely coupled, messaging-based infrastructure for running a bunch of workers to analyze the data they get from the experiment. The experiment, I don't think, has quite kicked into full gear, but at some point hopefully it will. And there's a bunch of people collaborating on it; Ted happens to be at Melbourne. And they do that quite a bit. But everything I've talked about until now has still been mostly chunking up jobs across worker nodes, which is something that we've done for years. Yes, having computing as a commodity, having fungible servers as I like to say, makes it a lot easier and a lot more flexible. But what happens when you have tons of data? You can't really chunk it up that easily, you can't move it around servers that easily. Your disk reads and writes are slow and expensive. Data processing, on the other hand, is fast and cheap. So one solution to that is to distribute the data and parallelize your reads. And that's something that Hadoop does really well. For those of you who may not know what Hadoop is, it's not a cloud environment per se. You can run Hadoop anywhere, but it sort of fits the cloud paradigm really well because it's designed to be fault tolerant and it's designed to run on commodity machines. It assumes that hardware is going to fail and takes that into account. So what does it have? It has two components. One is a distributed file system, HDFS. And the other part is a MapReduce implementation. So Hadoop was developed by Doug Cutting as an extension to the Lucene project, after he read the MapReduce paper from Google; after that he was hired by Yahoo!, where he did a lot of his MapReduce work. And now he works at a company called Cloudera. Ed showed you a picture of Christophe Bisciglia. Christophe is one of the founders of Cloudera. And they're sort of the Red Hat of the Hadoop world, so to speak. So how does MapReduce work?
Well, it's pretty simple in some ways. As a developer, you write a map function, which takes a bunch of keys and values and creates a bunch of keys and values, a list of keys and values. And at the reduce phase, you essentially aggregate them. It's more complicated than it looks over here, but it works very well, especially for large data sets -- for things like aggregation, for things like analyzing log files, and a bunch of other use cases that the people, especially in the Web companies, who sort of developed and pioneered this didn't really think about. But you still have to write functional programs. And not everybody in science especially, or outside science, really knows much about functional programming. So there are some very nice frameworks that have been built on top of Hadoop. One of the nice things about being an Apache project is people can do a lot of interesting things on top of you. So Cascading is a framework that I really like. It's essentially a dataflow system, very similar I think to Dryad in some ways, where as a user you define a dataflow, and Cascading takes on the whole job of writing the MapReduce functions underneath it. You are just writing a dataflow. Now, Cascading also works with any language on the JVM. So you can write Cascading in Java, you can write it in JRuby, you can write it in Clojure -- pick a JVM language, Scala, you know, things like that. The folks at Yahoo! developed something called Pig, which is again almost like a scripting language. I like to think about it as Perl for Hadoop, because you're basically writing in a scripting language, and that takes care of your Hadoop jobs. And the folks at Facebook developed something called Hive. A lot of their product managers and analysts are used to writing SQL queries, so they wrote a SQL-like language that, again, sits on top of Hadoop. Facebook has terabytes and terabytes of data, and they need to be able to do ad hoc analyses, and Hive's been really key to helping them get there. So there's all these higher level languages and higher level systems that have come on top of Hadoop, which has made it a very dynamic and growing system. And it's one of the biggest use cases on top of AWS, especially in the Web world. You have folks like Pete Skomoroch, who now works as a research scientist at LinkedIn. He wrote a sort of reference architecture for a data mining system, where he took data from Wikipedia, combined it with Google News data and used Hive to develop an application which he called Trending Topics, and it looks at topics on Wikipedia over a period of time. He released the data as an Amazon public data set -- I'll talk about those in a bit as well -- and he's made the whole source available; you can actually get all of this. This is a very quick idea of the architecture. It uses a Rails-based application. It uses MySQL. It uses Hive. It uses Hadoop. And it uses our Elastic Block Store system, which is where the data sets reside. The part that I really like is the work -- and this is, I would say, pioneering work -- that's been done by Mike Schatz at the University of Maryland. The life science community historically has had an "I can get the biggest box and use up all the memory" attitude. Most of the programmers are folks like me, who know how to write algorithms but don't know how to write good code. And it can be a mess.
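To make that map-and-reduce contract concrete before the genomics examples, here is a minimal word-count-style sketch in the Hadoop streaming style -- not any of the frameworks named above, just an illustration:

```python
# mapper.py -- a sketch: emit one "key<TAB>value" pair per word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write('%s\t1\n' % word)
```

```python
# reducer.py -- a sketch: input arrives sorted by key (that's the shuffle
# phase), so equal keys are adjacent and can be aggregated in one pass
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t')
    if key != current:
        if current is not None:
            sys.stdout.write('%s\t%d\n' % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    sys.stdout.write('%s\t%d\n' % (current, total))
```

The same shape -- emit key/value pairs, let the framework sort and group them in the shuffle, aggregate per key -- is what the genomics examples that follow exploit, with k-mers or alignments as the keys.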
And when that kind of code starts scaling up, and as the nature of your sequence data changed and you started getting shorter reads, you've had a problem. So what Mike decided to do a few years ago was try a simple problem. He figured that these k-mers, these short read fragments, fit the MapReduce paradigm very well. And he decided to see if it worked; if it didn't work, it was not going to make any sense. So this is a simple application called CloudBurst that he developed a few years ago and open sourced -- I think that's where it's been published. It essentially does a simple read alignment against a reference genome. You have the map phase, which is where you catalog and essentially emit the k-mers from your reads and your reference. You collect all the seeds -- because it uses a seed-and-extend paradigm -- at the shuffle phase, which is actually the magic part of Hadoop, in the middle; map-shuffle-reduce is what it should be called. And at the end, in the reduce phase, it does the alignment. And it worked really well. I don't have the exact numbers for his scaleup, but on a 96-node EC2 cluster he got about a 100-times speedup over the traditional sequential algorithm that's sort of the gold standard in this community. And this was all just a start. I think where things got really interesting is when he collaborated with Ben Langmead; both of them are working with Mihai Pop and Steve Salzberg at Maryland. Ben's the developer of a program called Bowtie, which is one of the better known aligners in the genomics space. What they did was they used a feature of Hadoop called Hadoop streaming, where you don't actually need to write the functional program; you can stream in existing code. They had to modify it a little bit. But what they did was, for the map phase, they used Bowtie, which does very fast alignment. For the middle phase, the shuffle phase, they essentially binned, partitioned and clustered all the intermediate data. And they took another open source algorithm, SOAPsnp, that detects mutations -- SNPs -- on those alignments. And they now have a pipeline, as a MapReduce function that anyone can get and run, called Crossbow. And this paper -- I forget when they released it in Genome Biology -- was for a while the most read paper at Genome Biology, for quite a bit of time. And actually a lot of the interest in cloud computing has come from that, because their paper was titled around SNP discovery with cloud computing, because all of this work was done on EC2. Here's some numbers that they got. Now, the cost doesn't really mean anything, because there's other costs associated with it; this is just the raw compute cost. But it was solving a real problem, essentially because a lot of these aligners were very memory intensive. The work that Mike's doing right now is actually probably even more interesting, because assembly is a significantly tough and complex process. It takes big machines -- there's even codes out there that use close to a terabyte of memory. And he's trying to implement this in a Hadoop-type environment, a MapReduce environment. There's a paper they're working on, on assembly of large genomes in cloud computing. Bacterial genomes are easy, human genomes are hard. This is still work in progress, so I won't show the data. As I alluded to earlier -- and I'm showing you all the stack that we have for AWS -- part of the problem is, A, Hadoop can be a little temperamental. As Mihai has found out, their Hadoop cluster at Maryland keeps crashing.
Mike actually lost a whole bunch of data because of that recently. And he's within two weeks of defending, so he's not a happy camper. So we decided to develop something called Amazon Elastic MapReduce. Elastic MapReduce is not just a wrapper around Hadoop. It's actually a job flow engine, where you essentially define a job or a series of jobs and then you pass the definition over to Elastic MapReduce and it takes over from there. It uses S3 as its source data store, because hopefully you have your data in a nicely redundant environment. And then it does all the fun stuff in the middle. It takes care of things like node failure. Until recently Hadoop didn't have a good debugger, so we actually created a debugger which is available as part of Elastic MapReduce. Yesterday we launched something called bootstrap actions, which allows you to put arbitrary applications and things like that on your Hadoop clusters, because we don't necessarily want people logging in and doing that; it's designed to be an abstraction. And one of the first things we did when we launched Elastic MapReduce was take this CloudBurst algorithm that Mike had developed and put it on there as a sample application. All a user does -- and literally this is how you use it from the UI -- is tell Elastic MapReduce where your source data is in S3; you point it to the S3 bucket. You put in a jar file or a streaming script, you tell it where the output is going to go and the size of cluster you want, and you hit go. And that's it. So as an end user, if you have a series of applications available to you, you don't really need to learn how to write algorithms as Hadoop jobs. So the reason I like Mike's work so much is that it's actually inspired a lot of people to start thinking about algorithm development, rethinking how that code works, and then deploying it either as Amazon Machine Images, as virtual machines, or as part of Elastic MapReduce. And we continue to work very closely with Mike and Ben, and have funded them through our education program. The other thing that makes the cloud very interesting -- and this is where I talked about how you can't just FTP data or send it over to each other -- is data storage and distribution, both public and private. On the public side we have our public data sets project, where you can get everything from the genome of Jay Flatley, who is the CEO of Illumina, to actually the first African genomes that were sequenced, from a particular tribe in Nigeria, to things like Ensembl over there, which is pretty cool. Data from the Sloan Digital Sky Survey. Global weather measurements, mapping data. All kinds of fun stuff. There's other people who are distributing data privately. So say I'm a company, I have customers who are running jobs on EC2, and I want to make my data available to them inside the same environment. That's also happening today. As I alluded to earlier, we have an import/export service. If you don't have a big bandwidth connection, you put your data onto disks, you electronically sign a manifest, you submit a job, you ship your disks, they come in to us -- we're good at moving stuff around and dealing with UPS and FedEx -- the data shows up in your S3 bucket, and you get an e-mail saying we're ready for you. That's really good for people moving data in.
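To tie the Elastic MapReduce flow just described together: here is a rough sketch of the same point-at-S3-and-go idea driven from code rather than the UI, assuming the boto.emr module; the bucket names, scripts and cluster size are made up:

```python
# A sketch, assuming boto's Elastic MapReduce support and made-up S3 paths.
import boto
from boto.emr.step import StreamingStep

emr = boto.connect_emr()

step = StreamingStep(name='read alignment',
                     mapper='s3n://my-bucket/scripts/mapper.py',
                     reducer='s3n://my-bucket/scripts/reducer.py',
                     input='s3n://my-bucket/reads/',
                     output='s3n://my-bucket/alignments/')

emr.run_jobflow(name='sample job flow',
                log_uri='s3n://my-bucket/logs/',
                steps=[step],
                num_instances=20,
                master_instance_type='m1.small',
                slave_instance_type='m1.small')
# Elastic MapReduce provisions the Hadoop cluster, runs the steps, writes
# the results back to S3, and tears the cluster down when it's done.
```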
On the disaster recovery side of that import/export service -- this is not a scientific problem necessarily -- people use the export part of it. So they keep a lot of data inside AWS knowing that they can export it out if they have some kind of DR scenario. This also enables things like sharing and collaboration. I don't have any pretty slides to show for it, but I've talked about it -- I mean, I think Ed also talked about it; he quoted Bill -- that the cloud makes a great collaboration environment. In that project that I did when I was still at Rosetta, we wouldn't have had to ship the disks to 10 people; we would have just shipped the data into Amazon and all of us would have had access to the same common global namespace, and we could have all worked in that space. You don't have to have 20 copies. One of the reasons that we started the public data sets project was we found that different people were all getting the same data set -- essentially 20 copies of the same publicly available data set inside AWS -- which earns us money, but it's inefficient. It makes a great software distribution system. Examples would be -- this is a Large Hadron Collider project that comes from the Max Planck Institute. It's called the ATLAS project, where Stefan Kluth has started doing prototyping work and has had people who wanted to try it. Go check it out. A very nice project -- if you can see anything -- is the Cloud BioLinux project. This comes from the folks at the J. Craig Venter Institute. What they did was they looked at the Bio-Linux distribution, which I think used to be Slackware or something like that, maybe Ubuntu [inaudible], which is a bootable CD with a bunch of standard bioinformatics analysis tools on it. They essentially created a cloud-based version of it, which you can get as an Amazon Machine Image with all the tools that the average non-expert informatician would need. Another really cool project -- and I've been a long-term fan of Galaxy since before I was at AWS -- the Galaxy project comes from Anton Nekrutenko at Penn State and James Taylor at Emory. It's been available for a long time. You could download it, run it. They now have a Web based application version of it. And last year they applied for an AWS grant, and what they ended up developing, and it's now publicly available, is Galaxy on the Cloud, where anyone can instantiate their own Galaxy instance. Galaxy started from metagenomics, but now it's a general sequence analysis application with a standard bunch of tools, all the viewers you need. So it's a one-stop pipeline for managing and looking at genomic data. And anyone can run it. All they need is an EC2 account and a credit card. The other thing that I think is a lot of fun, and where some of the best innovation is going on, is application platforms. These are not necessarily things specifically to do with science. Heroku, for example, is one of my favorite platforms. It's a Ruby on Rails platform where they take care of all the scaling and the management; all you have to do to push code to it, literally, if you're using Git, is a git push heroku, and your application is running on Heroku after that, pretty much. An example of that is Chempedia, which comes from Rich Apodaca down in San Diego. It's sort of a social Wikipedia of chemical entities. It's running on Heroku. People can go in and edit stuff if there's mistakes, add metadata to it. It's a fun project which he did over a weekend.
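As a generic sketch of what "all you need is an EC2 account" means for shared images like Cloud BioLinux or Galaxy on the Cloud -- the real images have their own documented IDs and launch tools; this just shows the bare boto steps with a made-up AMI ID:

```python
# A sketch, assuming boto: launch a shared machine image with SSH and web access.
import boto

ec2 = boto.connect_ec2()

# A key pair for logging in, and a security group opening ports 22 and 80
kp = ec2.create_key_pair('analysis-key')
kp.save('/tmp')                                   # writes the private key locally

sg = ec2.create_security_group('analysis-sg', 'ssh and web access')
sg.authorize(ip_protocol='tcp', from_port=22, to_port=22, cidr_ip='0.0.0.0/0')
sg.authorize(ip_protocol='tcp', from_port=80, to_port=80, cidr_ip='0.0.0.0/0')

ec2.run_instances('ami-12345678',                 # hypothetical shared image ID
                  key_name='analysis-key',
                  security_groups=['analysis-sg'],
                  instance_type='m1.large')
```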
As a developer, all Rich has to worry about is his code, not managing and deploying it. The other area which is a lot of fun, actually, is the geospatial space. And I'm actually really jealous of these guys, because I think the geo guys have decided that APIs are great, and they create great APIs. So there are companies like SimpleGeo, which worked with folks like Skyhook Wireless to start mapping out cell phone densities -- you know, maybe at some kind of event where everybody is watching the World Series or something like that, there'll be a whole big blob wherever the finals are happening. Not a baseball guy. But this is South by Southwest, so you can look at the density of, probably, iPhone usage over there. And that's Manhattan, which is a little more diffuse. That's probably where the conference center was at South by Southwest. Besides SimpleGeo, there are companies like Twilio, which does a telephony API. There's Exari [phonetic], which has stuff going on in the geo space as well. These are platforms on which other people can build applications. With SimpleGeo, as a developer I could take the core SimpleGeo APIs, mash them up with other things, and do a lot of interesting stuff. And it hasn't started happening in the sciences -- I wish it would -- where people would take tools like Heroku, take the APIs that AWS or somebody else provides, and start building these application platforms that other people can build on top of. Hopefully some day that will start happening. The other interesting area is business models, where you have a new way of getting new customers for more traditional companies like Wolfram Research and The MathWorks. So you can run MATLAB or Mathematica on AWS today, with the grid back ends. It's just another way: the customers don't have to spend money on hardware; they spend it on software and whatever it costs them to run on AWS. Or there are companies like DNAnexus which are completely in the cloud. DNAnexus actually came up at the meeting Roger and I were at last week; they essentially have a sequence data management and analysis system. They come out of Stanford, another Stanford VC-funded company. But it's a complete SaaS platform. Historically you would probably have bought this as a [inaudible] package, and it's happening a lot more these days. I'm probably still going to finish way ahead of time, so hopefully you'll get more time for Q and A. So to conclude, what would I like to say as a summary? I spoke to you a little bit about AWS, what AWS does and how it's used. For the scientific folks in this audience, you know, infrastructure clouds are designed for scale. That's what they were built for; that's where their origins are. They're also built for availability, so you can always get to them anytime you want to. You don't have to wait in a queue to get to a supercomputing system. You don't have to wait for servers to arrive just because you got more data than you were used to or ran out of disk. You have the ability to have shared dataspaces and global namespaces. As an example, in S3 if you create a bucket, it's in a global namespace. If you make it public, anyone in the world can have access to that one global namespace. It's kind of cool. You can name it "my data" or whatever else you want and everybody can get to it -- "my data" is probably taken, but still. The other part that's interesting, and I didn't talk about this too much, is task-based resources.
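A minimal sketch of the S3 global namespace point above, using boto3: the bucket name must be unique across all of S3, and an object granted a public-read ACL can be fetched by anyone. Bucket and key names are placeholders, and the account and bucket settings are assumed to permit public ACLs.

```python
# Sketch: one globally unique bucket name, one object, made world-readable.
# Names are placeholders; current AWS accounts often block public ACLs by default.
import boto3

s3 = boto3.client("s3")

s3.create_bucket(Bucket="my-data-example-12345")   # fails if this name is taken anywhere in S3
s3.put_object(
    Bucket="my-data-example-12345",
    Key="results/run-001.csv",
    Body=b"sample,value\nA,1\n",
)
# Make the object world-readable so collaborators can fetch it by URL.
s3.put_object_acl(
    Bucket="my-data-example-12345",
    Key="results/run-001.csv",
    ACL="public-read",
)
```

Once public, that one global name is the shared dataspace: everyone works against the same bucket rather than keeping 20 copies.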
Typically in a shared system, what you have is a shared file system and a cluster that is shared with about 20 other people, or you're running 20 different tasks on it. It's still a shared system. With AWS, every EC2 cluster is a discrete entity. You're not sharing resources with anybody else; you're not competing for resources with a second cluster. You have your own cluster. One of the reasons EC2 is very popular for educational courses is that, for example at MIT, their IT won't let students get root access on their clusters, but with EC2 they can. So their entire CS 50 course is now taught on EC2, because if students blow something up, it doesn't matter. And so you can stand up resources that are dedicated to a task, especially if you have this shared common dataspace that you can access. You can have people trying out new software architectures; Mike's a great example of that. There are other examples of people doing some very innovative stuff where they're taking advantage of loosely coupled systems and systems that are massively distributed. You can try out new computing platforms. You can deploy Sun Grid Engine, and I suspect you can deploy things like LSF on AWS as well, which is well and good. But you can also try these very dynamic systems, like CloudCrowd, or the STAR system from MIT, or the RightGrid system from RightScale. Or just build your own. It's not that difficult. And the best part is you can do it right now. You don't have to wait around to start running -- just start hammering at it. You have heard enough examples today. And one last plug: we have an education program. Basically what we do is -- I think we have three or four submission deadlines every year -- you submit a little abstract of what you want to do, and you get compute and storage credits to use. Here are some examples of the kind of stuff that we have funded using that. And with that, thank you very much. I'd like to definitely thank James, who I always learn a lot from, and Matt Wood, who used to be at the Sanger and now works for a company called Mekentosj, which makes a program called Papers -- if you're a Mac user and you like managing your PDF library, your papers, it's really, really good. And I love [inaudible] presentation slides. So thank you very much. [applause]. >> Dennis Gannon: Thank you, Deepak. We have ample time for questions. Just raise your hand and I'll bring a microphone to you so you can ask our speaker. Okay. >>: Hello. Thanks for your talk. [inaudible] from the University of Antwerp. I have a question on the new -- well, new -- spot market that you introduced. Currently it's fairly -- let's say not really transparent how this market operates. Is this confidential, or could you hint at how this market forms its prices, how it works? >> Deepak Singh: Yes. So it's our own algorithm right now. What we do provide is a history of pricing. So with the API you can get the latest price or you can get historical prices. In fact, there's a website called cloudexchange.com which tracks all our pricing history. Most of the time the prices stay at roughly about a third of the actual price. And every now and then, once in a while, it will go poof, because we don't have a ceiling on how high you can submit your bids. It's based on supply and demand. It's still early days.
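A small sketch of pulling that pricing history through the EC2 API with boto3, as described in the answer above; the instance type, product description, and time window are arbitrary examples.

```python
# Sketch: query recent spot price history for one instance type.
from datetime import datetime, timedelta
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

history = ec2.describe_spot_price_history(
    InstanceTypes=["m5.large"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
)
# Print a few recent price points: time, availability zone, price in USD/hour.
for point in history["SpotPriceHistory"][:5]:
    print(point["Timestamp"], point["AvailabilityZone"], point["SpotPrice"])
```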
You know, we decided to keep things simple because we want to see how people use it, what kind of usage patterns there are. And it will evolve. So at some point we might talk about how exactly it works. But I think how it works is going to evolve over time, too, as people use it in different ways. Right now what we expose is your pricing history. >>: And are you experimenting internally with that? I mean, looking at new mechanism designs to -- >> Deepak Singh: We continue to look at what else makes sense. So if you have any ideas on what you would like to see in that kind of dynamic capacity system, yeah, we're open to ideas. And as I said, one of the things we like doing is to start simple, see what people do, and then try and adjust to that. >>: That seems like a good approach. Thanks. >> Deepak Singh: Yeah. >>: [inaudible]. So for using EC2 for computer science experiments, we need [inaudible] like 600, 800, 1,000 instances simultaneously, because we are going to run a cloud computing platform on top of EC2. Is this feasible, or -- >> Deepak Singh: Yeah, it's definitely feasible. Now, if you're getting that many resources -- obviously, if you suddenly say I want a thousand, you'll probably get a call from me or somebody else trying to figure out who you are, whether you're legitimate -- you know, just trying to understand what you're doing. We have single, individual customers who use much more than that every day. So it's definitely feasible. We just have to make sure that it's legitimate use, which in your case it would be, and try to figure out the best way to make that available to you. >>: But in our case, of course, these are experiments [inaudible] the use is not for long-term [inaudible] our customers -- >> Deepak Singh: That's fine. A lot of our usage -- I won't mention names, but we have big companies with their own big infrastructures that do a lot of development work on AWS, where they're testing out new algorithms, and always at scale. So that's a very common use case for us: load testing. I think one of the classic use cases we've seen is a company called SOASTA that ran a million-user load test for MySpace, where they spun up 800 nodes for three or four hours. And they do it once a month -- 800 is rare; normally they do three or four hundred. So 800 nodes is, what, 1,600 cores. So it's definitely doable. Your default limit with EC2 is 20 instances; that's where you start. So you have to submit a request for more. And at that point you'll probably get a call from the appropriate contact for your location trying to figure out, you know, what's the usage pattern going to be? Because that also helps us manage capacity. If we know you're going to be on all the time every day, we can adjust to that. If we know you're going to come in and out, we can adjust accordingly. Yeah. >>: Hi. Chris Menzel, Moore Foundation. I'm just wondering if you could talk a little about the criteria you use for deciding what public data sets you provide. >> Deepak Singh: I think the core criterion -- because we are storing them as snapshots, EBS snapshots, today -- is that it should be something that's useful to a community.
If it's, you know, all the pictures you've taken in your life and you're the only person who would ever be interested in them, that's not so much of interest to us. But if it's a data set of a reasonable size -- we kind of do it on a case-by-case basis, and what we ask the person submitting the data set is who would use this, especially if it's something we have not heard of. Or in some cases we go to the data producer, like Ensembl, and say, hey, people are asking for it, can you start moving Ensembl over to us. Our approach has been that we'll see what the usage patterns are, and we might go back to somebody and say nobody is using it -- that hasn't happened yet. And again, it's still early days; that will also definitely evolve over time. Right now, as long as we are sure that you have the rights to put the data up there and you can convince us that people will actually use it, we are fine with it. >>: So, Alex [inaudible], the Netherlands. Over the past two years we've been doing experiments with various cloud infrastructures, infrastructure-as-a-service clouds, trying in particular to run scientific applications. And we've been surprised by the very poor performance that we've seen there. I have a question that is a bit longer. First, do you really plan to support scientific applications such as my colleague suggested before -- those 600 or 200 or whatever parallel applications, in particular? And, as you mentioned on a slide in your presentation, do you really not do any kind of resource sharing, in particular of the network? >> Deepak Singh: So, when I was talking about resource sharing -- just going back to how EC2 works -- the two things on EC2 that are hard-scheduled are CPU and memory. Those are hard-scheduled; that's the performance you're going to get. I/O as a whole is a shared resource. When I said you're not sharing resources, I meant you don't have one cluster that everybody is hitting on. If you create a cluster, and you're using the right instance type, you'll pretty much be the only person on those nodes. Now, what you don't have control of is node placement, right? And that's a reason why a lot of traditional scientific codes run into trouble: they expect a certain rack locality. They make assumptions. If you're running MPI code, they assume they're talking to the same switch. That's not our kind of environment today, and that's why you'll see poor performance. We can give you guidance on it. For example, if you use the large [inaudible] that we have, you're essentially getting the full network for that box. But you don't know where the other instance it's talking to is. If you're lucky, it could be close by. But the more nodes you provision, the more spread out they're likely to be. >>: [inaudible]. >> Deepak Singh: Yeah. And because our goal is to make sure that -- availability becomes the first priority, the first thing you design for -- you try and make sure that if a rack goes away, everything that you have doesn't go away. So we actually spread you out. Which works very well for embarrassingly parallel stuff, because you don't care. But for tightly coupled jobs, yes. So when anybody comes to me and says my thing doesn't scale beyond four nodes and I'm running MPICH2 and doing molecular dynamics, I'm like, yes, not surprising at all.
We're still trying to figure out what we will do in that use case. We don't have a good answer. As I was telling you earlier, we want to try and do it in a way that makes sense, if it makes sense at all -- we don't even know that yet. But if you're doing sort of distributed computing, embarrassingly parallel simulations, or any computation where I/O is not going to be most of your work, you'll be in good shape. The moment I/O becomes dominant, if you can't put it into a MapReduce-type environment, then you are going to run into these kinds of performance issues. >>: So, a related question. On your private cloud, though, aren't you actually pushing the workloads onto a very isolated piece of hardware, network, everything? >> Deepak Singh: It's virtually isolated. >>: Okay. Very good. >> Deepak Singh: Yeah. We're not giving you a dedicated set of boxes. So if you shut it down and you come back again, you're not going to get the same machines, right? >>: Okay. >> Dennis Gannon: Other questions, please. >>: Hi, I'm Lucas Kendall from Czech Technical University in Prague. You mentioned that there is interaction now with suites like MATLAB. Can you say a few more words about how that works and who's actually doing the parallelization? Is it you or MathWorks? >> Deepak Singh: MathWorks. The philosophy we take is that MathWorks knows how their software is built. So if you go to MATLAB and their grid product, you can choose EC2 as the deployment end point, and they will basically launch their grid back end, a bunch of workers, on EC2. The MPI part, I think, works, but it's not going to be that performant. It's more the grid back end that they really encourage people to use and that people are using. Same with Mathematica: you can essentially start from your notebook and just launch a grid back end, and they built it. We didn't even do anything at all; we just know that they're doing it. >>: So do you know what sort of license you need from them to do that? >> Deepak Singh: Yeah. Different companies do it different ways. I think Mathematica -- Wolfram -- went with a pay-per-use kind of licensing. I'm not that familiar with people using Mathematica. On the MATLAB side, they still use good old FlexLM. You need to have a license manager running somewhere else, and you check out your licenses. They do license borrowing, so it's floating licenses or whatever the standard Flex policies they have. That's not uncommon, actually. >> Dennis Gannon: Any additional questions? Okay. Great. >>: [inaudible] center of computer science. I'd like to ask you, with the current trends of incoming users and all the data, do you expect in three to five years to still be scalable and able to provide the same sort of quality? >> Deepak Singh: Yeah. I mean, that's what we do, right? Today we have a whole group whose entire job is building our infrastructure and trying to make sure that it keeps scaling. The proof is in the pudding. So far we've been around, what, three years. We've done pretty fine, and we keep growing. And the kinds of users and the kind of load on the systems keep growing. Obviously you learn along the way about what different kinds of workloads come on, and you make changes and adjust the systems to that. The good news is that that's a core competency. That's what we're thinking about all the time.
That's why we have folks like James Hamilton -- I'll pick a name -- that's kind of what he thinks and breathes all the time. So absolutely. This is a separate, core business unit for Amazon. It's as serious as retail; we spend as much on it as we would on that. >>: Steven Wong, Rice University. Have you thought about problems where you have highly interactive applications, such as online games and those sorts of things, where you've got lots of things going on, with scale requirements, but highly interactive in terms of network speeds? >> Deepak Singh: So, I think eight of the top ten Facebook games run on Amazon. So that's happening today. FarmVille runs on Amazon. Everything that PlayStation does runs on Amazon. There are console games from big gaming companies, whose names I can't mention, running on Amazon today. So that's already happening. People build their applications to suit this sort of distributed infrastructure. They might build peering arrangements with us. You know, there are ways to make sure that performance is good. There might be companies that spend the money on the network, to make sure they have good networking to us, and not on servers. What won't work is ultra-low-latency systems -- and I'll go off gaming to high frequency trading, right, where you have to do very, very sub-millisecond trades. There are a bunch of HFT companies running on AWS today. What they do is build their models on AWS, deploy them to the exchanges where they have cabinets, and do the high frequency stuff there; every hour or two, as the markets change because they've traded so often, they come back, rebuild their models, and then deploy the new ones. So you have to decide: if you're very sensitive to jitter, for example, where even a little bit of jitter in the system is a problem -- with the shared I/O resource we talked about, that's going to be a problem. If you sort of understand that and adjust for that, you can build pretty performant interactive systems as well. People are doing that right now. >>: A lot of those games are just sort of single user -- all you need to do is update -- but they're not necessarily the ones where you've got these multi-user -- >> Deepak Singh: There are some. I can't mention which ones by name, but there are those running today. Big MMOs, or whatever they call them, console games. I'm not much of a gamer. Yeah. >> Dennis Gannon: Any other questions? All right. Let's thank Deepak one last time for a great talk. [applause]