>> Rob Fatland: Good afternoon. Our speaker today is Doctor Anthony Arendt from the University of Alaska, Fairbanks Geophysical Institute. He is a career glaciologist who will be addressing us today on his work here at Microsoft Research as a visiting researcher working with Azure and with his modeling framework. With that, Anthony. >> Anthony Arendt: Thanks very much, Rob. It's really a pleasure to be here today and talk to you about the work that we're doing. I want to quickly acknowledge the invitation from Rob to come here as a visiting researcher. This has been a really incredible opportunity for me to work with Microsoft resources and really apply them to the science that we are developing at the University of Alaska Fairbanks. I'm also really grateful for the Microsoft Azure for Research grant which is supporting the use of the cloud resources that will be the basis of what I'll be showing you today. A lot of this research is supported by a wide range of funding agencies, most of which are shown here. A lot of collaborators, students and technicians, too many to list, but I want to just recognize that this research is a very collaborative effort and it comes from a lot of different folks at multiple agencies. Today I want to cover a few different topics that are summarized here. The first is to give you a brief overview of the scientific problems that we are addressing in our research and a summary of the different kinds of data sets. I want you to get a sense of the complexity of these different data sets in space and time and how this calls for a different kind of computational infrastructure than we have had in the past. I want to talk about the methods that we have been developing in Fairbanks over the past couple of years, sort of our academic approach to these problems and the infrastructure that we built around that. 
And then really some of these incredible tools that I've learned just in the past couple of months here at MSR that are really broadening those tools and allowing us to accomplish really innovative and collaborative science, which is the goal of all of our efforts here. Let's just jump into a bit of the science to give you a flavor of the kinds of things that we're looking at. We are studying the Gulf of Alaska system. This is a large band of glaciers and ice caps that surround the Gulf of Alaska, and these are some of the largest ice-covered areas outside of the Greenland and Antarctic ice sheets. The reason we've got so much ice in this area is that there are some very large mountain ranges. There are seven different mountain ranges in Alaska and they act as barriers to these massive storms that come in off of the gulf and deposit large amounts of precipitation in the form of snow. They get upwards of six or seven meters some years in some of these high mountain areas, and that drives the formation of glaciers that then flow down to the ocean and discharge that ice into the ocean. When you have all that ice and snow close to a coastal marine environment, there are a lot of complex interactions that we have to think about. One of the biggest things that we're concerned with is that there is a vast amount of fresh water stored in these glaciers and, as the climate is changing, this is being released into the oceans; this is a major source of global sea level rise. Alaska contributes somewhere around 7 percent to the global signal of sea level rise, which is very high considering how much ice is there relative to the ice sheets. Furthermore, we have to think about all this extra freshwater getting into streams and lakes and out into the coastal fjord areas, as that freshwater, which has a different chemical and temperature signature than the surrounding waters, enters these areas. 
It will have impacts on things like the ability of salmon to spawn and the kinds of terrestrial succession that occurs, and then the overall ecology of these fjord areas will really change rapidly as the climate varies and we get more and more of this water getting into these fjord environments. We get different salinity and temperature in these coastal areas, which causes changes in the coastal current as you get further out into the ocean. What we have here are really interdisciplinary and complex data sets, for which we need a computational infrastructure that allows us to do collaborative and cross-disciplinary research. A little bit more about the region that we are studying. This, in black, is the Gulf of Alaska watershed. It has about 27,000 glaciers, somewhere around 87,000 square kilometers. That's about 4 percent of the area of the state of Alaska. The annual discharge from this area is somewhere on the order of the discharge of the Mississippi River, so very high fluxes of water are passing through this area every year and we want to determine how this is changing over time. I'm going to take you through just a few of the data sets that we work with. Again, I want you to think about the complexity of these data sets, how they vary in terms of whether they are vector data or raster data, and how we can put together an infrastructure to handle all of this information. The first type is just basic inventory data, which is where the glaciers are located. These are really high density vector polygons that tell us the location of the ice, in this particular slide at two different times. The yellow is areas where the ice has retreated since the 1950s and red is areas where there has been a bit of advance. We need this basic information before we can run any of our models and do any simulations. 
We have point and line data that is, for example, collected when we just put a ground penetrating radar underneath a helicopter and we fly along, and we use that to determine the travel time and then the distance to the interface between the snow and the ice. That tells us the depth of the snow pack for that particular year. There's an example of us flying on a glacier near the Cordova region taking these measurements at a high density of point observations, and it gives us, you know, traces like this from which we can determine the layering of the snowpack, pick out that surface where we think the summer surface from the previous year is, and determine how much snow is there. That is then used to determine the freshwater available for that particular season. Go ahead. >>: When you fly over how do you get a picture of anything underneath the surface? >> Anthony Arendt: It's a 500 MHz radar and it basically penetrates through the snow, and wherever you have a discontinuity where there's ice it will show up like that. This is actually up in the higher accumulation area where we get multiple layers. It's called the firn zone. We have to carefully trace that line upwards and there are tracing algorithms that help us do that as well. We also get time series information from things like these high-altitude weather stations. Alaska is very data poor in the high mountain regions. Most weather stations that we use to run our models are located down near towns where people live, so we install these stations and we often get real-time satellite telemetry feeds of temperature, precipitation and so on, and that is another form of data that we are using in our analysis. Finally, a lot of remote sensing information; in this particular case we are looking at data from what's called the Gravity Recovery and Climate Experiment. And this is a really fascinating mission. 
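As an aside, the travel-time-to-depth conversion for the radar described a moment ago can be sketched in a few lines. This is a minimal illustration, not the actual processing code; the wave speed value is an assumed, illustrative number (the real value depends on snow density):

```python
def snow_depth_m(two_way_travel_time_ns, wave_speed_m_per_ns=0.23):
    """Convert a two-way radar travel time to snow depth.

    The radar pulse travels down to the last summer surface and back,
    so depth is wave speed times half the two-way time. The default
    speed of ~0.23 m/ns for dry snow is an assumed illustrative value.
    """
    return wave_speed_m_per_ns * (two_way_travel_time_ns / 2.0)

# e.g. a 26 ns two-way return corresponds to roughly 3 m of snow
depth = snow_depth_m(26.0)
```

In practice the picked travel times come from the layer-tracing algorithms mentioned above, applied trace by trace along the flight line.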
It's two satellites and we're measuring the inter-satellite range, which tells us the influence of gravity on the orbits of these satellites. From that we can determine temporal variations in the Earth's gravity. Here's a map of the overall net loss of mass from different regions around the globe, and you can see it really maps onto where we are seeing massive amounts of glacier ice loss. This is the time series for Alaska. You see the overall trend of mass change and then the seasonal cycle from which we can determine that runoff component. Again, Alaska is somewhere around 7 percent of the sea level signal. >>: [indiscernible] >> Anthony Arendt: That's the net change in ice over a year, so 65 gigatons of water is leaving that storage, the ice, and going to the ocean. >>: [indiscernible] by the year? >> Anthony Arendt: Why is it? >>: There is less water being released in 2011 than in 2003? >> Anthony Arendt: This is the cumulative mass, so we start at an arbitrary zero. We aren't actually measuring the actual mass of the glacier. We're saying relative to where it was in 2003 this is how much mass it has; accumulated mass change, basically. So in summertime it's losing mass. In wintertime it gains some mass, and then because of that trending downwards the net change of storage is around this value here. >>: When that gets down and flattens out the ice is all gone and now you know how much ice there was in the beginning. >> Anthony Arendt: True. >>: What is the highest contribution? Is it the Antarctic thing or is it Greenland? >> Anthony Arendt: Greenland right now is the highest, and what's happening here is not only just the melting of the ice but a big component of the loss is because of this calving, the breaking off of ice into the ocean. We call it dynamic mass losses. 
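The "arbitrary zero" point in the exchange above can be made concrete with a toy sketch: the GRACE curve is a running sum of mass changes relative to the start of the record, so only differences and trends are meaningful, not absolute glacier mass. The monthly numbers here are made up purely for illustration:

```python
def cumulative_mass_anomaly(monthly_changes_gt):
    """Running sum of monthly mass changes (gigatons), starting at an
    arbitrary zero at the beginning of the record, as in the GRACE
    time series."""
    total, series = 0.0, []
    for change in monthly_changes_gt:
        total += change
        series.append(total)
    return series

# Winter gains only partly offset summer losses, so the curve trends
# down even though it rises every winter (illustrative values).
changes = [10, 8, -20, -25, 12, 9, -22, -24]
series = cumulative_mass_anomaly(changes)
net_loss = series[-1]  # net change in storage over the record
```

The seasonal up-and-down rides on top of a downward trend, which is exactly why a later winter peak can sit below an earlier one without meaning "less water was released" that year.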
That's linked to the fact that the oceans are warming up and kind of eroding away where the ice is meeting the ocean, and then you get potentially very catastrophic losses of mass in those areas. >>: What percentage do you give Greenland credit for? >> Anthony Arendt: I think it's around 25 to 30 percent. I'd have to look that up. Our goal is to model this process as much as we can, to reconstruct past conditions and learn about the system, and we also want to be able to run future forecast models, say, where is the ice going to disappear in the next hundred years, those kinds of things. There's a lot of ways to do this. Go ahead. >>: This is a fairly narrow set of latitudes in which pretty much most of the melting is happening. Why is it that the other latitudes are not affected that much in terms of the ice melting? [indiscernible] goes around the Arctic, right, Russian Siberia? >> Anthony Arendt: So I cut the map off a little bit, but you're saying why is the loss focused on the… >>: Why is it not focused more on the [indiscernible] as opposed to the other latitudes? >> Anthony Arendt: Right. Okay. I understand. The biggest players are Patagonia, Alaska, the Canadian Arctic as far as the non-ice sheet areas. Also the Himalayas, but they're off the map. I just cut it off, but it's not as big a signal. There's not quite as much ice there to begin with, and also these areas are more susceptible because, like I was saying earlier, they are close to the ocean and are more sensitive to changes in the climate because of higher precipitation rates, and they have ice that actually goes into the ocean so they are sensitive to these dynamic losses as well. Really, there is not as much ice at middle latitudes. >>: You mentioned Siberia, for example. Siberia is this vast plain but it's not like big huge reservoirs and glaciers as much as it is just frozen ground. >>: So the main thing is, is it just frozen ground or is it meeting the water? 
>>: I think what Anthony is saying is when you have accumulated ice in glaciers, big huge masses of ice that are facing into the ocean and they are warm, that's what's happening in Alaska. You have this mechanism in place for transport of water into the ocean. Whereas in Siberia my understanding is you don't have these masses of ice. There are other things going on there that are related to climate change, but not so much water transported to the ocean. >> Anthony Arendt: This is the net loss; this is the trend in mass, right? That's this sort of red line mapped onto here. Siberia may still have seasonality, but it's not going to be trending so heavily because it just doesn't have all that ice which can be taken out of storage. I'll tell you a little bit about our modeling efforts. This is an energy balance model approach and it formally calculates the fluxes of sensible and latent heat. Incoming solar radiation is the main factor that drives the melting of snow and ice on a glacier surface. There are a lot of models to do this. We're using one called SnowModel, made by Glen Liston and others. It has these subroutines that calculate everything from the surface energy balance shown in the picture here, to the redistribution of snowpack due to wind, to the interpolation of climate data, so we're using these fairly coarse grid climate products and we have to downscale them to an individual watershed to get better estimates at an individual location, and then the routing of the water: when we generate some melt at a given location, how does that get distributed across the landscape? That uses a simple linear reservoir approach. Here's a picture of how that works. The input to the model will be things like a land cover mask, and that's where we go back to knowing where the glaciers are located with those polygons. These are examples of those gridded climate products. We're using reanalysis data which is kind of, are you familiar with reanalysis? 
It's like a weather forecast model but applied to past conditions, so it calculates all the physics of the atmosphere, but it also assimilates in observations at the stations to give you a best estimate of the past 30 years of climate or so. That gets fed into the model and then we get things like an estimate of the water balance at our location of interest. So this is an animation of the kind of output that we get. We're starting in the spring of 2004 and, this is a cumulative change in mass of the water, blue is where we are getting accumulations, these high mountain areas. You'll start seeing some red emerging and that's where there is a net loss of mass, and that will be the low elevation parts of the glaciers. We're getting into July and then it's going to stop here soon in August. You'll see this is a net mass balance of this entire region. Losses at low elevations, accumulations up high, and we can use that to inform a lot of our other estimates and comparisons to the GRACE data. Let's step back. I've shown you that there is point, line, polygon and gridded data, in multiple bands when we look at satellite imagery. We have varying spatial and temporal resolutions. We needed infrastructure to handle that. Go ahead. >>: Interesting, there are about two points with substantial red right at the top left or something. Those seem like the points with the most substantial red. Most of the other reds are closer to the water. >> Anthony Arendt: Yeah. This is the Alaska Range where, you are right. I picked this year. It's a record warm year. That was a year where the melt extended all the way up to the top. There was very little accumulation. Those glaciers were really hit hard that year. Are you asking that with respect to what I was saying about glaciers near the ocean? Yeah. That wasn't a hard and fast rule. 
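The simple linear reservoir routing mentioned earlier, where melt generated at a location is released downstream gradually, can be sketched like this. It's a toy stand-in for SnowModel's routing subroutine, and the storage coefficient is an assumed illustrative value:

```python
def route_linear_reservoir(melt_inputs, k=0.2, storage0=0.0):
    """Route a series of melt inputs through a linear reservoir.

    In a linear reservoir, outflow each step is proportional to the
    water currently in storage: q = k * storage. k (per time step)
    and the input values are illustrative, not calibrated numbers.
    """
    storage = storage0
    discharge = []
    for melt in melt_inputs:
        storage += melt           # add this step's melt to storage
        q = k * storage           # release a fixed fraction of storage
        storage -= q
        discharge.append(q)
    return discharge

# A single pulse of melt produces a smoothly decaying hydrograph.
hydrograph = route_linear_reservoir([10.0, 0.0, 0.0])
```

The point is the smoothing: a sharp melt pulse at one grid cell turns into a gradually receding discharge curve downstream, which is the behavior a flood hydrograph shows.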
These interior glaciers also are especially susceptible to the summer temperature, whereas here precipitation variability can play a major role as well. Things can really fluctuate dramatically here. But if you get a really warm summer those can be hit quite hard as well. In terms of sea level we are not as concerned about smaller glaciers. There are a few up in the Brooks Range as well. These are the main players because they are so large and have the potential to add a lot to the ocean. But that is a good observation. Getting into more of the computational part of the talk, what do we need to handle this information? Everything I've shown you has a geospatial encoding to it, so we need to be able to do spatial querying in order to accomplish that. We need to be able to route our data iteratively; some of our models require that we send the model output back in and do iterative model runs. We need direct access for our collaborators, not just within our institute, large storage capacities, and multiple processors to run these large models. I'll go through this quickly. I wanted to outline my different views of the computational infrastructure that scientists tend to use. This is my version of what I think typically happens, in at least my field, on the academic side: a researcher will work on an individual desktop, and typically we're using things like Python and Matlab, FORTRAN code usually for the larger simulations, and we generate a lot of flat files. Our raw data will be in tabular or flat file format, and then if we're lucky that will get uploaded to some kind of national data center, which is required by most of our funding agencies, and then sent out from a web server there to multiple web clients. Let me just quickly go through what I think is good and bad about this. 
This is a model that works okay for the individual scientist who knows where all his files are, but there are disadvantages because these flat files would require us to write a lot of elaborate scripts to do any kind of spatial relationships, and it's hard for us to collaborate with each other under that model. >>: [inaudible] into a database or is it stored in a non-relational [indiscernible] >> Anthony Arendt: It's stored as a text file, literally, so it's really not a great way to do things at all. Maybe, you know, some standardized… >>: [inaudible] >> Anthony Arendt: It gets sent, that flat file, up to a server somewhere. >>: [indiscernible] then it's just not multithreaded, right? >> Anthony Arendt: No. It's not, not at all. I mean, I'm speaking in general terms here. One thing that's good about this, I think, is that the data tend to be in these national servers and labs, more stable and longer-term. This is a question that came up in a talk I gave at the University of Washington recently. I'll be telling you about the Azure solutions that we are developing, but what happens to that over the long term? Is it as stable as something like some of these snow and ice data centers where we send our data? They tend to have fairly good version control and metadata requirements. But the main point here is that there is sort of a one-way flow of information out to the client, and if I do some reanalysis on this end or change some of the products that I'm working with, I have to go and submit it again here, and it might not get out to the end-user as easily as I'd like. This is the model I've been trying to develop in my lab, which I think is a bit of a step forward over the last couple of years. The core of it is to create a relational database for all of our work. We still write scripts to call into that and to do our analysis. 
If we're doing any modeling we run that on, you know, an institute supercomputing cluster, and then I have a local machine right in my office that is a web server in which this database lives, and then that gets served up through my institute web server and out to the community that way. I really think relational databases are fantastic for our work, and if more of us put things into this format we would be able to advance quickly, I think. This is just an example of some of the tables here. There are the detailed polygons. Here's some climate data, a time series of weather stations, stream discharge, that kind of thing. We can join and relate all of this information together by using standardized relational queries. You probably know this really well. I'm just showing you an example of a SQL query. I'll run through it very quickly. I just want to show quickly what you can achieve with this. This is an example of simply taking, say I want to know what is the average temperature in Denali National Park. There's the polygon outline for that, and with a very simple SQL query like that I can get, in just a couple of lines, the time series of the monthly data there. >>: [indiscernible] >> Anthony Arendt: That sample is 2 kilometer grids over the state of Alaska, so it's not huge. >>: [indiscernible] >> Anthony Arendt: Yeah, gigabyte size. We're using PostGIS. This is really where the functionality comes in, being able to do queries where one geometry is contained in another geometry. Do you work with SQL Server a lot? >>: I work with SQL. [indiscernible] they have some of the query languages though. >> Anthony Arendt: I'm going to talk about this a bit later. I tried to migrate over to SQL Server and I'm investigating whether the spatial querying abilities are the same as what I have with PostGIS. They are largely the same but there are some differences that are making that process a little… >>: [indiscernible] >> Anthony Arendt: Okay. This is an improvement. 
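The kind of PostGIS containment query described above, averaging gridded temperatures over a named polygon, might look roughly like the following sketch. The table and column names here are hypothetical, not the actual schema, and `ST_Contains` is the standard PostGIS containment function:

```python
def monthly_temperature_query(region_name):
    """Compose a PostGIS query for the mean monthly temperature of all
    climate-grid points falling inside a named polygon (e.g. a park
    boundary). Table and column names are hypothetical stand-ins.
    Returns (sql, params) for a parameterized execute() call.
    """
    sql = """
        SELECT date_trunc('month', g.obs_date) AS month,
               avg(g.temperature) AS mean_temp
        FROM climate_grid g
        JOIN regions r ON ST_Contains(r.geom, g.geom)
        WHERE r.name = %s
        GROUP BY month
        ORDER BY month;
    """
    return sql, (region_name,)

sql, params = monthly_temperature_query("Denali National Park")
# Would be executed with e.g. psycopg2: cur.execute(sql, params)
```

Passing the region name as a bound parameter rather than pasting it into the string is the usual way to keep such a query safe and reusable across regions.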
We can do these relational table searches. We can generate map based products. We are using Esri ArcGIS Server to serve that out to the community right now. One of the disadvantages of that is acquiring the proprietary licenses for working with those particular products. It's still kind of a one-way flow of information. One of the problems is that I become the database manager. I have to make sure I run the backups and keep the database from going down. I can't really scale it up or down based on the needs of the community or how many people are accessing it. And then if I want my colleagues to work on this, it hasn't been very successful yet. I have to teach them how to do SQL queries and make a connection to a database. It's not that difficult, but I just haven't gotten over that barrier yet. Here at MSR this is the summary of the kind of work that I've been doing. One is to build an API for direct access to our data, then investigate running our hydrological models on an Azure virtual machine, and then work on visualization tools and see what we can do in terms of visualizing the spatial and temporal variability of what we're working on. >>: Right now how does data get from your [indiscernible] to the Azure Center? >> Anthony Arendt: The raw data to the… The telemetered climate data, we just send a static IP address to our collaborators where the data is being generated, and then it's just a scheduled job that runs a Python script that imports that into the database directly. >>: Do you send it over 3G or something? >> Anthony Arendt: It's Iridium because it's out there and there's no, if we had 3G we certainly would try to use that, but most of Alaska… >>: [indiscernible] >> Anthony Arendt: Right. >>: What is the rate? >> Anthony Arendt: For that particular data, every two hours we get a dump of data that has 15 minute observations of temperature, wind speed and so on. Those are not huge data sets. 
But any other data, like the radar, we would then manually ingest into the database when we get back to the lab. It's not in real time. This is the work on the API that I'm going to talk about next. It's based on what Parker MacCready and folks here were working on with LiveOcean, and I just want to recognize that Nels and Rob did a lot of the foundation work; I sort of took the LiveOcean API and built that into what I'm calling ice2ocean. This is my schematic of that. The core of the work is this API, and we've changed things on this side a little bit. We still have this database with all of our data, and right now it's living on a virtual machine, and we're running our snow model scripts, or executables, also on that virtual machine. The output is sent to Azure blob storage, and then there are calls from the API to get whatever result comes out of the model and it's served out to multiple clients. I'm working on learning more about visualizations in the WorldWide Telescope. I don't have anything to show you on that yet, but I'm learning how to make use of those tools and that looks very promising as well. This is just the splash page of the ice2ocean API. It uses a web service to get, in this particular case here, we're asking for a runoff grid which is output from the snow model for a particular year, month and day, and then the API handles things like file generation, transport, reprojection, that kind of thing. That call would produce, in this case, an example of a PNG that would come back through the API. I don't have to worry so much about this side of things. We are working with collaborators at NASA who are hoping to then make calls to the API and build them into whatever fancy web interface they're developing. This is being put together using the Python Tools for Visual Studio. 
The advantage here is, again, a lot of the research community uses things like Python and Matlab, and this has allowed me to very quickly get into this and just take my scripts that already existed and build them into the API with very little effort. That's been great. It's using the Django web framework and it gets published directly out of Visual Studio to an Azure website. Looking under the hood of the API a little bit: a typical call to the API is get vector, so this would be the request coming in from the API call, and then I can just simply, like I said, cut and paste the queries that I have already written. This is the same query that I showed you earlier for the Denali data. My region, Denali, is a variable going into this query, and then I go to another function, called fetch polygon, that I'll show you in the next slide, to actually execute the query, and then the return through the API is the data set that comes up. Here is that other function called fetch polygon, and it's using a particular library that allows for a connection to any standard database. This would be the connection string to the database which, as I said, lives on the Azure virtual machine. You just set up a cursor to that connection, and then here's where you execute the query, pulling the data out of the database, and then it returns that back out to the API again. >>: Have you figured out how many machines on Azure? >> Anthony Arendt: How big is the… >>: The backend [indiscernible] the size that you are using Azure to store and serve all your data? >> Anthony Arendt: It's just a single virtual machine. I think I put like eight processors on it right now. But I haven't really played with any of that. Nobody is really hitting this with any queries yet. It's just sort of in development. >>: Is that a plan that you have, like, you know, so the general public can go look at it and see some visualization of [indiscernible] see what it all looks like? 
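The connect, cursor, execute, fetch sequence just described is the standard Python DB-API pattern. Here is a self-contained sketch of that pattern using sqlite3 in place of the PostgreSQL driver and a tiny in-memory table standing in for the Azure-hosted database; all names are illustrative, not the actual ice2ocean code:

```python
import sqlite3

def fetch_rows(conn_factory, query, params=()):
    """Execute a query through a cursor and return all rows: the same
    connect/cursor/execute/fetch shape as the API's fetch helper."""
    conn = conn_factory()
    try:
        cur = conn.cursor()
        cur.execute(query, params)
        return cur.fetchall()
    finally:
        conn.close()  # always release the connection

def make_demo_db():
    """In-memory stand-in for the database on the virtual machine."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stations (name TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO stations VALUES (?, ?)",
                     [("Wolverine", -4.5), ("Gulkana", -9.0)])
    conn.commit()
    return conn

rows = fetch_rows(make_demo_db,
                  "SELECT name FROM stations WHERE temp_c < ?", (-5,))
```

Because the DB-API is the same across drivers, only the connection setup changes when the real code talks to PostgreSQL instead.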
>> Anthony Arendt: This is what we would like to happen eventually, yeah. That's the goal here. I wasn't sure who the audience would be, but I have some questions or things that I have investigated that I wanted to just put up here. One is I did investigate migrating everything to an Azure SQL database. We are currently still running things on the PostgreSQL database which, as I said earlier, requires a lot more work on my part, so why not just put it right into an Azure database? The advantage here is really nice seamless integration with Visual Studio. The management of who can access the database is much cleaner through Azure, the web manager here. And then it's redundantly backed up for me already and there's geo-replication, so I don't have to worry about things like that. The challenge that I face in doing this is we have had to buy third-party software in order to do the migration, which hasn't been terribly smooth yet, and I haven't really been able to quickly spin up on that. That just needs more time, but it's been a bit of a barrier, and that might just be related to PostgreSQL not talking well to SQL Server. The other thing that I don't know much about yet is, when it comes to me having to pay for these resources out of my research grants, I have to think about the costs associated with having things on Azure. And I think you pay for how many queries people are hitting it with, so those are some of the costs and benefits I would have to consider. >>: Have you considered NoSQL solutions at all? >> Anthony Arendt: So that would be using the table, Azure table? >>: Yeah. Use Azure tables or something, or Hadoop or something like that? And you don't really have, you probably don't need database ACID semantics and stuff. 
I'm saying you just store the data in text files as it is, and have Hadoop jobs that you can write, including in Python and Java, that would basically read off these text files, put in the proper delimiters and execute the same query that you have [indiscernible]. >> Anthony Arendt: Is it as fast as something would be in SQL Azure? >>: [indiscernible] some of these cousins of the Hadoop framework, they allow for a lot of caching in memory and it could actually be faster. >> Anthony Arendt: Okay. And what about geospatial searching, finding things that are within a polygon? Does that framework accommodate that? >>: It is something you can build on top, but there are lots of optimizations otherwise in these NoSQL solutions, so there might be an alternate design. >> Anthony Arendt: Okay. I've heard about that and I just haven't gone down that path, but that is something I will do for sure. I want to talk a little bit about our modeling work. Here's the way things were set up before I got started here: we are working with colleagues at Oregon State University. They run the FORTRAN code, the snow model with all its subroutines, on their cluster and then manually upload it to our Azure blob storage. We've started to run those models on the Windows Server virtual machine, and we have a script that automatically sends the output to blob storage, and then it goes from blob storage to the API to the client. What I would ideally like to build into this in the future, and I welcome any ideas on how to do this well, is to actually allow the clients to request a model run for a specific time and area. Give me a bounding box and a certain time period, so that we can then run the model according to whatever call is made from the client. That would look like this, focusing in on that modeling side for a moment. 
We would actually go to the database to pull out the gridded climate data for that bounding box and that time, send that out to multiple instances of the virtual machine based on how many calls are made, and then that goes back out again. An application of this would be somebody living in the town of Valdez who is concerned about the amount of water that is coming off of this area, the potential flooding of this little bridge that crosses this huge glacial river. So what's the discharge hydrograph going to look like for a particular time period and a particular area of interest? That's one application of that idea. This is something I haven't implemented, but it's my understanding of how I would go about doing this. I learned right when I got here about the simulation runner, which was developed here in partnership with some other folks. It executes parameter sweep jobs, which I think fits well with our application where we're doing multiple parallel executions of a program, and it's implemented within Azure cloud services. I understand that the steps to do this are to create an affinity group, which is essentially a virtual network. It makes sure that all of the services are in the same location for optimizing that process. The storage account stores the actual cloud service, and then it uses SQL databases to store the information on the different executed jobs, and then it's deployed as an Azure cloud service. You can choose a cloud service package that lists how many nodes or processes you want to be associated with that. I haven't done this yet, but I understand that would be a good way to approach this problem. The final thing that I'll show you is where we want to go a step even further with it. I've shown you snow model and how it can be run over different time periods and different spaces. Another step to add in here, remember we have all this great information from our field work that we can use to better optimize model runs. 
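The client-driven model run described above comes down to subsetting the gridded climate data by a bounding box and a time window before dispatching it to virtual machine instances. A minimal sketch of that subsetting step, where the cell structure and field names are illustrative assumptions rather than the actual data model:

```python
def subset_grid(cells, bbox, t0, t1):
    """Select grid cells inside a (lon_min, lat_min, lon_max, lat_max)
    bounding box and a [t0, t1] time window.

    Each cell is a dict with 'lon', 'lat' and 'time' keys; this is an
    illustrative structure, not the real gridded climate format.
    """
    lon_min, lat_min, lon_max, lat_max = bbox
    return [c for c in cells
            if lon_min <= c["lon"] <= lon_max
            and lat_min <= c["lat"] <= lat_max
            and t0 <= c["time"] <= t1]

cells = [
    {"lon": -146.3, "lat": 61.1, "time": 5},   # near Valdez, in window
    {"lon": -146.3, "lat": 61.1, "time": 40},  # outside time window
    {"lon": -150.0, "lat": 64.0, "time": 5},   # outside bounding box
]
job_input = subset_grid(cells, (-147.0, 60.5, -145.5, 61.5), 0, 30)
```

Each such subset would then become one job for the parameter-sweep runner, so independent requests can fan out across however many instances are available.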
I'm representing that over here with what's known as assimilation; I talked about it earlier. Here's a model run, shown in this first line, that's uncorrected, and we know from our observations out in the field that the snowpack should have been here and should have gone away by this time. We do an additional iterative model run that fits it back to the observations, and through that iteration we actually get a better fit through the process of assimilation. The way that would look is, again, you have a call from the client for a time span and a bounding box. It pulls out the data that you need from the gridded climate products, and then it also grabs all of the other observations, from the fieldwork and other sources, that are within that bounding box and that time period, sends that to the snow assimilation subroutine, and then does an iterative run that allows us to really improve the fit and, again, sends that out to the blob storage location. That's the template, I think, for the assimilation kind of approach. >>: [indiscernible] sort of correcting 20 percent scale errors or 100 percent scale errors? >> Anthony Arendt: I think typically it's on the order of 20 to 30 percent, but I don't know if there is any limit on that. I think if you have errors that large then it becomes a little more nonlinear and the actual iterations might be more unstable, so I think you want to be making relatively small adjustments. If you're off by 100 percent to begin with then you should go back and bias-correct your initial input grids. This is more a matter of tweaking things to a particular basin or time period. We're almost done here. I wanted to just describe a future scenario a couple of years from now. Here are a couple of my colleagues out in the field. They have just taken some measurements of snow depth and some ground-penetrating radar from a snow machine.
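The iterative assimilation idea above, adjusting the model until it matches an observed melt-out date, can be illustrated with a toy sketch. The degree-day "model" and every number here are illustrative stand-ins, not the actual snow model subroutines; the point is only the shape of the iterative fit:

```python
# Toy sketch of assimilation: rescale the precipitation input until
# the modeled melt-out day matches an observed one from fieldwork.

def melt_out_day(precip_scale, daily_snow=5.0, melt_rate=8.0,
                 accum_days=100, n_days=250):
    """Day on which a toy snowpack (mm water equivalent) disappears."""
    swe = 0.0
    for day in range(n_days):
        if day < accum_days:
            swe += daily_snow * precip_scale   # accumulation season
        else:
            swe = max(0.0, swe - melt_rate)    # melt season
        if day >= accum_days and swe == 0.0:
            return day
    return n_days

def assimilate(observed_day, tol=1, max_iter=50):
    """Bisection on a precipitation scale factor -- a stand-in for the
    model's iterative fit back to the field observations."""
    lo, hi = 0.1, 3.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        modeled = melt_out_day(mid)
        if abs(modeled - observed_day) <= tol:
            return mid
        if modeled < observed_day:
            lo = mid   # too little snow: melts out early, scale up
        else:
            hi = mid   # too much snow: lingers too long, scale down
    return mid

scale = assimilate(observed_day=160)
print(round(scale, 2))
```

This is the "relatively small adjustments" regime from the question above: the correction converges because a modest rescaling moves the melt-out date smoothly; a 100 percent error would call for bias-correcting the input grids first.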
Maybe they have uploaded that to our SQL database, and then they make a call to the API for a simulation of what the snowpack is going to look like next week. You can actually do that using weather forecast data: instead of using historical climate products you can drive it just like any weather forecast. That would run through the snow model. It would come back an hour later. Maybe they would have an app on their phone that could say the snowpack is going to look like this. They can communicate with the heli-skiers and think about whether that is going to raise the likelihood of avalanches, or talk with the city of Valdez about the likelihood of flooding this year. That happened a couple of years ago: the whole road got washed out and things were shut down. That's, to me, a nice future model of how things could look, and I think we have the capability to do something like that with resources dedicated to it. I'll just wrap up with a few summary points. I think these cloud resources are opening up a lot of opportunities for collaborative research and real-time delivery of data to our stakeholders. Two main ways that we can improve things quickly are doing more science using relational databases and putting our computations into virtual machines; I think that can help us get a lot of our work accomplished. I really like this trend of Microsoft integrating with Python and other scripting languages, which has allowed me to get up to speed very quickly. That is what a lot of academics are using, so I think it will help drive greater adoption of the MSR tools. Thank you. Thanks a lot. I appreciate it. [applause] >>: Can you comment on scientists using relational databases? Have you seen anybody go through that conversion process? That is the first part. It is a two-part question.
And the second would be, supposing that it all works and the community just kind of went off and decided they were all going to learn relational databases, how would that transform the science? >> Anthony Arendt: At least in my field a lot of my colleagues don't really even speak that language. I think it's a different way of thinking. It varies a lot by discipline, and I think that people in geophysics tend to come more from a mindset of I'm writing a Fortran program, and there is no understanding of how data can be linked together through a database. I think they don't realize that they are using a relational database when they use a GIS; that's basically what that is, a front end to a relational database. At least in my career I don't see a lot of even just basic understanding of what could be done. When I show what we are doing people get really excited, but they don't want, or don't have the time, to make that extra step, because it is a big task getting everything into a standardized structure like that. But perhaps some of these other resources we talked about could help minimize that or make it an easier step. >>: [indiscernible] sample data [indiscernible] that we can get a [indiscernible] >> Anthony Arendt: Sure, you mean with the data that I have been showing here? Yeah, absolutely. >>: Make it so that we have, like I was telling you about, that thing that we are working on on this [indiscernible]. We have a system that is going. [indiscernible] and if there is some sample data and prototype queries we can try to run it to see how it… >> Anthony Arendt: I think that would be great. Maybe we could compare how quickly the queries run. Yeah, I'd be happy to. >>: [indiscernible] >> Anthony Arendt: Yeah. Let's do that. That would be great. I have plenty of different data sets that we could use, no problem.
As far as transforming things, I did recently get a couple of my colleagues to adopt this; in particular, one of the postdocs we are working with really took up the PostgreSQL database framework, and the other co-authors on this paper were just kind of blown away by how quickly we could investigate things. We really don't have to sit and write a huge script to do things. It enables a quick look at our data, and I think it would vastly improve our ability to collaborate and relate data together. I mean, it's not the only answer to this problem. >>: No, it's just sort of the most evolved, in a sense. It's been around for the longest time, and so we see efforts at the University of Washington with eScience, with Bill Howe working with the oceanographers to say let's all learn SQL; that's what SQLShare is about. It's a variant of that theme. You don't want to say everybody has to learn left and right joins, but you do want everybody to have an awareness of the space and deal with that. Having that available just puts momentum behind the effort. >> Anthony Arendt: I agree. >>: There's a kind of intermediate stage in between using SQL directly, let's say SQL Server and user queries, and old-school flat files like you were talking about, which is the type of API you are talking about with LiveOcean and ice2ocean: to have the ability to format stuff in terms of supported API queries. This is something you might be able to do. And then as we go through and do that we try to build examples of an external app that generates those queries. And then there's an even closer-to-SQL step where you say, in my API interface I'm going to support a grammar, which is more difficult to program but is potentially closer in flexibility. So there is this whole dial between an old-school flat-file script and select star from. Anyway.
>> Anthony Arendt: The API would accept SQL grammar, is that what you are saying? >>: The example I was discussing with my other collaborators yesterday was with this data system that we're building for dissolved organic matter, which is actually related to this. The discussion about queries got so complicated that they finally sort of threw up their hands and said we really need to create a logical grammar from which you can construct queries. >> Anthony Arendt: I see. >>: And so it's got predicates and it's got things that are more SQL-looking than just API calls with a main parameter and sub-qualifiers. Anyway, it's a longer discussion, but it's great to have you having this discussion. >>: I actually have a meeting at three. >> Anthony Arendt: Yeah. Can I have your contact information? That would be great.
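The "logical grammar" idea from this closing discussion, an API that accepts structured predicates and composes them into queries rather than taking raw SQL, could be sketched as follows. The table and column names are hypothetical, chosen only for illustration:

```python
# Toy predicate grammar: clients pass structured predicates; the
# service composes them into one parameterized SQL statement.

ALLOWED_OPS = ("=", "<", ">", "<=", ">=")

def pred(column, op, value):
    """One predicate in the grammar, e.g. pred('lat', '>=', 60.0)."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    return (column, op, value)

def build_query(table, predicates):
    """Compose predicates into a parameterized query plus its values,
    so the values are bound by the driver rather than interpolated."""
    where = " AND ".join(f"{col} {op} ?" for col, op, _ in predicates)
    params = [value for _, _, value in predicates]
    return f"SELECT * FROM {table} WHERE {where}", params

sql, params = build_query("snow_obs", [
    pred("lat", ">=", 60.0),
    pred("lat", "<=", 61.5),
    pred("obs_date", ">=", "2014-03-01"),
])
print(sql)
print(params)
```

Restricting clients to a small predicate vocabulary sits partway along the "dial" described above: more flexible than a fixed set of API parameters, but far safer and simpler than accepting arbitrary SQL.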