>> Rob Fatland: Good afternoon. Our speaker today is Doctor Anthony Arendt from the
University of Alaska, Fairbanks Geophysical Institute and he is a career glaciologist who will be
addressing us today on his work here at Microsoft research as a visiting researcher working
with Azure and with his modeling framework. With that, Anthony.
>> Anthony Arendt: Thanks very much, Rob. It's really a pleasure to be here today and talk
to you about the work that we're doing. I wanted to quickly acknowledge the invitation from
Rob to come here as a visiting researcher. This has been a really incredible opportunity for me
to work with Microsoft resources and really apply them to the science that we are developing at the
University of Alaska Fairbanks. I'm also really grateful for the Microsoft Azure for Research grant
which is supporting the use of cloud resources that will be the basis of what I'll be showing you
today. A lot of this research is supported by a wide range of funding agencies, most of which
are shown here. A lot of collaborators, students and technicians, too many to list, but I want to
just recognize that this research is a very collaborative effort and it comes from a lot of different
folks at multiple agencies. Today I want to cover a few different topics that are summarized
here. The first is to give you a brief overview of the scientific problems that we are addressing
in our research and a summary of the different kinds of data sets and I want you to get a sense
of the complexity of these different data sets in space and time and how this calls for a kind of
different computational infrastructure than we have had in the past. I want to talk about the
methods that we have been developing in Fairbanks over the past couple of years, sort of our
academic approach to these problems and the infrastructure that we built around that. And
then really some of these incredible tools that I've learned just in the past couple of months
here at MSR that are really broadening those tools and allowing us to accomplish really
innovative and collaborative science which is the goal of all of our efforts here. Let's just jump
into a bit of the science just to give you a bit of the flavor of the kinds of things that we're
looking at. We are studying the Gulf of Alaska system. This is a large band of glaciers and ice
caps that surround the Gulf of Alaska and this is some of the largest ice areas outside of the
Greenland and Antarctica ice sheets. The reason we've got so much ice in this area is that there
are some very large mountain ranges. There are seven different mountain ranges in Alaska and
they act as barriers to these massive storms that come in off of the gulf and deposit large
amounts of precipitation in the form of snow. They get upwards of six or seven meters some
years in some of these high mountain areas, and that drives the formation of glaciers that then flow
down to the ocean and discharge that ice into the ocean. When you have all that ice and snow
close to a coastal marine environment, there are a lot of complex interactions that we have to
think about. One of the biggest things that we're concerned with is the fact that there's a vast
amount of fresh water stored in these glaciers, and as the climate is changing this is being
released into the oceans and this is a major source of global sea level rise. Alaska contributes
somewhere around 7 percent to the global signal of sea level rise, which is very high
considering how much ice is there relative to the ice sheets. Furthermore, we have to think
about all this extra freshwater getting into streams and lakes and out into the coastal fjord
areas as that freshwater which has a different chemical and temperature signature than the
surrounding waters enters these areas. It will have impacts on things like the ability of salmon
to spawn, the kinds of terrestrial succession that occurs and then the overall ecology of these
fjord areas will really change rapidly as the climate varies and we get more and more of this
water getting into these fjord environments. We get different salinity and temperature of
these coastal areas that causes changes in the coastal current as you get further out into the
ocean. What we have here is really interdisciplinary and complex data sets for which we need a
computational infrastructure that allows us to do collaborative and across disciplinary research.
A little bit more about the region that we are studying. This is in black, the Gulf of Alaska
watershed. It has about 27,000 glaciers, somewhere around 87,000 square kilometers. That's
about 4 percent of the area of the state of Alaska. The annual discharge from this area is
somewhere on the order of the discharge of the Mississippi River, so very high fluxes of water
are passing through this area every year and we want to determine how this is changing over
time. I'm going to take you through just a few of the data sets that we work with. Again, I want you to
think about the complexity of these data sets and how they vary in terms of whether they are
vector data or raster data and how we can put together an infrastructure to handle all of this
information. The first tier is just basic inventory data which is where the glaciers are located.
These are really high density vector polygons that tell us the location of the ice in this particular
slide at two different times. The yellow is areas where the ice has retreated since the 1950s
and red is areas where there has been a bit of advance. We need this basic information
before we can run any of our models and do any simulations. We have point and line data that
is, for example, collected when we just put a ground penetrating radar underneath a helicopter
and we fly along and we use that to determine the travel time and the distance then to the
interface between the snow and the ice. That tells us the depth of the snow pack for that
particular year. And there's an example of us flying on a glacier near the Cordova region taking
these measurements at a high density of point observations, and it gives us, you know, traces
like this from which we can determine the layering of the snowpack and pick out that surface
where we think the summer surface from the previous year is, and determine how much snow is
there, and then that is used to determine the freshwater available for that particular season.
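(A minimal Python sketch of that travel-time-to-depth conversion, assuming a dry-snow relative permittivity of about 1.9; the numbers and function name are illustrative, not the production picking code.)

C_VACUUM_M_PER_NS = 0.2998  # speed of light in metres per nanosecond

def snow_depth_m(two_way_travel_time_ns, rel_permittivity=1.9):
    # Radar wave speed in dry snow, for an assumed typical relative permittivity.
    wave_speed = C_VACUUM_M_PER_NS / rel_permittivity ** 0.5
    # The recorded travel time is two-way, so halve it to get the one-way depth.
    return wave_speed * two_way_travel_time_ns / 2.0

print(snow_depth_m(40.0))  # a 40 ns return corresponds to about 4.35 m of snow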
Go ahead.
>>: When you fly over how do you get a picture of anything underneath the surface?
>> Anthony Arendt: It's a 500 MHz radar and it basically penetrates through the snow and
wherever you have a discontinuity where there's ice it will show up like that. This is actually up
in the higher accumulation area where we get multiple layers. It's called the firn zone. We
have to carefully trace that line upwards and there are tracing algorithms that help us do that
as well. We also get time series information from things like these high-altitude weather stations. Alaska is
very data poor in the high mountain regions. Most weather stations that we use to run our
models are located down near towns where people live so we install these stations and we get
often real-time satellite telemetry feeds of temperature, precipitation and so on and that is
another form of data that we are using in our analysis. Finally, a lot of remote sensing
information, in this particular case we are looking at data from what's called the gravity
recovery and climate experiment. And this is a really fascinating mission. It's two satellites and
we're measuring the inter-satellite range which tells us the influence of gravity on the orbits of
these satellites. From that we can determine temporal variations in the Earth's gravity. Here's
a map of the overall net loss of mass from different regions around the globe and you
can see it really maps into where we are seeing massive amounts of glacier ice loss. This is the
time series for Alaska. You see the overall trend of mass change and then the seasonal cycle
from which we can determine that runoff component. Again, Alaska is somewhere around 7
percent of the sea level signal.
>>: [indiscernible]
>> Anthony Arendt: That's the net change in ice over a year, so 65 gigatons of water is leaving
that storage, the ice and going to the ocean.
>>: [indiscernible] by the year?
>> Anthony Arendt: Why is it?
>>: It's going, there is lesser water that's being released in 2011 than in 2003?
>> Anthony Arendt: This is the cumulative mass, so we start at an arbitrary zero. We aren't
actually measuring the actual mass of the glacier. We're saying relative to where it was in 2003
this is how much mass it has, accumulated mass change, basically. So it's summertime, it's
losing mass. Wintertime it gains some mass and then because of that trending downwards the
net change of storage is around this value here.
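(To make the cumulative-anomaly idea concrete, here is a small Python sketch with made-up numbers, not GRACE values: the slope of a straight line fit through the cumulative monthly series is the net storage change per year, regardless of the seasonal cycle riding on top of it.)

import numpy as np

months = np.arange(120)  # ten years of monthly samples
# Placeholder series: a seasonal cycle superimposed on a downward trend (Gt).
mass_gt = -5.4 * (months / 12.0) + 20.0 * np.sin(2 * np.pi * months / 12.0)

slope_per_month, _ = np.polyfit(months, mass_gt, 1)  # least-squares line
print(f"net storage change: {slope_per_month * 12:.1f} Gt per year")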
>>: When that gets down and flattens out the ice is all gone and now you know how much ice
there was in the beginning.
>> Anthony Arendt: True.
>>: What is the highest contribution? Is it the Antarctic thing or is it Greenland?
>> Anthony Arendt: Greenland right now is the highest and what's happening here is not only
just the melting of the ice but a big component of the loss is because of this calving, the
breaking off of ice into the ocean. We call it dynamic mass losses. That's linked to the fact that
the oceans are warming up and kind of eroding away where the ice is meeting the ocean and
then you get potentially very catastrophic losses of mass in those areas.
>>: What percentage do you give Greenland credit for?
>> Anthony Arendt: I think it's around 25 to 30 percent. I'd have to look that up. Our goal is to
model this process as much as we can, to reconstruct past conditions, learn about the system
and also we want to be able to run future forecast models, say where is the ice going to
disappear in the next hundred years and those kinds of things. There's a lot of ways to do this.
Go ahead.
>>: This is a fairly narrow set of latitudes in which pretty much most of the melting is
happening. Why is it that the other latitudes are not affected that much in terms of the ice
melting? [indiscernible] goes around the Arctic, right, Russian Siberia?
>> Anthony Arendt: So I cut the map off a little bit, but you're saying why is the loss focused on
the…
>>: Why is it not focused more on the [indiscernible] as opposed to the other latitudes?
>> Anthony Arendt: Right. Okay. I understand. The biggest players are Patagonia, Alaska,
Canadian Arctic as far as the non-ice sheet areas. Also the Himalayas, but it's off the map. I just
cut it off, but it's not as big a signal. There's not quite as much ice there to begin with and also
these areas are more susceptible because, like I was saying earlier, they are close to the ocean
and are more sensitive to changes in the climate because of higher precipitation rates and they
have ice that actually goes into the ocean and they are sensitive to these dynamic losses as
well. Really, there is not as much ice at middle latitudes.
>>: You mentioned Siberia, for example. Siberia is this vast plain but it's not like big huge
reservoirs and glaciers as much as it is just frozen ground.
>>: So the main thing is, is it just frozen ground or is it meeting the water?
>>: I think what Anthony is saying is when you have accumulated ice in glaciers, big huge
masses of ice that are facing into the ocean and they are warm that's what's happening in
Alaska. You have this mechanism in place for transport of water inside the ocean. Whereas, in
Siberia my understanding is you don't have these masses of ice. There are other things going
on there that are related to climate change, but not so much water transported to the ocean.
>> Anthony Arendt: This is the net loss; this is the trend in mass, right? That's this sort of red
line mapped onto here. Siberia may still have seasonality, but it's not going to be trending so
heavily because it just doesn't have all that ice that can be taken out of storage. I'll tell
you a little bit about our modeling efforts. This is an energy balance model approach and it
formally calculates the fluxes of sensible and latent heat and solar radiation. Incoming solar radiation is
the main factor that drives the melting of snow and ice on a glacier surface. There are a lot of
models to do this. We're using one called SnowModel, made by Glenn Liston and others. It has
these subroutines that calculate everything from the surface energy balance shown in the picture
here, then the redistribution of snowpack due to wind, then the interpolation of climate data, so
we're using these fairly coarse grid climate products and we have to downscale them to an
individual watershed to get better estimates at an individual location, and then the routing of
the water: when we generate some melt at a given location, how does that get distributed
across the landscape and that uses a simple linear reservoir approach. Here's a picture of how
that works. The input to the model will be things like a land cover mask and that's where we go
back to knowing where the glaciers are located with those polygons. These are examples of
those gridded climate products. We're using reanalysis data which is kind of, are you familiar
with reanalysis? It's like a weather forecast model but applied to past conditions, so it
calculates all the physics of the atmosphere, but it also assimilates in observations at the
stations to give you a best estimate of the past 30 years of climate or so. That gets fed into the
model and then we get things like an estimate of the water balance at our location of interest.
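(The routing step is simple enough to sketch; here is a toy Python version of a single linear reservoir, with a placeholder storage constant, not SnowModel's actual subroutine: outflow is proportional to storage, which gives the smoothed, delayed hydrograph you expect downstream of a melting glacier.)

import numpy as np

def route_linear_reservoir(melt_mm, k_days=5.0, dt_days=1.0):
    # Outflow is proportional to storage (Q = S / k); k_days is a placeholder
    # storage constant, not a calibrated value.
    storage = 0.0
    discharge = np.zeros(len(melt_mm))
    for i, melt in enumerate(melt_mm):
        storage += melt * dt_days        # add this step's melt to storage
        outflow = storage / k_days       # linear reservoir outflow
        storage -= outflow * dt_days     # drain the reservoir
        discharge[i] = outflow
    return discharge

# A ten-day melt pulse comes out as a delayed, smoothed discharge curve.
print(route_linear_reservoir([0, 10, 20, 15, 5, 0, 0, 0, 0, 0]))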
So this is an animation of the kind of output that we get. We're starting in the spring of 2004
and blue is where there is a net, this is a cumulative change in mass of the water. Blue is where
we are getting accumulations, these high mountain areas. And you'll start seeing some red
emerging and that's where there is a net loss of mass and that will be the low elevation parts of
the glaciers. We're getting into July and then it's going to stop here soon in August. You'll see
this is a net mass balance of this entire region. Losses at low elevations, accumulations up high
and we can use that to inform a lot of our other estimates and comparisons to the GRACE data.
Let's step back. I've shown you that there is point, line, polygon and gridded data in multiple bands
when we look at satellite imagery. We have varying spatial and temporal resolution. We needed
infrastructure to handle that. Go ahead.
>>: It's interesting, there are two points with substantial red right on the top left or something.
Those seem like the points with the most substantial red. Most of the other reds are closer to
the water.
>> Anthony Arendt: Yeah. This is the Alaska Range where, you are right. I picked this year. It's
a record warm year. That was a year where the melt extended all the way up to the top. There
was very little accumulation. Those glaciers were really hit hard that year. Are you asking that
with respect to what I was saying about glaciers near the ocean? Yeah. That wasn't a hard and
fast rule. These interior glaciers also are especially susceptible to the summer temperature,
whereas, here precipitation variability can play a major role as well. Things can really fluctuate
dramatically here. But you get a really warm summer and those can be hit quite hard as well.
In terms of sea level we don't, we are not as concerned about smaller glaciers. There are a few
up in the Brooks Range as well. These are the main players because they are so large and have
potential to add a lot to the ocean. But that is a good observation. Getting into more of the
computational part of the talk, what do we need to handle this information? We need,
everything I've shown you has a geospatial encoding to it, so we need to be able to do spatial
querying in order to accomplish that. We need to be able to route our data iteratively. Some
of our models require that we send the model output back in and do iterative model runs,
direct access for our collaborators, not just within our Institute, large storage capacities and
multiple processors to run these large models. I'll go through this quickly. I wanted to outline
my different views of the computational infrastructure that scientists tend to use. This is my
version of what I think typically happens, at least in my field on the academic side: a
researcher will work on an individual desktop and then typically we're using things like Python
and Matlab, FORTRAN code usually for the larger simulations, generate a lot of flat files. Our
raw data will be in tabular or flat file format and then if we're lucky that will get uploaded to
some kind of national data center and that's required by most of our funding agencies. And
then sent out from a web server there to multiple web clients. Let me just quickly go through
what I think is good and bad about this. This is a model that works okay for the individual
scientist who knows where all his files are, but there are disadvantages because these flat files
would require us to write a lot of elaborate scripts to do any kind of spatial relationships and it's
hard for us to collaborate with each other under that model.
>>: [inaudible] into a database or is it stored in a non-relational [indiscernible]
>> Anthony Arendt: It's stored as a text file, literally, so it's really not a great way to do things at
all. Maybe, you know, some standardized…
>>: [inaudible]
>> Anthony Arendt: It gets sent, that flat file up to a server somewhere.
>>: [indiscernible] then it's just not multithreaded, right?
>> Anthony Arendt: No. It's not, not at all. I mean, speaking in general terms here. One thing
that's good about this I think is that the data tend to be in these national servers and labs, more
stable and longer-term. This is a question that came up in a talk I gave at the University of
Washington recently. I'll be telling you about the Azure solutions that we are developing, but
what happens to that of the long-term? Is it as stable as something like some of these snow
and ice data centers where we send our data? They tend to have fairly good version control and
metadata requirements, but the main point here is that there is sort of a one-way flow of
information out to the client, and if I do some reanalysis or change some of the
products that I'm working with, I have to go and submit it again here and it might not get out to
the end-user as easily as I'd like. This is the model in my lab that I've been trying to develop, which I
think is a bit of a step forward over the last couple of years. The core of it is to create a
relational database for all of our work. We still write scripts to call into that and to do our
analysis. If we're doing any modeling we run that off, you know, an institute
supercomputing cluster, and then I have a local machine right in my office that is a web server
in which this database lives and then that gets served up through my institute web server and
out to the community that way. I really think a relational database is fantastic for our work and if
more of us put things into this format we would be able to advance quickly, I think. This is just
an example of some of the tables here. There are the detailed polygons. Here's some climate
data, a time series of weather stations, stream discharge, that kind of thing. We can join and
relate all of this information together by using standardized relational queries. You probably
know this really well. I'm just showing you an example of a SQL query. I'll run through it very
quickly. I just want to show quickly what you can achieve with this. This is an example of
simply saying, I want to know what is the average temperature in Denali National Park. There's
the polygon outline for that, and with a very simple SQL query like that I can get, with just a couple
of lines, the time series of the monthly data there.
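(Here is a hedged sketch of what a query like that looks like when called from Python with psycopg2; the table and column names are placeholders rather than the actual schema, but the PostGIS ST_Intersects join is the real mechanism for "give me every grid cell inside this polygon".)

import psycopg2

# Placeholder connection settings and table/column names, for illustration.
conn = psycopg2.connect(host="localhost", dbname="ice2ocean", user="reader")
cur = conn.cursor()

# Monthly mean temperature over every grid cell whose geometry intersects
# the Denali National Park polygon, joined through a PostGIS spatial test.
cur.execute("""
    SELECT date_trunc('month', t.obs_date) AS month, avg(t.temp_c)
    FROM temperature_grid AS t
    JOIN park_outlines AS p ON ST_Intersects(t.geom, p.geom)
    WHERE p.name = %s
    GROUP BY month
    ORDER BY month;
""", ("Denali National Park",))

for month, mean_temp in cur.fetchall():
    print(month, mean_temp)
conn.close()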
>>: [indiscernible]
>> Anthony Arendt: That sample is 2 kilometer grids over the state of Alaska, so it's not huge.
>>: [indiscernible]
>> Anthony Arendt: Yeah, gigabyte size. We're using PostGIS. This is really where the
functionality comes in being able to do queries where the geometry is contained in another
geometry. Do you work with SQL Server a lot?
>>: I work with SQL. [indiscernible] they have some of the query languages though.
>> Anthony Arendt: I'm going to talk a bit later. I tried to migrate over to SQL Server and I'm
investigating whether the spatial querying abilities are the same as what I have with PostGIS.
They are largely the same but there are some differences that are making that process a little…
>>: [indiscernible]
>> Anthony Arendt: Okay. This is an improvement. We can do these relational table searches.
We can generate map based products. We are using Esri ArcGIS Server to serve that out to the
community right now. One of the disadvantages of that is acquiring the proprietary licenses
to work with those particular products. It's still kind of a one-way flow of information.
One of the problems is that I become the database manager. I have to make sure I run the backups
and keep the database from going down. I can't really scale it up or down based on the needs
of the community or how many people are accessing it. And then if I want my colleagues to
work on this it hasn't been very successful yet. I have to teach them how to do SQL queries,
make a connection to a database. It's not that difficult, but I just haven't gotten over that
barrier yet. Here at MSR this is the summary of the kind of work that I've been doing. One is to
build an API for direct access to our data and then investigate running our hydrological models
on Azure virtual machine and then work on visualization tools and see what we can do in terms
of visualizing spatial and temporal variability of what we're working on.
>>: Right now how does data get from your [indiscernible] to the Azure Center?
>> Anthony Arendt: The raw data to the… The telemetered climate data is just, we just
use a static IP address at our collaborators' site where the data is being generated, and then it's
just a scheduled job that runs a Python script that then imports that into the database
directly.
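(A rough sketch of that scheduled ingest; the logger address, file layout and table are assumptions for illustration: every couple of hours the script fetches the latest dump from the collaborators' static address, parses the 15-minute records and appends them to the database.)

import csv, io, urllib.request
import psycopg2

STATION_URL = "http://198.51.100.10/latest.csv"      # placeholder logger address
DSN = "host=localhost dbname=ice2ocean user=ingest"  # placeholder connection

def ingest_latest():
    # Fetch the newest telemetry dump (assumed to be CSV with no header row).
    with urllib.request.urlopen(STATION_URL, timeout=60) as resp:
        rows = list(csv.reader(io.TextIOWrapper(resp, encoding="utf-8")))
    # Append the 15-minute records; the with-block commits on success.
    conn = psycopg2.connect(DSN)
    with conn, conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO station_obs (obs_time, temp_c, wind_mps) VALUES (%s, %s, %s)",
            rows)
    conn.close()

if __name__ == "__main__":
    ingest_latest()  # scheduled every two hours, e.g. from cron or Task Scheduler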
>>: Do you send it over 3G or something?
>> Anthony Arendt: It's Iridium because it's out there and there's no, if we had 3G we certainly would
try to use that, but most of Alaska…
>>: [indiscernible]
>> Anthony Arendt: Right.
>>: What is the rate?
>> Anthony Arendt: For that particular data every two hours we get a dump of data that has 15
minute observations of temperature, wind speed and so on. Those are not huge data sets. But any
other data like the radar we would then manually ingest into the database when we get
back to the lab. It's not in real time. This is the API work I'm going to talk about next. It's
based on what Parker MacCready has been working on here with LiveOcean, and I just want to recognize
that Nels and Rob did a lot of the foundation work and I sort of took the LiveOcean API and just built
that into what I'm calling ice2ocean. This is my schematic of that. The core of the work is this
API and we've changed things on the site a little bit. We still have this database and it has all of
our data, and right now it's living on a virtual machine and we're running our SnowModel scripts,
or executables, also on that virtual machine. The output is sent to Azure blob storage and then
there are calls from the API to get whatever result comes out of the model and it's served out to
multiple clients. I'm working on learning more about visualizations in the WorldWide
Telescope. I don't have anything to show you on that yet. But learning how to make use of
those tools and that looks very promising as well. This is just the splash page of the ice2ocean
API. It uses a web service to get, in this particular case here we're asking for a runoff grid which
is output from the snow model for a particular year, month and day and then the API handles
things like image generation, file transport, reprojection, that kind of thing. That call would
produce in this case an example of a PNG that would come back through the API. I don't have
to worry so much about this side of things. We are working with collaborators at NASA who are
hoping to then make calls to the API and build them into whatever fancy web interface that
they're developing. This is being put together using the Python Tools for Visual Studio. The
advantage here is, again, a lot of the research community uses things like Python and Matlab
and this has allowed me to very quickly get into this and just take my scripts that existed and
build them into the API with very little effort. That's been great. I'm using a Django web
framework and it gets published directly out of Visual Studio to an Azure web site. Looking
under the hood of the API a little bit, so a typical call to the API is get vector and so this would
be the request coming in from the API call and then I can just simply, like I said, cut and paste
the queries that I have already written. This is the same query that I showed you earlier for the
Denali data. My region, which here is Denali, that's a variable going into this query, and then I go to
another function that's going to be called fetch polygon that I'll show you in the next slide to
actually execute the query and then the return through the API is the data set that comes up.
Here is that other function called fetch polygon and it's using a particular library that allows for
connection to any standard database. This would be the connection string to the database
which, as I said, lives on the Azure virtual machine. Just set up a cursor to that connection and
then here's where you execute the query, pulling the data out of the database, and then it returns
it back out to the API again.
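(Pulling those two pieces together, here is a hedged sketch of what such a Django view and helper can look like; the connection string, URL and schema are placeholders rather than the actual ice2ocean code, but the shape matches what was just described: the view builds the SQL, the helper runs it against the Postgres instance on the VM, and the rows go back out as JSON. A client, like the NASA interface, would then just issue an ordinary HTTP GET against that endpoint.)

# views.py -- illustrative only; names, schema and connection are assumptions.
import json
import psycopg2
from django.http import HttpResponse

DSN = "host=example.cloudapp.net dbname=ice2ocean user=api"  # placeholder

def fetch_polygon(query, params):
    # Open a connection to the database on the Azure VM, run the query,
    # and hand the rows back to the calling view.
    conn = psycopg2.connect(DSN)
    try:
        cur = conn.cursor()
        cur.execute(query, params)
        return cur.fetchall()
    finally:
        conn.close()

def get_vector(request):
    # Handle e.g. /getvector?region=Denali+National+Park by running the same
    # spatial query shown earlier and returning the monthly means as JSON.
    region = request.GET.get("region", "Denali National Park")
    query = """
        SELECT date_trunc('month', t.obs_date)::text, avg(t.temp_c)::float
        FROM temperature_grid AS t
        JOIN park_outlines AS p ON ST_Intersects(t.geom, p.geom)
        WHERE p.name = %s
        GROUP BY 1 ORDER BY 1;
    """
    rows = fetch_polygon(query, (region,))
    return HttpResponse(json.dumps(rows), content_type="application/json")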
>>: Have you figured out how many machines on Azure?
>> Anthony Arendt: How big is the…
>>: The backend [indiscernible] the size that you are using Azure to store and serve all your
data?
>> Anthony Arendt: It's just a single virtual machine. I think I put like eight processors on right
now. But I haven't really played with any of that. Nobody is really hitting this with any queries
yet. It's just sort of in development.
>>: Is that a plan that you have like, you know, so the general public can go look at it and see
some visualization of [indiscernible] see what it all looks like?
>> Anthony Arendt: This is what we would like to happen eventually, yeah. That's the goal
here. I wasn't sure who the audience would be, but I have some questions or things that I have
investigated that I wanted to just put up here. One is I did investigate migrating everything to
Azure SQL Server database and we are currently still running things on the PostgreSQL database
which I said earlier requires a lot more work on my part, so why not just put it right into an
Azure database? The advantage here is really nice seamless integration with Visual Studio. The
management of who can access the database is much cleaner through Azure, the web manager
here. And then things like it's redundantly backed up for me already and there's geo-replication,
so I don't have to worry about things like that. The challenges that I face in doing
this are that we have had to buy third-party software in order to do the migration, which hasn't been
terribly smooth yet and I haven't really been able to quickly spin up on that. That just needs
more time, but it's been a bit of a barrier and that might just be related to Postgres not talking
well to SQL Server. The other thing that I don't know much about yet is when it comes to me
having to pay for these resources out of my research grants I have to then think about the costs
associated with having things on Azure. And I think you pay for how many queries people are
hitting it with and so those are some of the cost benefits I would have to consider.
>>: Have you considered NoSQL solutions at all?
>> Anthony Arendt: So that would be using the table, Azure table?
>>: Yeah. Use the Azure table or something, or Hadoop or something like that? And you don't
really have, you probably don't need database ACID semantics and stuff. I'm saying you could just
store the data in text files as it is and write Hadoop jobs, which you can write in
Python and Java, that would basically read off these text files, put in the proper delimiters and
execute the same query that you have [indiscernible] created.
>> Anthony Arendt: Is it as fast as something would be in the SQL Azure?
>>: [indiscernible] some of these cousins of the Hadoop framework, they allow for a lot of
caching in memory and it could actually be faster.
>> Anthony Arendt: Okay. And what about geospatial searching, finding things that are within
a polygon? Does that framework accommodate that?
>>: It is something you can build on top, but there are lots of optimizations otherwise in these
NoSQL solutions, so there might be an alternate design.
>> Anthony Arendt: Okay. I've heard about that, I just haven't gone down that path,
but that is something I will do for sure. I want to talk a little bit about our modeling work.
Here's the way things were set up before I got started here and that is we are working with
colleagues at Oregon State University. They run the FORTRAN code, the snow model with all its
subroutines, on their cluster and then manually upload it to our Azure blob storage. We've started to
run those models on the Windows Server virtual machine and then have a script that just
automatically sends it out to blob storage and then from blob storage to the API to the client.
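(The upload step from the VM is a short script; here is a minimal sketch using the azure-storage-blob Python package as an assumption about the SDK, with placeholder container and file names.)

import os
from azure.storage.blob import BlobServiceClient

# Placeholder connection string, normally kept out of the code.
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONN"])
container = service.get_container_client("snowmodel-output")

def push_output(local_path):
    # Send one SnowModel output file from the VM to blob storage, where the
    # API picks it up and serves it to clients.
    with open(local_path, "rb") as data:
        container.upload_blob(name=os.path.basename(local_path), data=data,
                              overwrite=True)

push_output("output/runoff_grid_2004_07_15.bin")  # placeholder file name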
What I would ideally like to build into this in the future, and I welcome any ideas on how to do
this well, is to actually allow the clients to request a model run for a specific time and
area. Give me a bounding box and a certain time period so that we can then run the model
according to whatever call is made from the client. That would look like this, which is
focusing in on that modeling side for a moment. We would actually go to the database to pull
out the gridded climate data for that bounding box and that time, send that out to multiple
instances of the virtual machine based on how many calls are made and then that goes
back out again. An application of this would be somebody living in the town of Valdez and
they are concerned about the amount of water that is coming off of this area, the potential
flooding of this little bridge that crosses this huge glacial river. So what's the discharge
hydrograph going to look like for a particular time period and a particular area of interest?
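(In rough outline, a request like that could be handled as sketched below; everything here, the bounding-box query, the table names, the dispatch stub, is an assumption about how it might be wired up, not something that exists yet.)

import psycopg2

DSN = "host=example.cloudapp.net dbname=ice2ocean user=api"  # placeholder

def request_model_run(bbox, start_date, end_date):
    # Pull the gridded climate forcing inside the bounding box and time window,
    # then hand that subset to a SnowModel worker instance.
    lon_min, lat_min, lon_max, lat_max = bbox
    conn = psycopg2.connect(DSN)
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT obs_date, ST_AsBinary(geom), temp_c, precip_mm
            FROM climate_grid
            WHERE obs_date BETWEEN %s AND %s
              AND geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326);
        """, (start_date, end_date, lon_min, lat_min, lon_max, lat_max))
        forcing = cur.fetchall()
    conn.close()
    return launch_snowmodel(forcing)

def launch_snowmodel(forcing):
    # Placeholder: dispatch the FORTRAN executable on a worker VM and return
    # a handle the client can poll; one instance could be spun up per request.
    return {"status": "queued", "records": len(forcing)}

# e.g. the watershed above the Valdez bridge, for one month (placeholder bbox):
# request_model_run((-147.0, 60.9, -146.0, 61.4), "2014-06-01", "2014-06-30")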
That's one application of that idea. This is something I haven't implemented, but it's my
understanding of how I would go about doing this. I learned right when I got here about the
simulation runner, which was developed here in partnership with some other folks. This
executes parameter sweep jobs, which I think fits well with our application where we're doing
multiple parallel executions of a program, and it's implemented within the Azure cloud services.
I understand that the steps to do this are to create an affinity group which is essentially a virtual
network. It makes sure that all of the services are in the same location for optimizing that
process. The storage account stores the actual cloud service and then it uses SQL databases to
store the information on different executed jobs and then it's deployed as an Azure cloud
service. You can choose a cloud service package that lists how many nodes or processes you
want to be associated with that. I haven't done this yet, but I understand that would be a good
way to approach this problem. The final thing that I'll show you is where we want to go a step
even further with it. I've been showing you SnowModel and how it can be run over different time
periods and different spaces. Another step to add in here, remember we have all this great
information from our field work that we can use to better optimize model runs. I'm
representing that over here with what's known as assimilation. I talked about it earlier. Here's
a model run shown in this first line that's uncorrected and we know from our observations out
in the field that the snowpack should have been here and should have gone away by this time.
We do an additional iterative model run that fits it back to the observation and through that
iteration we actually get a better fit through process of assimilation. I think how that would
look is, again, you have a call from the client for a time span and a bounding box. It pulls out
the data that you need from the gridded climate products and then it also says grab all of the
other observations from the fieldwork and other sources that also are within that bounding box
and in that time period and send that to the snow assimilation subroutine and then do an
iterative run that allows us to really improve that and, again, send that out to the blob storage
location. That's the template, I think, for the assimilation kind of approach.
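(A deliberately simplified Python sketch of that correction loop; SnowModel's real assimilation scheme is more sophisticated, but the iterative idea is just this: scale the precipitation forcing up or down until the modeled snowpack at the observation dates matches what was measured in the field.)

def assimilate_precip_factor(run_model, observations, n_iter=10):
    # run_model(factor) -> dict mapping observation date to modeled SWE (mm)
    # observations      -> dict mapping observation date to measured SWE (mm)
    # This is a toy ratio-based correction, not SnowModel's SnowAssim code.
    factor = 1.0
    for _ in range(n_iter):
        modeled = run_model(factor)
        ratios = [obs / max(modeled[d], 1e-6) for d, obs in observations.items()]
        adjustment = sum(ratios) / len(ratios)   # mean model-versus-obs mismatch
        if abs(adjustment - 1.0) < 0.01:         # within one percent: done
            break
        factor *= adjustment                     # nudge the precipitation forcing
    return factor

# Toy model in which snow water equivalent scales linearly with the factor.
obs = {"2014-04-01": 520.0, "2014-05-01": 610.0}
toy = lambda f: {"2014-04-01": 400.0 * f, "2014-05-01": 470.0 * f}
print(assimilate_precip_factor(toy, obs))  # converges to roughly 1.3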
>>: [indiscernible] sort of correcting 20 percent scale errors or 100 percent scale errors?
>> Anthony Arendt: I think typically it's on the order of 20 to 30 percent, but I don't know if
there is any limit on that. I think if you have errors that large then it becomes a little more
nonlinear and the actual iterations might be more unstable, I think, so I think you want to be
making relatively small adjustments. If you're off by 100 percent to begin with then you should
go back and bias-correct your initial input grids. This is more kind of tweaking things to a
particular basin or time period. We're almost done here. I wanted to just say here's a future
scenario a couple of years from now. Here's a couple of my colleagues out in the field. They
have just taken some measurements of snow depth and some ground penetrating radar from
a snow machine. Maybe they have uploaded that to our SQL database and then they make a call
to the API for a simulation of what's the snowpack going to look like next week. You can
actually do that using weather forecast data. Instead of using historical climate products you
can generate just like any weather forecast. That would run through the snow model. It would
come back an hour later. Maybe they would have an app on their phone and they could say the
snowpack is going to look like this. They can communicate with the heli-skiers and they can
think about whether that is going to cause a likelihood of avalanches or that kind of thing, or
talk with the city of Valdez about what is the likelihood of flooding this year. That happened a
couple of years ago. The whole road got washed out and things were shut down. That's to me
kind of the nice future model of how things could look. I think we have the capability to do
something like that with resources dedicated to that. I'll just wrap up with a few summary
points. I think these cloud resources are opening a lot of opportunities for collaborative
research, real-time delivery of data to our stakeholders. Two main ways that we can improve
things quickly are doing more science using relational databases, and if we start putting our
computations into virtual machines I think that can help us get a lot of our work accomplished.
I really like this trend of Microsoft integrating with Python and other scripting languages which
has allowed me to very quickly get up to speed and that is what a lot of academics are using
and so I think this will help greater adoption of the MSR tools. Thank you. Thanks a lot. I
appreciate it. [applause]
>>: Can you comment on scientists using relational databases? Have you seen anybody go
through that conversion process? That is the first part. It is a two-part question. And the
second would be supposing that everybody works, the community just kind of went off and
decided they were all going to learn relational databases, how would that transform the
science?
>> Anthony Arendt: At least in my field a lot of my colleagues don't really even speak that
language. I think it's a different way of thinking. It really varies by discipline a lot and I think
that people in geophysics just tend to come more from just, I'm writing a Fortran program, and
there is not an understanding of how data can be linked together through a database. I think
they don't realize that they are using a relational database when they use a GIS. That's basically
what that is, the front end of a relational database. At least in my career I don't see a lot of even
just basic understanding of what could be done. When I show what we are doing people get
really excited but they don't want or have the time to make that extra step because it is a big
task getting everything into a standardized structure like that. But perhaps some of these other
resources we talked about could help minimize that or make that an easier step.
>>: [indiscernible] sample data [indiscernible] that we can get a [indiscernible]
>> Anthony Arendt: Sure, you mean with the data that I have been showing here? Yeah,
absolutely.
>>: Make it so that we have like I was telling you about that thing that we are working on on
this [indiscernible]. We have a system that is going. [indiscernible] and if there is some sample
data and prototype queries we can try to run it to see how it…
>> Anthony Arendt: I think that would be great. Maybe we could compare how quickly the queries
run. Yeah, I'd be happy to.
>>: [indiscernible]
>> Anthony Arendt: Yeah. Let's do that. That would be great. I have plenty of different data
sets that we could use, no problem. As far as transforming, I did get a couple of my colleagues
recently to, in particular, one of the postdocs we are working with really adopted the
PostgreSQL database framework, and the other co-authors on this paper were just kind of
blown away with how quickly we could just investigate things. We really don't have to sit and
write a huge script to do things. It really enables a quick look at our data, I think, and I think it
would vastly improve our ability to collaborate and relate data together. I mean it's not the
only answer to this problem.
>>: No, it's just sort of like the most evolved in a sense. It's been around for the longest time
and so we see efforts at the University of Washington with eScience with Bill Howe working
with the oceanographers to say let's all learn SQL, and that's what SQLShare is about. It's a
variation of that theme, but it's like you don't want to say everybody has to learn left and right
joins, but you do want everybody to sort of have an awareness of the space and deal with that.
Having that as a way to get at it just puts momentum behind the effort.
>> Anthony Arendt: I agree.
>>: There's a kind of an intermediate stage in between using SQL directly, let's say, SQL Server
and user queries, and old-school like you were talking about, which is through the type of API you
are talking about with LiveOcean and ice2ocean. To have the ability to sort of
format stuff in terms of what's supported with API queries. This is something you might be able to do.
And then as we go through and do that we try and build examples of an external app that
generates those queries, and then there's sort of an even closer-to-SQL step where you say, I'm
going to actually, in my interface, my API interface, support a grammar, which is
more difficult to program but is potentially closer to full flexibility. So there is like this whole dial
between old-school flat file scripts and select star from. Anyway.
>> Anthony Arendt: The API would accept SQL grammar, is that what you are saying?
>>: The example I was talking to my other collaborators with yesterday was with this data
system that we're building for all this organic matter, which is actually related to this; the
discussion about queries got so complicated they finally sort of threw up their hands and said
we really need to create a logical grammar from which you can construct queries.
>> Anthony Arendt: I see.
>>: And so it's got predicates and it's got things that are more SQL looking than just API calls
with a main parameter and sub-qualifiers. Anyway. It's a longer discussion but it's great to
have you having this discussion.
>>: I actually have a meeting at three.
>> Anthony Arendt: Yeah. Can I have your contact information? That would be great.