>> Yan Xu: The first keynote speaker will be Alex Szalay, and he will talk about the Genealogy of the SkyServer: From Galaxies to Genes. Thank you, Alex.

>> Alexander Szalay: Thank you very much. So it is now kind of more than ten years that we have been discussing [indiscernible], and I thought this is a good time for a bit of a retrospect, and also to see, actually, what are the spin-offs and lessons learned from the ten years of the SkyServer. So the starting theme is basically that what we see today is a diversification in scientific computing, where ten years ago it was all supercomputers. Today we see that one shoe doesn't fit all. And we also see all sorts of evolutionary pressure, often dictated by limited budgets, so that we try to make every dollar count when we create a computing environment or decide what to use. And so in this world, the diversity grows naturally, no matter what, because individual groups want specializations. We see, for example, that over the last few years, large floating point calculations have moved to GPUs, but if we needed lots of random IOs, people started to plug solid state disks here and there into their systems. We are learning more about how to write algorithms and build hardware to do high speed stream processing, and then we also see a diversification in terms of software. Where there were kind of [indiscernible] file systems, or we had [indiscernible] databases, now we have column stores, we have NoSQL, we have SciDB, so there are a whole lot of possibilities. At the same time, one can ask, okay, what remains in the middle with all this diversification on the surface? How far do the similarities go between different areas of science? And kind of the common denominator is big data, or data management in general. One can even define it this way: data management is what everybody needs and nobody wants to build. And this is in many ways the reason why we don't have a good off-the-shelf solution to handle this problem, because everybody is just trying to do the bare minimum that their particular project requires, because there is no incentive to do something better. Because that would take a substantial extra investment, unfortunately. So we are still building our own, and unfortunately, over and over. And so when I look back at the ten years of Sloan and the SkyServer, it is clear that up to now we have talked about the data life cycle. But there is very clearly not just a data life cycle, there is also a service life cycle. So basically, we are not just putting out data objects, we are putting out data objects through services and smart services. Basically, the services were built on a particular software environment and framework and operating system. And some of the web services for Sloan were the first web services done in science. And since then, the service environment and distributed computing have gone through several changes in paradigms. And every once in a while, just as much as we need to curate the data itself and fix the bugs and so on, we need to give a face lift to the services and apply the most up to date computational paradigms and frameworks and tool kits in order to keep them portable. So with this preamble, I'd like to get back to Sloan, which really changed my life in '92, and little did I know what I was getting into, for better and for worse.
So it was overall an amazing journey, but a journey I thought would take only about eight years. Instead, it has now taken 16-plus, and there were some dark nights in between. We were two months away from bankruptcy several times. And somehow, some money always came along and we were able to go on. And so now Sloan is in its third incarnation. There is Sloan 1, Sloan 2, and now we are on Sloan 3. And there are now plans shaping up for Sloan 4. But the archive kept going, and every year we consistently put out a data release, starting with the EDR, the early data release, and then there was DR-1, DR-2, et cetera. And the data is public, and has been. And amazingly, we set up this detailed schedule for the data releases, and then the data processing and even the data acquisition was delayed, but we never missed a data release. In some cases there were like a week or ten days to go before I got the data, and I had to get the data loaded before the data release happened. So a lot of people don't know that for most of the survey, there was essentially zero proprietary period for the Sloan project. Our staff saw the data at the same time it was put out for the public. So we set out to take two and a half terapixels. In the end, we covered about five terapixels, because we have now imaged much of the southern sky as well. The plan was to take about ten terabytes of raw data. But now, as we wrap up Sloan 1 and 2, the Sloan 1 and 2 legacy dataset is about 120 terabytes. We thought that the database would be about half a terabyte. Today, the main SkyServer that everybody's using, the latest, DR-9, is about 13 terabytes. But with all the ancillary projects and data products, we have more than 35 terabytes up in the public services. And so it started in '92 and finished in 2008. And what is amazing about this is that, of course, if it had stayed on schedule, we could not have done any of this, because this was enabled by Moore's Law and Kryder's Law. It just became cheaper every year, basically. So sometimes it's not so bad to be a little late. So when we built the SkyServer, again, not too many people know that this was actually for outreach. The project was violently against using SQL Server. So basically it was a sort of [indiscernible] project. Peter, Ani, Jim and I basically, over a Christmas holiday, took the old database from Objectivity and ported it into SQL Server, and then basically built a simple front end, and then we said okay, we don't want to change the public face of the project, this is just an outreach thing on the side. And after a few years, everybody started using this, so we got a lot of help from everyone. Curtis helped to design, basically, the whole website and the visual look of it. And now, in 20 years, we are at one billion web hits, a little over one billion web hits. And we recently did a count; we are now at about four million distinct IP addresses among the users. Which is pretty decent. And what we also see is the emergence of the internet scientist. These are people who are not professional astronomers, but are also more than just the casual user who, after reading a New York Times article, just clicks on an image on the Sloan website. So there are people who keep coming back. And today, [indiscernible] has been doing a yearly evaluation of citations, and a couple of times out of the last two years, Sloan has been the most used astronomy facility.
Also, there is a lot of collaborative server-side analysis done; I will mention it later. So this shows both the website traffic as a function of time since 2001 and also the SQL queries. The SQL queries are the queries that people are actually typing in. So this is not the service interface; it's just pure SQL that's coming in from the outside. So this is the SQL traffic, and this is the one million mark per month. So that's actually quite a lot. So these are the main components. It has grown into a fairly sophisticated system, so it's more than just a database with a front end. It's actually a fairly complex environment for how we go from the raw data. We have transform plug-ins, which convert the different [indiscernible] into a loadable format and then create little preload databases, which have the same schema as the main database, but then we kind of crunch the data there and scrub the data for all the errors before it can get loaded into the real database. We have a very elaborate schema framework where all the metadata you need, the schema descriptions, everything is held together, and it's automatically parsed multiple ways, like the old WEB language by [indiscernible], where one parser generated the code and the other generated the TeX documentation. We try to follow the same philosophy. We have a parallel loader that keeps track of the loading, so you can follow every piece of data and where it is in the pipeline. We wrote a whole bunch of extensions to SQL to provide the [indiscernible] knowledge. Then we have a batch scheduler for the long jobs, and the results of those go into a MyDB, and we are also currently working on a much easier converter, a Dropbox-like interface for how you can upload your own data into MyDB to join with the main database. And, of course, we have the SQL and web logs. It's now a couple of terabytes. So from day one, we have essentially every mouse click logged. At some point, this will be very interesting data for the history of science, looking at how people have been using it and how the usage patterns changed. We are still profiling it, figuring out how to improve the services. So the MyDB is one of the most characteristic and unique parts of this. Basically, after a while we registered our power users and gave them a user side database. This was done by Nolan Li, who was a grad student, and he also spent several summers at Microsoft as an intern. So the query goes to MyDB.

>>: He was not a grad student. He was a full-time tech.

>> Alexander Szalay: So this can be joined with the source database, and then the results can be materialized from the MyDB, so you can download it in any of the formats you want. The beauty of this is that it gives people all the power of the relational database tools, but also the flexibility on the server side, while keeping the original database read-only, so you don't have to touch anything inside it. And it increased the performance of the database by a factor of ten, because the outputs are going over a high speed network to a local high performance server instead of over the worldwide web. And so right now, we have about 6,800 registered users using the MyDB basically day in and day out, which is a good fraction of the astronomy community. Users can collaborate, and this basically emerged naturally. After a while, people kept the output data tables, and they said okay, can I share this table with my friends? We are working on a paper together. So we enabled group sharing.
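To make the MyDB idea concrete, here is a hedged sketch of the kind of query a CasJobs user submits. PhotoObj and SpecObj are real SDSS tables, but treat the exact columns and cuts as illustrative assumptions rather than a recipe; the point is that the result set materializes server side, into the user's own MyDB, instead of streaming over the wide area network.

```sql
-- Runs in the server-side batch queue; the output lands in the user's MyDB
SELECT p.objID, p.ra, p.dec, p.g - p.r AS gr_color, s.z AS redshift
INTO MyDB.MyGalaxies
FROM PhotoObj AS p
JOIN SpecObj  AS s ON s.bestObjID = p.objID
WHERE p.type = 3                 -- photometrically classified as a galaxy
  AND s.z BETWEEN 0.1 AND 0.2;   -- a redshift slice

-- Later, the private table can be joined straight back against the main
-- catalog, or shared with collaborators, as described above
SELECT m.objID, m.redshift, p.petroR50_r
FROM MyDB.MyGalaxies AS m
JOIN PhotoObj AS p ON p.objID = m.objID;
```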
After a while, people said okay, here is the paper, we published it, can we actually make this table publicly visible? So it became also a publishing platform. And it's really a very, very powerful paradigm. Data delivery is via a C# web service, and this is kind of the real manifestation of sending the analysis to the data. A lot of these simple joins, which would otherwise be quite complicated because they involve millions of lines, were done server side. So this is the genealogy of the SkyServer. There are the two components: the basic SkyServer containing the Sloan data, and then the CasJobs MyDB component. And this joint system has already been cloned by the Millennium simulation database, by Palomar QUEST, by GALEX. We are now building a genomics archive at Hopkins. The basic engine has been used [indiscernible] for environmental sensing, for radiation oncology, for Galaxy Zoo. We built a turbulence simulation database. Then the whole Edinburgh operation, SuperCOSMOS and UKIDSS, is based on the SkyServer. The Hubble Legacy Archive is now converted to this framework and to the tools. And then these are all the VO services. This is a much bigger cloud, because we have essentially most of the major astronomical data catalogs [indiscernible] as contexts inside MyDB. So you can actually join here not only with Sloan, but also with GALEX and all the other catalogs outside in astronomy. And then out of the turbulence database, we have now actually built another derivative for [indiscernible] hydrodynamics, and now we are building a bunch of cosmology simulations also, which will derive partly from the Millennium, but also take some of the lessons from the turbulence database. So you can see these are all different science projects, spanning not just astronomy and different aspects of astronomy, but also many other sciences. Of course, the SkyServer itself was derived from the TerraServer. That was Jim's project, going back another five to six years. And basically, the first SkyServer was really taking our schema and basically turning the TerraServer, which was the first Virtual Earth, Google Earth, outside in. Or inside out, sorry. So Galaxy Zoo is one of the derivatives where we basically took the Sloan images and, using the mosaic or the little cut-out service, asked the public to visually classify 40 million galaxies, and we expected that there would be a few thousand people coming. And in the next few days, we had about 300,000 separate users come in, so we had to install new servers, we had so much traffic. And there were a bunch of real original discoveries made that have since been sort of [indiscernible], that have been observed by space telescope [indiscernible] and so on.

>>: So the day you turned Galaxy Zoo on and then suddenly things catch on fire, all right, and you say we need more servers, do you just pull out a checkbook?

>> Alexander Szalay: No. Luckily, we had a couple of boxes sitting with servers that were not yet installed, and they were there basically for the [indiscernible], so we were immediately plugging those in and putting them online. So this was a lucky coincidence, otherwise. But anyway, certainly getting [indiscernible], getting a five-minute segment on the BBC nightly news, that was a big help. So anyway, a lot of the Sloan services formed the basis of a bunch of the VO web services.
And there will be several talks later on in the conference about the VO, so I would rather just focus on the sky query and how we describe footprints and exposures on the sky. And this whole spatial tool kit is now used for Sloan and the HST exposures, and it's at the heart of GALEX, Chandra, Pan-STARRS, and we're also testing it for LSST. We have written it in C++ and plugged it into MySQL, which LSST is currently using for the testbed. And we are about to release the OpenSkyQuery, which will really be able to cross match on the order of a hundred million objects against a hundred million in a matter of two, three minutes. So the spherical library was about 85. So originally, we wrote this with Jim, and that was not the .NET version; it was all computational geometry written in SQL, which was not a task for the everyday programmer. So working with Jim helped, certainly. And then Tamas rewrote it in C#. And now it's an OS-independent realization; we also now have a pure C++ Linux implementation of this. It's really two packages. One describes, basically, the geometry on the sky, spherical [indiscernible], and the relationships between them, and the other one does basically a very fast spatial index and spatial searches on Boolean operations of all those. And it can compute exact areas. And debugging this on the Hubble deep field was an amazing experience, because some of the intersections of the hundreds of exposures and visits created sometimes little polygons which were a few milliarcseconds on a side. And on the Sloan, we also have polygons which are 130 degrees on a side, and it has to work with both. Okay. And we also added a bunch of enhancements to be able to do computations within the database, which are also quite generic. So besides all the geometry, the curves, the space-time geometry in the database, we also added multidimensional arrays as data types, and wrapped CUDA code in user-defined functions. We can do arbitrary multidimensional searches, and this is not just on the sky, but inside a 3D simulation. We have incremental PCA over all the galaxy spectra in 15 minutes. We have direct visualization engines which are integrated with the database, and we are now also working on integrating MPI with the database cluster so that MPI applications can converse with a parallel set of database engines. So this is OncoSpace, just a slide about the radiation oncology database. This is using the SkyServer engine. What it is tracking, basically, is the evolution of tumors after treatment at Hopkins at the radiation oncology lab, and it basically provides a much faster feedback loop for the doctors to see whether a particular patient is on track, basically, to recovery.
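Going back to the spatial toolkit for a moment, here is a small illustration of what it gives you inside the database: the kind of cone search a SkyServer user can write. fGetNearbyObjEq is the standard SDSS table-valued function taking RA and Dec in degrees and a radius in arcminutes, though treat the exact output columns here as an assumption.

```sql
-- All photometric objects within 2 arcminutes of a position, nearest first;
-- the spatial index does the pruning inside the database engine
SELECT p.objID, p.ra, p.dec, n.distance
FROM dbo.fGetNearbyObjEq(185.0, 2.5, 2.0) AS n
JOIN PhotoObj AS p ON p.objID = n.objID
ORDER BY n.distance;
```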
So this is a slide on the terabase search engine. In genomics, we expect to see the same transformation as happened in astronomy ten years ago. Before Sloan, every astronomer who got data from somebody else had to redo the basic image processing; nobody trusted each other's data. And after Sloan, people tried this again and realized after a while that the data is good enough: the objects, the calibrations are good enough to actually do the statistical queries and the sample selections. And I think in genomics, the same thing is happening. Right now, everybody is still obsessed with doing the alignments themselves. And once people realize that the alignment done by a central site is good enough, they will start focusing on just doing the statistical searches over thousands of individuals to see the variations and the commonalities across the genomes. And so then we need a search engine that can do this job. For the 1000 Genomes Project, it's about 200 terabytes of data and growing. It has more than a trillion short reads. Ninety percent is aligned to the reference genome, but ten percent is unaligned, and we have already built a prototype where, in a few milliseconds, we can basically find any of the sequences in this 10 percent unaligned part, which is the hard job. For the aligned part, we basically just have to align the query sequence against the reference genome. But the outputs are also going to go into MyDB. So basically, the same framework applies beautifully. Life Under Your Feet is soil biology, taking a lot of sensor data and integrating it with biome data from a bunch of different areas. And the goal is to understand the soil carbon dioxide emission. The soil puts out 15 times more carbon dioxide into the circulation than all the human activities. But the human activities can modulate this: what we do in the rain forest and so on. So basically, we are trying to collect data using wireless sensors, and this is the number of samples that we have collected over the years. The first few projects were started with Jim, and we wrote the first papers with [indiscernible] and Jim involved. And then you can see that we are now putting out more and more experiments, and we are basically growing fairly rapidly. So we are at about 200 million data points right now. And this shows some of our deployments. This is an NSF site near Baltimore. This is in the Brazilian rain forest, and this was done in collaboration with Microsoft. This is up at Atacama, next to the cosmology telescope at Atacama. One of our colleagues carried a suitcase of sensors up there, and it was up for almost a year. And this is a visit in the lab by some people you may know. So there's Dan and Tony. And then I'd like to get to simulations. So what is also changing today is that the largest simulations in various areas are becoming instruments in their own right, and the largest simulations start to approach petabytes. And we need public access to these largest simulations. But this generates new challenges. How do we move the petabytes of data from the computers where we ran them to a site where everybody can access them? How do we interface with it, how do we look at it, how do we analyze it? And this is a transfer: we had ten days to transfer 150 terabytes from Oak Ridge before it was deleted, and we did it. So this is a database of turbulence, where we basically tried to adopt a metaphor where, instead of downloading the simulation files, everything is stored heavily indexed in a SQL Server database, and then we can shoot test particles, virtual sensors, from our laptop into the simulation, like we were playing in the movie Twister, and the sensors report back basically the fluid [indiscernible] pressure, or in cosmology we can think of temperature. So basically, physical quantities. And we are now loading this; actually, this 70 terabyte MHD run is already loaded. And we are writing a Nature paper about this. And this is the typical daily usage: about 100 million points get served up every day from this database.
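To make the virtual sensor idea a bit more concrete, here is a minimal sketch of the kind of lookup such a request resolves into on the database side. The table, column and function names are hypothetical, and the real service sits behind a web interface with a space-filling-curve index and mid-tier interpolation, so this only shows the pattern of a small, indexed range read per sensor.

```sql
-- Hypothetical layout: one row per grid point per stored timestep.
-- A virtual sensor at (@x0, @y0, @z0) at time @t0 becomes a small indexed read
-- around the enclosing cell; the interpolation happens outside the database.
SELECT x, y, z, vx, vy, vz, pressure
FROM VelocityField
WHERE timestep = @t0
  AND zindex BETWEEN dbo.fZindex(@x0 - @h, @y0 - @h, @z0 - @h)
               AND dbo.fZindex(@x0 + @h, @y0 + @h, @z0 + @h)  -- coarse, index-driven cut
  AND x BETWEEN @x0 - @h AND @x0 + @h                          -- exact box filter
  AND y BETWEEN @y0 - @h AND @y0 + @h
  AND z BETWEEN @z0 - @h AND @z0 + @h;
```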
And it's interesting: for the users, it doesn't matter whether it's a 30 terabyte database or a 30 petabyte one. All that they see is whether the points are coming back fast enough. And if they don't, they complain. So these are a whole bunch of papers that were submitted; you can see [indiscernible]. Now there's a Nature paper coming. So this is generating first class science. In cosmology simulations, the same thing is happening. A very profound thing happened in 2005. There was a simulation called the Millennium, run by a group of people in [indiscernible], and Lemson built a relational database out of the data, and then it suddenly took off like wildfire. There are now much better, much bigger simulations scattered around. Still, everybody is using the Millennium because it's easy. It took the SkyServer framework and simply plugged it in; with the same interface you can create basically the merger history of galaxies and so on. And you can now generate all sorts of realistic looking simulated observations of the data. And because it's easy to use, there are more than a thousand people using it daily. And we are now trying to build something at Hopkins, with an NSF funded project, to do a petabyte scale simulation, which is a single large simulation of the Milky Way with about 800 snapshots and more than a billion particles in the database, moving essentially just in their orbits around the galaxy. And then one of the particular things we are trying to do first is to compute the dark matter annihilation interactively. So this is a still from Via Lactea, which is only 400 million particles. This was part of the original Via Lactea paper by Michael Kuhlen. It took him eight hours to compute this annihilation. On the GPU database, the time is 24 seconds. So it's more than a factor of a thousand faster. And the trick is that, instead of computing the ray trace, we actually used OpenGL. We used the stereographic projection, where circles map onto circles and so on, and then we used basically a 20-line OpenGL code to do the same rendering, but at hardware speeds. And so next, what we would like to try is that people can change the cross section, so they can play [indiscernible] games with the physics of the dark matter particles. And people will wait 20 seconds to generate an image; they wouldn't wait eight hours. So how do we visualize a petabyte? And here, the thing is that we need to send the rendering to the data the same way as we sent the analysis to the data before. It's much easier to send a high definition 3D video stream to an iPad than a petabyte of data to be rendered. So basically, the visualizations are becoming IO limited, so we're now trying to integrate the visualization engines with the database so that the data traffic is going over the backplane of the motherboard. This is just an example for the turbulence. We have been streaming a little over a gigabyte a second on a relatively small server into a GPU, about 25 billion particles. But the nice thing about this is that it is done really in real time, on the fly, so this is not rendered frame by frame. This is a capture of the real video stream. The long tail. So Dennis already talked quite a bit about this. It's not only the big data that we need to be concerned about; there's a lot of data in aggregate in the tail of the distribution. And there is -- is there a lesson to learn from Facebook? In a sense, what is Facebook?
You bring many small, seemingly unrelated pieces of junk data together into a single cloud, a single location, and some magic happens, basically, because people discover previously unknown associations and links. And is there a science equivalent? What is it and how do we do it? And there is another lesson from the public services, which is DropBox. DropBox is an interface that nobody has to learn, because it's totally intuitive, and it is so crisp. It is so hard to do such a crisp interface, where there is nothing extra on it, only what is absolutely needed. But what is needed works like a charm. And there's a lesson there for science and for scientific interfaces. Okay. So we are trying to do something that takes some of these lessons. We already have the VO Space, the Virtual Observatory's distributed storage with security added on it, which is visible to all the web services of the Virtual Observatory. So what if we combine the DropBox-like interface, a MyDB-type relational database to store both the metadata and a lot of the tabular data, and then the VO Space, but make the VO Space completely fault tolerant and distributed, a real cloud storage where the data is indestructible? And basically offer free cloud storage for the VO community, so that you can upload all your images, whether you need them or not. Anyway, here is where you store them. And then once it's up there, without asking any more information from the user, no coercive XML forms to fill out, let's just start mining all those headers and look at the context, at what files have been placed in the same directory. Every once in a while, ask the user: maybe we see that there is some commonality; how would you call this common tag for all these objects? And this is the most intuitive thing I can imagine. Anyway, since in FITS files there's a lot of metadata already, without asking anything from the user we can always derive a very rich context. And let's see how it can be generalized to other areas of science. And this is sort of a rough architecture, where we would have basically a VO Space interface, we would have a DropBox interface, but it would also basically hook up to a database, and we can also have multiple regions for fault tolerance. You can imagine having one of those instances at Caltech, another one up at CfA, another one in Baltimore, and the data is basically everywhere. Wherever you need the data, it's already there. Summary. Science is increasingly driven by data. The large data sets are here; the cheap, off the shelf solutions are still not here. We see a changing sociology in how the data acquisition is separating from the analysis, and in between those two sits basically the data publication and the databases. What we see is that we are moving from hypothesis driven to data driven science. There are many deep similarities among disciplines related to data. This is, I think, the main lesson that we learned from the SkyServer. I was amazed at how much of the basic parts of our tool kit have applied, from radiation oncology to environmental science and to genomics as much as to astronomy. So I can say with confidence that we can build reusable solutions. There is a problem in the long tail, but we see now this fourth paradigm of science is emerging. And I think Jim would smile. Thank you very much.

>> Yan Xu: So we have some time for questions.
>>: This isn't a question, but I got to see some of the weirdest uses of it, having gone from the Sloan to Galaxy Zoo. At the planetarium, I was watching what they call sky karaoke, and they used Galaxy Zoo to fly through the universe with the lyrics to Major Tom showing up spelled out in individual galaxies that were matched to letters. So the most improbable things come out of the SkyServer. I never would have guessed it would be Major Tom, spelled out in galaxies on a 200-foot dome. That's the long tail for you.

>>: George?

>>: So one part of the fourth paradigm that is not really appreciated sufficiently, I think, is the paradigm shift that theory is now expressed as data, as opposed to as formulas. We still have analytic theory, but all these complex phenomena can only be done numerically. And so what you're doing pushes this really to the forefront: you're observing simulations, you're treating simulations the same as you would measurements, because pretty much [indiscernible]. So what's your sense of the world view change among the theorists facing this transformation?

>> Alexander Szalay: Well, I think it already has happened, or it almost has happened. So when I was in [indiscernible], there was an exchange. [indiscernible] and Simon White were talking about an email they got from Richard Ellis, and they were arguing about how to write the SQL query to get the correct merger tree and mass function from the Millennium database. Pundits are starting to argue about SQL. You know that something has changed. Otherwise, I think the other change in our world will be with the simulations: it is very easy to generate lots of data, and we will understand that not all data is equal. So there will be some that we will just need to analyze and keep for two weeks. There will be others that we need to keep for two months. And then kind of naturally, probably by natural selection, a few simulations will emerge, like the Millennium, which will become de facto references for the whole community. And those we need to do right, and those we need to keep for years at a time. But right now, we kind of shove all the data around and think that everything is equal. And this is kind of becoming a barrier.

>>: So survey data, which is primarily what you're delivering at the moment, has evolved over the last 15 years or whatever from being discrete objects defined by a couple of single numbers to discrete objects defined by many numbers, including multiwavelength. That's been the big revolution at the moment. And I think the direction it's going is that it's not going to be discrete objects, it's going to be a continuous field across the sky, and radio data is going to be a classic example of that. How do we deliver that kind of data with these kinds of technologies, which are really [inaudible]?

>> Alexander Szalay: Well, actually, I don't see a big deal at this point in scaling up, for example, keeping all the Sloan images online, and with enough GPUs around, we could basically give custom, on-the-fly image processing for limited areas. Not for the whole sky. But if you have something like five square degrees, I don't see an issue with doing something funny to the data and then running the source extractor again with some more [indiscernible] parameters. That's perfectly doable.
Or actually, you know, to do another thing, have a whole bunch of amateurs uploading a bunch of images to the VO Space/DropBox and then essentially have a custom reprocessing of a good fraction of them which happen to overlap with the Hubble deep field. We just found a big new supernova or something. So that's completely doable. It's a question of compute cycles and more [indiscernible] with us. Doing that at a hundred petabyte scale is a different matter. But again, if you take little bits of the hundred petabytes, I don't see a problem with it.

>> Yan Xu: Okay. Any other comments, questions? Okay. Thank you very much again, Alex.

>> Phil Bernstein: Hi, I'm a database researcher, and I work at Microsoft. And I don't do astronomy. I don't actually apply database technology to science in particular. I'm mostly an enterprise computing guy working on problems that are related to big business use of databases. But Alex and company asked me to talk about data integration technology, which is a part of the database field where, if you're working with data, you're doing a lot of this, and automating it is a fairly important problem. So I'm here to give you a survey of what's done in data integration, but not particularly from a science focus. So the problem is easy to state: it's the task of combining information from multiple sources and presenting a unified view of that data. So the input is raw data. It can be arriving from instruments. It can be arriving from users typing stuff in. It's probably heterogeneous in some way; that is to say, different formats, different assumptions about the meaning of the data, perhaps different measurement systems, and from different groups. So it's not like you can necessarily talk to the people next door to find out what the data means. It's a little more challenging than that. And so the question is, what do you have to do to get to the point where you can actually do the work that you want to do? This is a problem of data preparation: everything that you do with the data before you actually get to solve the problem that interests you. It's hard, it's expensive, and I can tell you that in the enterprise world, it's about half of the work in getting a database and making it useful. So as a perspective, the data warehousing field, which is a part of data integration, is probably about a third of the database market, maybe a little more. And half of the money that's spent, half of the labor that goes into it, is just in getting the data into the data warehouse, to the point where somebody can actually run a query. So if you've been experiencing a lot of pain in this, it's not because you're particularly poorly organized or whatever. It's just the way things are when you work with data. So at the next level of detail, what are the scenarios people run into? Well, one is data translation. You're given data in one format, and you've got to get it into another format in order to be able to use it. And it's not only format; it's often changing the semantics, combining the data in funny ways. If you're dealing with a stream of data coming in off of instruments or from other sites, it might be a stream of XML messages. You might be loading up a data warehouse; in a sense, your big astronomy databases are, in fact, effectively data warehouses. Or you might leave the data right where it is. You might not actually pull it together.
You might have a query interface that maps to the data, wherever it is, and then you run distributed queries to pull the data together. Or you may be building some nice portal to get access to the data, and the data's in various places; you're exposing it in a nice format and giving people the ability to write queries as well as form-based interfaces just to retrieve individual records. Or you might be writing a program in Java or C# or whatever in order to get access to the data with queries embedded inside. And in that case, you probably want a wrapper that maps from the object-oriented point of view of the program into the SQL point of view, which is the way the data is accessed where it lives. And so on. Report writers, query designers, form managers, web services. These all have that same characteristic of having to expose the data differently from the way it was originally produced, and that gap has to be filled by a pile of work. What's common about all of these problems is that we're mapping between two representations of data. And there's generic technology to work with that; that is, to create mappings and to use them in order to do the translation that's required. Now, the space of problems that you can experience here is pretty broad. I'm only going to focus on a little bit of it, so I want to kind of outline the whole space so you can see at least what I'm not going to be covering and what other issues might arise. So in one dimension, there's the question of the precision of the mapping that you're producing. You know, if you're answering queries in Bing or Google, the mapping can be pretty approximate. Nobody is expecting a perfect answer anyway. The important thing is to be able to automate it. On the other hand, if you're trying to draw scientific conclusions from the data, then you probably have to make sure this mapping is absolutely right, validated, you know, five times by different people to be really sure. Or in my world, if you're doing billion dollar money transfers, you also have to be absolutely precise in the mapping definitions. Then there's the kind of data that you're trying to integrate. It could simply be formatted records that just have numbers and strings, and there's nothing special about it. But often, there's a lot more to it. It can be text, you know, like paragraphs. It can be geographical information, images, time series. You probably know more of these than I do. And they all have special kinds of integration problems associated with them. Then there's the question of the process by which this integration is happening. You know, if we're days away from an asteroid hitting earth and you need to draw some quick conclusions, your problem is time boxed. Taking a year to integrate this data is not going to -- it's too long, right. Or similarly, if you're doing emergency management or you're responding to an emergency -- a military situation where you've got different military groups combined all of a sudden on the battlefield and you need to tie their data together. So you have to time box it. I talked about the fully automatic case in the case of Google and Bing. You might want to incrementally develop it. You're not even sure how much of this data you want to integrate, so you're going to integrate a little bit of it and kind of see how it goes. And then, based on the integration you've done, you're going to integrate a little more and kind of pay as you go, if you will, through the integration.
And then there's the carefully engineered problem, where you're creating this data resource, like the Sloan Digital Sky Survey, and you're just going to put a ton of work into making it as perfect as it can be, because you expect a lot of people to be using it for a long time. So most of the solutions I'm talking about here are for precise data integration of formatted data with a carefully engineered process, which is probably pretty close to what you mostly work with. But as you can see, there are many other combinations here that are also very important in other settings. All right. So the problem is you've got these data sources and you want to create a mapping to some target. Now, it could be that all you've got is the data sources and you want to generate the target format from the data sources, or you might also have defined the target format ahead of time and now you want to map between the given sources and the target that's already been laid out. It could go either way. So ideally, the mapping should be abstract, because you want it to be short and easy to understand, so you can convince yourself it's right. It should be easy to express; the whole point is you don't want to have to do a lot of work to do this. To be sure you've got the right one, it has to have clear semantics. That seems obvious, but many of these mapping languages don't have that property. And even when it is a precise language, if it ends up that the mapping is expressed in thousands of lines of imperative program code, then yeah, it has clear semantics, but how sure are you that it's actually doing what it is you hope it to do? And then finally, it's got to compile into something that's executable, because the whole point of developing the mapping is to get the data into the format in which you want to use it. So in terms of abstraction, you know, SQL's better than Java or C# or Fortran, but it turns out that even higher level languages are needed and are commonly used. So let me give you an example of how this might go. I'm going to pick one that I'm intimately familiar with; I worked on it for many years. Then I'll try to give you a feel for how it applies in other settings. This is the problem of generating an object to relational wrapper. The scenario is that the data's in a relational database and I want to write an object oriented program to access the data, so I've got to somehow expose the data as a set of classes. And intuitively, it seems like this should be really dead simple. A table sort of looks like a class in a programming language. Don't you just map them one to one and that's the end of the story? And the answer is hardly ever. And we'll see why. So here's a really simple example. Again, from the commercial world, but you'll all relate to it easily. On the left, I've got three classes. The root class is called person, and it has two specializations, one called employee and one called customer. So I've got two kinds of persons. And let me just give myself a pointer here to work with. And there's inheritance. So the employee has a department field, but it also inherits ID and name from person. And then similarly, customer inherits ID and name from person. Now, the way this is stored in my database is in three tables.
I've got a human resources table, which just contains the ID and name of people; an employee table, where those persons who are employees also have a row with their ID and department, because the department, recall, is the additional information for employees. And then customers, well, customers are special. I've got a whole separate table for them. So their information isn't stored in the HR table or the Empl table; it's in this separate table. So the mapping, I mean, this is pretty dead simple. The mapping, I'm sure you can all see what's going on: the person information is obtained from the union of all the information in the HR table, plus the two columns of the client table that are relevant, ID and name. So you take the union of those two tables. And then the employee class is simply the join of HR and employee. And the customer class, that's equal to the client table. So really simple. And you may think this is a joke, but this is actually the SQL you have to generate in order to implement the mapping on the previous page. And the main reason is -- I see a couple of people nodding their heads; it's like you've been through this. So the reason is that when you retrieve one of these people from the database, you don't know, necessarily, whether it's a customer, an employee, or just a person who has no specialization. And so you've got to pull the data out of all three tables and then, based on what you find, sort out which class it is you're supposed to instantiate for that particular person. You know, if the person appears in the HR table and nowhere else, then it's a person and it's not a customer or an employee. But if it's in both the HR table and the Empl table, then it's an employee and you want to instantiate that. So what's going on here is a bunch of bookkeeping in order to figure out which of the objects you're supposed to actually create, all right. And this is utterly standard SQL, and it is actually generated by a particular system that I work on. So how do you do this? Well, to generalize slightly, there are basically three steps in solving mapping problems like this. The first is: given the source and the target, just where are elements related? I mean, never mind the details of the semantics. Just know that some elements of the source actually correspond to some elements of the target. You just draw arrows, basically, connecting them. This doesn't have any special semantics, but it's already useful all by itself, because it tells you what the lineage of the data is, how the target relates to the source. You may be able to use it for impact analysis: if I change the source data, which parts of the target are going to break, because they're related. The next step is to turn those into mapping constraints, like the formulas on the previous page, which actually tell you what the semantics of the mapping is, and that's generally done by partitioning the correspondences into groups so that you can figure out which pieces of the connections actually relate to each other. And then finally, you want a function, a transformation that will actually translate the data from source to target in some language that your system supports. So it's a three-step process: from schemas to correspondences, correspondences to constraints, and then constraints to transformations.
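Going back to the person/employee/customer example for a moment, here is a minimal sketch of what those three mapping constraints look like when spelled out as SQL views over the HR, Empl and Client tables from the slide. This is just the mapping itself, not the actual generated code, which, as noted above, also has to do the bookkeeping that decides which class to instantiate.

```sql
-- Person = union of HR with the two relevant columns of Client
CREATE VIEW PersonView AS
    SELECT ID, Name FROM HR
    UNION ALL
    SELECT ID, Name FROM Client;

-- Employee = join of HR and Empl, picking up the Department column
CREATE VIEW EmployeeView AS
    SELECT h.ID, h.Name, e.Department
    FROM HR AS h
    JOIN Empl AS e ON e.ID = h.ID;

-- Customer = exactly the Client table
CREATE VIEW CustomerView AS
    SELECT ID, Name FROM Client;
```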
Now, this is -- I motivated this for the object to relational mapping, but clearly if you were populating a data warehouse, it's the same thing. That, you know, you're translating data from one format to another for any reason. It's always these three steps. So the first step is just matching up the schemas. And this is a field by itself, schema matches. You know, there's many ways in which you can figure out or guess how your source database formats map to your target formats. You know, you expect, of course, the element names to be the same. But they're 21 often not exactly the same. They're sort of the same. Maybe there are synonyms involved. Maybe by looking at the data instances, you can tell. Maybe by seeing the nesting structure of the elements is the same in source and target. There are literally hundreds of research papers describing different techniques that can be applied in different situations just for this one -- and basically, are all embodiments of some intuition that you might have when you're trying to figure out how to connect up the data. Ideally, these algorithms are available to you in some kind of a matching tool, where you've got a source database structure on the left, the target database structure on the right and you can start drawing lines between them to identify how they relate. Again, ideally, with some assistance. So in the tool that's shown here, if you right click on one of these elements here, then what it will do is pop up a list of lines that connect to possible elements on the other side that are good candidates for you to look at. You know, and if the schema's only got ten elements on both sides, this is not a big thing. But if it's got hundreds of elements on both sides, and in some cases it can have thousands, then obviously it's a big help to be able to get the system to assist you in picking good candidates. Next is getting from those correspondences, those lines into constraints. Generally, this is a manual process that mostly involves figuring out which combinations of correspondences are grouped together to define a particular mapping. You basically are looking at selection and projections of one piece of your sort database equal to selections and projections of some piece of the target. You want to avoid joint operators, if at all possible when you do this, just to kind of figure out like this little rectangle of my source is meant to be this little rectangle on my target is a way to think about it. And then finally, there's the generation of transformations and you want that to be totally automated. So there's a bunch of products that do exactly what I'm describing. The one I'm most familiar with, I helped develop, the ADO.net entity framework. But there's an open source version, targeted primarily for Java called Hibernate. Oracle's got one called TopLink. Ruby on Rails, essentially the rails portion of Ruby on Rails is essentially this, but with a much weaker mapping language than the other three that I talked about here. 22 But it's not just for object to relational mappings. scenarios that I listed on that early slide. It's also for these other There are other problems besides the one I just described that relate to this whole field. One is schema translation that you've got this source database. It's in a relational database. You want it to be an object oriented structure to make it easy to write a program against it. 
Do you really have to design this manually, or can somebody just write an algorithm that will do the translation for you? So there are all kinds of combinations, going from object schemas to SQL schemas. You wrote the program; please generate the database structure for me. Or I've got the database; please generate the program structure for me. Or there's this XML export format of this database that I need to use; can you please generate a database structure that I can store this in, nicely shredded, so that I can write queries? So there's a whole category of algorithms to do this kind of thing. And it's not just based on the schema. It's also based on the data. For example, here I have three tables: a course table, a course details table, and a course classroom table. And as you can see, they all have an ID field. And the course details table has the same number of rows as the course table. What that suggests to me is that there's actually just one concept here, and somebody just cut it down vertically and put half in one place and half in the other place, probably because the course information was more important than the course details information. By contrast, the course classroom table also has an ID, but it has fewer rows, and there's still a foreign key relationship here, which means that all of the ID values in course classroom appear in course. That suggests that course classroom is actually a specialization of course. You know, it's a subtype. So just looking at these very simple pieces of information about the data itself, you can make intelligent choices as to what the schema ought to look like when translated, in this case, into an object oriented one. So this is just by way of example. There are lots of these kinds of algorithms, but they all try to use this kind of reasoning. So what I'm leading up to here is that there is a small set of operators to manipulate schemas and mappings, and it is really only about eight or ten of them. I showed you how to match two schemas, how to generate constraints from the correspondences, how to generate transformations from constraints, how to translate schemas from one data model to another. There are only about four or five more operators like this: merging schemas, differencing two schemas, composing two mappings, inverting a mapping. These are all familiar mathematical concepts, so it's not surprising that these should be the operators here. Let me just give you one more example, which is schema evolution. Here, you've gone to the trouble of doing everything I've just said, but now something changes, and the person who did the original mapping, they graduated. They don't work for you anymore. And so now you have to evolve what you've got into a new format. So you've got this new schema that you need to support. So what do you do? You had a view or a mapping to some user view of it, and you had the database. How do you handle the evolved schema? Well, it's sort of the same problem I described before. You create a mapping from the old schema to the new one and then you generate a transformation from it, pretty much the same thing that I described earlier, hopefully automatically, in order to migrate the database. But what about the view that is running on the old schema? You need to have it run on the new schema. Well, that's a composition problem. You basically need to compose these two mappings in order to get from the evolved schema to the view.
Now, if the mapping from your original schema to the view is going this way, and the mapping from the evolved schema to the old schema is going this way, this is function composition, utterly trivial, you know, just plug it in. But if the mappings are going in opposite directions, life is extremely hard, because now you have to invert one of the mappings in order to be able to compose them. And depending on the language, that can be hard or impossible. And so, you know, people write programs to do this without even thinking at the level of mappings and just say, boy, this is really hard, how do I -- what do I do with this little piece of data in order to get it into the other format, and they don't realize that the problem they're running into is that they're trying to invert a mapping that's not invertible and making bad guesses. Okay. Let me wrap up with just a few quick slides on another dimension of the problem, which is tools for manipulating instances. I've been talking entirely at the level of schemas and mappings. What about actually just cleaning up your data and loading it? In the commercial world, this is called extract, transform and load, and it's a standard component of any data warehousing tool set. Here are some example functions it might include; let me quickly talk about a few of them. So data profiling is the problem of giving you a summary of the instances of the data without going blind looking through the instances one by one and trying to imagine what the entire dataset looks like. The first column of your table, gee, that's unique, you know; there are no two rows that have the same value. That's useful information. Or maybe it's not the first column; maybe the first and the third together will uniquely identify every row of the table. Or it might be that every value in this column of table two is actually contained in the first column of table one, and there is obviously a reference going on here. Value ranges, value distributions. The result can be exact if it's exhaustive, or approximate, based on sampling. I don't know how often this comes up in your world. In the commercial world it comes up all the time. People give you data and you've got to kind of figure out what it is they actually have. Fuzzy lookup is another one. You start wanting to search the data, but in the beginning, you're not really sure what the values actually are. You know that there's a value, there must be a row in there for this product. You don't know how it's spelled, and that's sort of the whole point, but you want to find the row. So you need to do fuzzy lookup. And it's more like kind of keyword search against the internet, rather than doing a SQL query to find a row. But often, fuzzy lookup is not supported by the database system, so it's a separate mechanism that's needed here. Then there's parsing. Some of the fields you're given are actually just a pile of text, but that's not how you want to store it. You've got names and addresses, and of course you want to parse that out into a structure that you can actually work with. And, you know, if you could just give it a few examples, maybe the system can figure out the rest for you and develop a parser that will automatically do that reconstruction. Then finally, there's duplicate detection. The problem is, of course, that duplicates are hardly ever exact duplicates. They're almost duplicates, so you need to do fuzzy duplicate detection.
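Here is a hedged sketch of what a first-pass fuzzy duplicate check can look like in plain SQL, using the SOUNDEX and DIFFERENCE functions that SQL Server ships with. The table and columns are hypothetical, and real de-duplication tools use much richer similarity measures than this, so read it only as the shape of the problem.

```sql
-- Candidate near-duplicates in a hypothetical mailing list: same city,
-- names that sound alike, and addresses that are "close" by the crude
-- SOUNDEX-based DIFFERENCE score (4 = most similar, 0 = least)
SELECT a.RecordID, b.RecordID, a.Name, b.Name, a.Street, b.Street
FROM AddressList AS a
JOIN AddressList AS b
  ON a.RecordID < b.RecordID               -- each pair considered once
 AND a.City = b.City
 AND SOUNDEX(a.Name) = SOUNDEX(b.Name)
 AND DIFFERENCE(a.Street, b.Street) >= 3;  -- streets spelled a little differently
```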
And everybody's experienced this, because we all get mail from solicitations or advertising: three envelopes come from the same place, because the address is a little different. That's a product category; people sell software to de-duplicate address lists. So it comes up all the time, and there's generic technology to do this. So you want to be able to figure that out automatically. So that's kind of my whirlwind tour of the data integration field. You need to define mappings, and you want them to be high level. Ideally, you have these powerful operators to manipulate the schemas and the mappings. And after you develop the mapping and it's doing its job for you, you're going to have to maintain it as systems evolve, if the data has any long life to it. And finally, you also need tools to be able to clean the data initially and then to be able to go back and do that again as well. So that's all I had prepared here.

>> Yan Xu: Questions, please?

>>: I was interested in your fuzzy lookup stuff. In astronomy, we often get this problem. Let's say I take the Sloan database. I [indiscernible] to cross-match that to another database at another wavelength. So typically, an entry there, that's probably that one, but it could be that one. Maybe there's no set probability of [indiscernible]. Neither is actually right. Really, I need to have an output database which contains those probabilities. So it's a bit like your fuzzy lookup there. Is there any machinery in the pipeline that might be able to handle that case?

>> Phil Bernstein: I don't know how much we can automate it. I mean, the details of how you calculate those probabilities and what the tolerance is for declaring one element a candidate match for the other one are clearly very important. But the general area of probabilistic databases is actually something that a lot of people have been working on in recent years. Now, the thing is that we love to do things generically in the database field. Your point of leverage is to come up with some very general purpose technology, like SQL, and then it can be used for everything from accounting to astronomy. So the same thing is true in this probabilistic database area, where people are trying to develop generic ways of manipulating rows where they have some probability of values being correct, some probability of values matching. Whether that general probabilistic model meets your needs, I couldn't tell you. So I don't know whether there's something in the pipeline that will actually help you with that, but there is a field, and if somebody takes an interest in learning about probabilistic databases, you can see whether it applies. And I'm quite sure people doing that work would love to hear about an application area, whether it fits or not, to understand exactly what the mapping is and whether there's something they can do a little differently in order to meet your needs. I'd be happy to connect you up to people in particular if that's an interest of yours.

>>: So the one thing you didn't talk about in the summary here is the tools to clean data. By and large, data are dirty, particularly from instruments that sit out there in the field and measure [indiscernible]. So one of the things is that it's often difficult to automatically identify those kinds of things. And so it sounds like this probabilistic analysis is something that would be useful.
Is there work going on in this area to -- in other words, is there a way for me to -- typically, the way I do this is I plot data. And I look at -- >> Phil Bernstein: The outliers. >>: I say that doesn't look so good. But that's hard to generalize. In other words, how do I find similar kinds of things through the whole dataset? >> Phil Bernstein: Right. So as I said, there are these ETL, extract transform load, systems, and they often have nice graphical interfaces for developing pipelines of operators. But figuring out how to actually clean the data requires some domain expertise about the data. So the question is, what can the tools do to eliminate what you consider the tedious part of the problem and allow you to focus on the intellectually challenging part of the problem? And that's an ongoing iteration. I mean, every release of tools does better at that. So, you know, clearly, visualization is part of it. >>: The question is I'd like to, you know, circle something on the graph and say find me these things. >> Phil Bernstein: Ah, you want to get from the visualization back to the data. >>: Obviously, if I look at a month's data, I can see ones that are bad. It's when I'm looking at, you know, 60 years of data across 200 different locations that I can't do it all visually. And so the question is, you know, the work on the tools that are needed to let me somehow tell the computer why I think this particular point is bad and then, you know, find me things that are like this. >> Phil Bernstein: It sounds like a very interesting problem. We should talk about it. It's a -- well, okay, but data mining and databases, you know, there's a fuzzy connection there, and we do both. But it's pattern matching, it's -- there's a provenance problem in there as well to figure out, because, you know, you're looking at data in some visual format. You know, let's look at the exact instances of that data and then use those as templates maybe to find related instances and so on. There are probably multiple components to it; you're developing an end user scenario, if you will, for using the data. And then the question is what components are needed in the pipeline to actually be able to make that happen. It sounds really interesting. I hope we'll have a chance to talk about it soon. >>: Where do B to B [indiscernible] issues come in here? Are they just another mapping, or what impact does the [indiscernible] you're describing here have on the [indiscernible]? >> Phil Bernstein: You're talking about business to business communication. This is where the action is in B to B e-commerce. So Amazon is dealing with heaven knows how many suppliers of products, you know, and they define a form, a catalog. How do you suppose all those folks get their product into the Amazon catalog so that when you browse it, they all look the same, you know, and they're readable? And the answer, of course, is that Amazon has a target schema. And if you're a vendor, you've got a catalog, and you've got to map from your format into their format or you've got to get somebody to do it for you. >>: So in that case, they're the biggest -- >> Phil Bernstein: They get to define, yes. >>: They get to see it. >> Phil Bernstein: Right. >>: But in more peer to peer contexts, are there other [indiscernible]. >> Phil Bernstein: In more peer-to-peer contexts, then it's a business negotiation as to who's going to do the work. You know, I don't know if I have any good example. Mergers and acquisitions are an example sometimes.
It's Bank A, you know, acquires Bank B and they've got to integrate all the data. It's clear who the source and who the target is. And then they've got to figure out just how much work it's going to be. >>: [inaudible]. >> Phil Bernstein: Who's going to do the work. Then you worry about making sure the people in the acquired company are willing to stay. What incentives are you going to give them, because they know they've got no future unless they want to move to the other bank's headquarters. And in some cases, it's so expensive, you've got to size the project. In some cases, it's so expensive, the decision might be made to run the two systems in parallel indefinitely, because it's just too expensive to make them into one system anytime soon. So I don't know. I'm sure that there are lots of smaller examples of this. Perhaps research labs collaborating. I've seen it in bioinformatics. You get a post-doc at some biolab and they're grabbing data from three collaborators. Source data formats change, just willy-nilly; nobody bothers to tell you. It's just that you bring the data in one day and it just doesn't run anymore. And you don't know why, anyway. So it's the same in science as in business. >>: One of your early slides showed a simple map of just spaghetti code. Are there tools for validating the correctness of the code? That it does just what you think it does. >> Phil Bernstein: Boy, I want to say yes to that one, but I'm not sure I can. There's some fairly fancy math that goes into doing the compilation. This is not simply a matter of a developer sitting down and making a best effort to compile the constraints into a SQL program. In the case of the Entity Framework, which was the mapping language that I described, the constraints actually compile into two different sets of view definitions. One set of view definitions maps the relational format into the object format so that it can be exposed to the program when it runs a query. The other views map the object format into the relational format so that updates that are applied to objects can be propagated down to the relational format. Now, a moment's thought will tell you that it would be awfully good if the composition of those two mappings was the identity. I mean, if you store something and you bring it back, you hope you're going to get back exactly what you stored. No more, no less. So there are tests on the compiled view definitions that you can run in order to make sure that's true. They're not cheap. This is an offline activity. But in principle, you can do that. And, in fact, we do. And, in fact, for technical reasons, we have to, because in our system, it's actually possible to define constraints that we can't compile into view definitions that compose to the identity. I mean, you know, the reason why we do that is because if we were to restrict the mapping language enough to ensure that never happened, we would end up making it impossible for you to do some mappings that you really want to have. So instead, we allow you to express some mappings that we can't handle so that we can allow mappings that you do want, and then during the compilation process, we figure out whether you produced one of the mappings we know what to do with or one of the ones we don't. And we'll tell you which ones. And we do it through this check. So it's not exactly a complete correctness test, but it does give you some additional confidence, because the two mappings are generated independently.
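A toy version of that round-trip check, run on sample instances rather than symbolically on the compiled views (which is closer to what the talk describes); the record layout and field names here are invented purely for illustration.

```python
# Two independently written mappings: object -> relational row, and back again.
def obj_to_rel(person: dict) -> tuple:
    return (person["id"], person["name"], person["city"])

def rel_to_obj(row: tuple) -> dict:
    pid, name, city = row
    return {"id": pid, "name": name, "city": city}

def roundtrips_to_identity(objects) -> bool:
    # store then read back: the composition of the two mappings should be the identity
    return all(rel_to_obj(obj_to_rel(o)) == o for o in objects)

sample = [{"id": 1, "name": "Ada", "city": "Baltimore"},
          {"id": 2, "name": "Jim", "city": "Seattle"}]
print(roundtrips_to_identity(sample))   # True: this pair of mappings composes to the identity
```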
So if they compose to the identity, they're probably right. >> Yan Xu: One last question. >>: So going to a different part in the space, basically, there was [indiscernible] tables. So would you care to comment about the usefulness? >> Phil Bernstein: Yes. So I'm involved in that kind of work as well. We have quite a big project here along the same lines. And so I'm not sure. Are you talking about web tables or fusion tables? Because there's two different projects that he's involved in. So web tables is the observation that if you take the result of a web crawl and look for all of the HTML tables that are there, and get rid of all the ones that are just HTML tables used for formatting a web page -- so you just look at the tables that are tables, they have data in them -- there's about a half a billion of them out there. Surprising. That doesn't count, you know, the forms interfaces that are sitting in front of databases. We're just talking about tables where the data is actually sitting there on the web page. That's a lot of information. Clearly a lot of it is redundant. There's a lot of useful information you can get out of combining those tables. You know, if you take a given topic, you might find a hundred or 200 tables about that topic and you can kind of merge them and, you know, majority rules. You can do good things with them. That's one project. I think there's going to be a lot more of that. We're certainly doing a lot more of that. The other is really along the lines of a lot of these talks we've heard today, about data sharing. Allan's group, in particular, at Google is allowing users to put tables and store tables in the Google infrastructure, and then they will, if it's geographical data, put it on a map. There are certain other simple kinds of value you can get for putting data there. And we're doing the same thing. We have something called Azure Data Market, which is basically a brokerage house for commercial data. You can buy or rent access to Dow Jones or Reuters data or what have you. Or get government data for free. And this is an ongoing project, where there's likely to be, you know, a lot more data made available that way. And there's a website called SQL Labs, which has a project there called Data Explorer. It looks like Excel, but in fact what it is, it's a way of putting data into an Excel-like format, but running ETL programs in order to transform it to get it into the format you want. So, you know, we're all kind of experimenting in this space to figure out what large numbers of users really want. Anybody who is really interested, I'd be happy to give you pointers to the websites. And if you have any problems that you wish a system like this would solve for you, we're all ears. Because I think the whole field is trying to figure out just what it is that databases as a service ought to be offering to customers. Obviously, you know, we offer SQL as a service, and you can just store your database in SQL Azure and it's plain old SQL. But what else? And we're all working hard to figure out what that next step might be. >> Yan Xu: Okay. Thank you very much. We should move to the next presentation from NSF. >> Nigel Sharp: I wasn't at all sure what this was supposed to be. This is all George's fault. I didn't even agree to do this. But I'm going to make maybe three points based upon today's discussion that I think we should emphasize, and I think we should think about how we're going to bring them about.
The first one is related to the comment that was made that most researchers aren't interested in becoming their own system managers. They don't want to know how to run a supercomputer. They don't even usually want to know how to run the cluster in their department. So they hire a system manager. I think in the same way we're going to have to see the development of the data scientist. This has been talked about before, but you won't do the data work yourself. The researcher isn't particularly interested in that. They want to get science done. So we're going to see that specialty. We started doing something of this sort with the supercomputer centers when they were introduced. There were people there whose job was to help you use them. And that's what we need. We need that interface. And we need that to be a valued career path for the sort of people who are interested in doing it. From the NSF perspective, the way you create a career path is quite simple. Universities like money. So you create the mechanism that funds people doing that, and that creates the career path. We can't create jobs. You give a grant to a university, that doesn't give them any incentive to create an entire position, because that's a decades-long commitment by the university. But if you have a mechanism that provides career capability, then you can introduce that, and that's something I think we need to think more about. The expert who understands the data aspects but can talk to the researcher in their own language. Because in many of the conversations that you're going to have, you talk at cross purposes or, indeed, with complete misunderstanding, using terms that other people don't understand. So the interface person, I think, is so important. I can't think of an easier way to put it. I think my function really at this sort of meeting is to answer questions rather than to talk. I didn't write a presentation because, you know, I don't know what you're interested in hearing. I think in many cases, it's easiest if you ask me questions, rather than that. What's the panel coming up next? >> Yan Xu: Who's on that? [inaudible]. >> Nigel Sharp: I wasn't sure who you were planning to put on it. I've got the -- is it here? I'm sorry. All right. Okay. So the other one was sparked by something I thought was missing this morning. We were talking about models, it's been said, for data archiving, data curation, long-term support of data. So there are two things related to that. The first, of course, is that the purpose of storing data is not to store data. It's to enable new research to be done. It's to enable validation of the existing work that was done with those data, but it's also to enable people with clever ideas to do new things with it. So we need to think about, you know, the fact is we're going to throw data away. Data can be funded. It has to be funded. It takes power, it takes resources. But the model that wasn't mentioned this morning, I just want to throw out. I expect other people have thought of it already. Of course, it's libraries. We've had libraries for decades, centuries even. Libraries have a funding model. Some of them are by subscription. Some of them are pay per use. Some of them are supported as part of overhead by universities. There are specialized collections. There are general purpose ones and so on. So I don't think we should dismiss that simply because libraries existed to curate large collections of paper. I don't think we should dismiss that model and possible funding model.
Also, there's a vast amount of funding on the internet based on advertising. Many sites deliver things to people purely in order to get words in front of eyes. That's the advertising model. I don't think we should discount that possibility either, in terms of supporting these things. The problem from our perspective is that NSF runs on a very simple model of starting out research. If you go to the website for the NSF, you'll see the subtext underneath National Science Foundation says where discoveries begin. It doesn't say where discoveries are preserved, curated and held for all time. So we're very good at starting things. We're very interested in getting things going. But we're not terribly interested in long-term sustainability. We tend to figure that if it's a good idea, it will attract its own long-term support. So those models, I think, we need to worry about. Nevertheless, if you end up with something like a subscription service, then you can, in fact, request money on NSF grants to pay your fees to that. It's not an unfunded mandate. It's a multiply funded mandate. It's not an always-funded mandate. What I think of as a great strength of the NSF, a lot of people think of as a weakness. One of the things that is a strength of the NSF is that everything that goes on in the NSF is judged by the community that's asking for money. That is, the review process is handled by the same community that is requesting; typically, reviews are done by people who got funding in the previous cycle. And so what is a standard in each community is applied to the next one. This is one of the challenges we faced when we introduced the data management plan requirement: you can't make a mandate for every field of science and engineering in the country that says you will make all your data available to anyone for free. You can't do that, because there are whole fields of research that the NSF supports where that would actually stop the research altogether. In areas with commercial value in the geosciences, for example, researchers have made agreements with companies. You can't then say that just because I got an NSF grant for part of this work, those data are going to be public. There are legal issues in the social sciences concerning personally identifiable information. So you can't make a mandate. And so what we came up with was intended to be a very -- the other message, the soundbite, was the policy is you have to have a policy. The idea was to make people think about data management. We've always wanted people to share data. Not necessarily for free, but we've always wanted them to. We wanted to make people think about it. I've talked to a bunch of university groups, and one of the things that surprised me after the policy came out was the number of universities where the computing section was saying, you know, let us help you with the NSF mandate to archive your data, because we didn't write it that way. Certain parts of the NSF have added their own restrictions on top of it. Some of them, I think, are unworkable, and I think a number of communities should be protesting. So that was an unintended consequence. Another thing I should say about that, of course, is it's a work in progress. If you don't like it, protest to somebody, because it will be rewritten. The first version of any policy never, never stands forever. Anyway, I've talked enough. >>: Dan? Not bad for a guy who had nothing to say. >> Nigel Sharp: How long have you known me, George?
>>: Do you care to speculate what the portfolio review might mean for astroinformatics, roughly speaking? >> Nigel Sharp: Well, I don't really know what astroinformatics is. I have a number of opinions from a number of people. No, the portfolio review is, for those -- I expect most of you know -- in the Division of Astronomical Sciences. The national [indiscernible] survey predicted that our budget would grow faster than it has. In fact, in fiscal '12, we are $45 million below where the [indiscernible] survey said we would be. This is particularly annoying to me personally, because I gave the talk to the Decadal survey committee explaining our budget projection, which they threw out on the grounds it was too pessimistic to make a plan. In fact, the final budget is even more pessimistic than the one I gave them. It's particularly irritating that they didn't listen to me. So we held this portfolio review to see where money should be spent and shouldn't be spent. At this point, it's too difficult to predict, because we have to do an implementation plan; that is, we have to do a response to the review or to the report. And the response will condition what happens, and the response is tied up with enormous amounts of politics, including the fact that we can't give any details because the fiscal '14 budget request is completely embargoed from anybody being allowed to discuss whether we're going to be in business or not. So it's complicated. I will say -- have we got alcohol this evening? Buy me a beer, and I'll say something else when it's not being recorded. >> Daniel Katz: All right. So I'm Dan Katz in the Office of Cyberinfrastructure. I guess I should say with a caveat that this is what the Office of Cyberinfrastructure has been and will be until at least September 30th, and then it's not completely clear what will happen after that. But I think this will continue at least according to the memos that I've seen, so hopefully this is reasonable. So cyberinfrastructure, just to mention for people that aren't aware, is computer systems, data storage systems, instruments, repositories, visualization and people, linked together by software and high performance networks. I think this is the best definition that I'm familiar with for cyberinfrastructure, because it has the people involved and it also talks about the software and the networks as the things that really link things together. So software here is part of the infrastructure, developed by individuals and groups in this context, developed for a purpose and used by a community. So this isn't general software. This is software that's part of an infrastructure. I'm going to skip this for time. So what I think we're trying to do at NSF in terms of software is, as I said before, to create and maintain a software ecosystem that provides capabilities that advance and accelerate scientific inquiry at new complexity and scale. And I guess I would say it somewhat differently than what Nigel said. In OCI, we feel like what we're really doing is not doing research, but providing the infrastructure that people can use to do research. So that actually is a little bit of a difference between our part of NSF and most of the rest of NSF. So in order to do this, we need to have research that goes into this. So there's foundational research in computer science and mathematics and astronomy that develops the algorithms and develops the tools and data analysis routines that then become part of the software infrastructure.
And by using that, we actually get science out in the end. Ideally, transformative, interdisciplinary, collaborative science and engineering. But in many cases, domain-specific, incremental. We'd like more of one than the other, but we're happy, really, with science coming out. >>: I'm sorry, not incremental. The opposite of transformative is foundational. >> Daniel Katz: Okay. Well, we do incremental too. >>: Read the policy document. >>: Those papers are incremental. >> Daniel Katz: In any case, the other two pieces that go along with this are the fact that we want to try to transform practice through policies. And policies end up being a big challenge, as many people know. So ideally, we would like to create incentives to challenge the academic culture, to promote open dissemination and use, reproducibility and trust, sustainability, policies for governance of community software, policies for citation of software and use of software, stewardship and attribution of software authorship. So these are really the hard things that we have difficulty doing, but we feel really need to be done in order to have this ecosystem that really lives onward over time. And then finally, we need to use the software in the ecosystem to train students, to train the next generation, to train the people working in the field today. And we need to make sure that the people that are coming up both know how to use the software as well as know how to write software in the future, or else we'll be stuck where we are. So I would say that this is really the view of OCI in terms of where software is as an infrastructure, but you could actually use this same picture and substitute data or substitute hardware or a bunch of other things. This is intended to be fairly broad. So the goals of OCI, in terms of software, and I've kind of said some of this, are that we want to encourage and contribute to the development and deployment of comprehensive, integrated, sustainable and secure cyberinfrastructure at the national and international scales. And this is including, again, software as a piece of this. We want to have an effective cyberinfrastructure impact that has clearly defined benefits across multiple research disciplines, so there are a number of things in software that probably don't fit into OCI. They may fit into astronomy. They may fit into other disciplines. We want to focus on the things that are impacting multiple disciplines. We want to promote the transition from basic research to effective practice. Again, the basic research probably is not our part of it. But moving things that have been proven in research into the infrastructure is, and we want to build on existing and upcoming investments as well as CI investments that are coming from other units, so that we're really kind of putting things together, not doing a bunch of individual pieces. So it's worth mentioning how we are working in NSF in terms of some of the software programs, particularly the SI-2 program and the CDS&E program that are coming up. In the SI-2 program, in particular, we have a cross-NSF software working group that has members from all directorates. We talk about solicitations ahead of time, determine which directorates and which divisions are going to participate in each solicitation, discuss and participate jointly in the review process, and then work together to fund worthy proposals.
So I'm kind of going through this just from the point of view that if you're writing a proposal, you might want to think of who's going to be looking at the proposal and how decisions are going to be made. And the fact that your proposal is not going to go just to OCI or just to astronomy may actually make some impact on how you write things. There might be people in other fields, also, that are looking at this and trying to say, is there some piece of the software that is going to be reusable in my discipline also. So if we put this together, particularly in the SI-2 program, which is the sustainable software program, but in software in general, we think of things in multiple classes. So we think of software elements as kind of these small pieces that are written by one or two PIs, but again, that are sustainable elements of cyberinfrastructure, not research activities. We think of multiple elements being integrated into frameworks or other integrated activities. They don't have to be frameworks. They can be some other collection. We think of, then, in the future, software institutes that will look at the elements and the frameworks and try to build whatever community things are needed around them. So these might be policies. These might be workshops leading to new requirements that then get fed back into the software developers themselves. There's a variety of things. There could be activities looking at new publication mechanisms, new access policies, a few different things. So as we go kind of across the spectrum, we get larger and larger communities, ideally. So the community that we would want to see for a software element that we would view as a successful one may be fairly small, but it needs to be well defined. If we're looking at an institute, it needs to have a fairly large community that's well defined. And finally, in all these pieces, we really expect that these are going to be reusable. So the software that's being generated is not being generated just for one science question or even one discipline, ideally. It's something that's going to be useful across a number of disciplines. If you're interested in the SI-2 solicitations in general, there's a couple of them listed on the bottom. The first one was the elements and the frameworks. The second one was the institutes. These have both passed, but there likely will be more equivalent solicitations coming in the future. The one that's second from the bottom is one that's for partnerships between the U.S. and the U.K. in chemistry, specifically. We had some interest from EPSRC in the U.K. and managed to figure out some place that we could work together. We're certainly interested in doing this with other countries as well and other funding agencies. And I think in future versions of the solicitations, we will probably be much more explicit about international partnerships and the fact that we're looking for them where we can find funding agencies on the other side that will fund the other part of the partnership, because NSF won't fund things outside of the U.S. in general, but we're very happy to look at projects where we can work with another agency and fund a larger project. And then finally, I just wanted to mention that, again, for people that are in the U.S., in the GPG, there are opportunities for supplements and EAGERs that are always open. So EAGERs are exploratory, kind of small projects.
I don't know, Nigel probably will be very unhappy if I say this, but I would say that many program officers are really eager to hear about ideas that people have for extending their current projects and for new exploratory projects that come in outside of the formal solicitation mechanism. >> Nigel Sharp: Why would I be unhappy about that? >> Daniel Katz: Okay, good. I agree with that. >> Nigel Sharp: The early-concept grants for exploratory research -- the whole idea is, you know, testing key ideas before you write a full proposal. Very, very valuable. >>: How do those get funded? >> Nigel Sharp: Well, the problem with an EAGER, of course, is that since normally people don't submit until they've discussed it, they generally have close to a 100 percent success rate, because they don't get submitted until the discussion has reached the point where, you know, it's going to get a good reception. >> Daniel Katz: I think there's a fair number of these, and at least before I came to NSF, when I was just at a university, I really wasn't very much aware of these as opportunities. So I guess I just wanted to make sure that people are aware that program officers in general are interested in hearing about new ideas that we can try to encourage to lead into a future proposal. >>: [inaudible]. >> Daniel Katz: They are 300,000 at maximum? >>: It depends on the division. Engineering ones tend to be more, because engineering is more expensive. The math ones are cheap. >> Daniel Katz: I think the GPG says 300,000 maximum. >> Nigel Sharp: That's the maximum, but that's because it has to meet all -- it also says at levels commensurate with normal awards within the program. So it varies. I wrote that language too. >>: You've got a good memory. >>: He wrote it. >> Daniel Katz: So I just want to throw out some open software questions that we have that are things worth thinking about. So one is that software that's intended to be used as infrastructure has this challenge that, unlike in business, where when you get more customers, you get more revenue, if you have more users for open software, it ends up actually often being more work for the developers. And there's no real -- there's nothing that really comes back that benefits you. So there's a question about how we actually address that. Taking something that's a research piece of software that proves a concept and turning it into something that's really reusable is often as much work as the first part, or maybe a multiple of the amount of work of the first part. And so we need to figure out how to really do that in a cost effective way. The question of what NSF can really do to make these things easier, I think, is an open question. So the simple answer of giving more money is probably not the best one. But if there are systematic things that can be done that help these situations by trying to group people together, group projects, or looking for common issues, common solutions, we're certainly interested in that. Another question is how we should measure the impact of software. I think this is a large question that we have. If you have software that you're offering for people to download, the easiest thing that you can do is to count the number of people that download it. That's also probably the least effective thing that you can do. We'd much rather know how many people are using it. And beyond that, we'd rather know what they're using it for. If people are really using it for science, if they're using it for education, if they're using it just to test it.
I think those things make kind of a big difference. We have a fixed budget in some sense. I mean, not necessarily year-to-year fixed. But within a year, it's fixed, and we have to figure out what fraction of the funds should be supporting existing projects, keeping things maintainable, keeping things as infrastructure, versus putting money into new things that are going to become the infrastructure in the next few years. I think one of the questions that comes up here is, in particular, for software that's intended to be downloaded and built versus software that's intended to be used as a service, there are probably fairly substantial differences between how we look at that being used and what the rules are between what we want to encourage and what we want to discourage. And I don't think we really have a good sense yet of the software that's a service. I think we have a pretty good idea, for software that's downloaded, of how to encourage different people to do things with it and how to measure it a little bit and some of those things. But the service piece is a little bit more fuzzy. If we're bringing new pieces of software into the infrastructure, then we have to eventually stop supporting other pieces, again with a fixed budget, so I don't know, at this point, how we make the decision that something has reached the end of its life and we're not going to support it anymore. Finally, almost finally, encouraging reuse and discouraging duplication is, I think, quite a challenge, because the piece of software that somebody writes maybe is not 100 percent ideal for somebody else, but it's sometimes a lot cheaper for us if they use that software than if they develop something else. So the question of what the right balance there is, I think, is a good question, and how we make the original software more reusable so there's less duplication fits into that. And then finally, following on what Nigel was talking about, we want to create and support career paths for software developers as well as for people working in data curation and data issues. And this is also something where we can't really do this by ourselves, right. I think Nigel was exactly right. Giving money in grants helps, but it needs to be really a long-term commitment, and that has to come from the places where people are hired, the universities and the laboratories. And so I think that's all that I was going to say. >> Yan Xu: Questions? >> Daniel Katz: I'll also say that I'm happy to have general questions now, and I'll be around this evening and tomorrow as well for anybody that wants to talk about anything specifically. >>: In terms of when to stop supporting software and general maintainability, isn't there a parallel with, like, instrumentation? Can that serve as a model? >> Daniel Katz: I don't know. I guess there are clearly life cycle issues in almost any kind of infrastructure and any kind of research. But I'm not really quite sure how far the parallel goes there. >>: If you support the creation of the instrument, at what point does it become something that you don't want to support anymore? >> Daniel Katz: Right. So I think there are multiple issues with software. And one thing is that, at least the feeling in a workshop last week in Chicago is that community codes have kind of a natural lifetime of maybe 20 years or so before the underlying computer architecture changes so much that we need to write something new.
I mean, so that's one way of looking at it. Another way is when somebody in the community develops something and people start moving to that next thing, then maybe we stop the first thing. But it's -- I don't know. We don't necessarily want to be making decisions on who's winning and who's losing. We want to try to figure out the way that the community makes these decisions, and we fund the things that the community needs in order to do the science. And I don't know that it's really clear for us exactly how to do that yet. >>: There are interesting intellectual property issues that go under this. In that if I have a grant to do science from NSF and I write software as a result of that, I own the copyright, or my university does, to that software. I can do what I want in terms of it. If I'm funded from an outside project where the deliverable is software, then the university owns it. Then the university owns the copyright, at least in the University of California. Are there things you do to sort of encourage, in those circumstances, that that software, in fact, gets released under a BSD license so that other people can use it? >> Daniel Katz: Yeah, so in the -- >>: Because our IP office is impossible. >> Daniel Katz: So if you write a proposal into the SI-2 program, you write how you're going to make the software usable. And so if you, as a PI, write that you're going to release it under GPL or whatever license you pick, the university then will either agree with that or not. But if the proposal goes in, then we assume that we're going to work that out. I would say that OCI, in particular, unlike some parts of NSF, is not really strict on Open Source or on any particular license. We are fairly comfortable with other models. If you can make an argument that Open Source is not the right model to make your software reusable and it's going to be more reusable by having some licensing fee or working with some company, we're certainly willing to listen to that. And if that's what the peer reviewers agree with as well, that this is the way the software's really going to get used, then that's fine. >> Nigel Sharp: I think that goes to the point I made that it's reviewed by the community that's going to use it. I mean, you can say anything. If you say we're going to commercialize this in a spin-off company and the review process says yeah, we're fine with that, that's the best way to make this available to the community, fine. We've had cases where the reviewers say, you know, we don't think you should fund this because this code is not going to be Open Source. So I think that's one of the strengths. You say something, and your community says yay or nay. >>: I guess I'm worried about, if I were to do something like this, getting it through the IP office in time to meet any deadline. >> Nigel Sharp: But your office legally agrees to the conditions of a grant before they draw dollar one from that award. So if they get an award and they take a dollar out of it, they've agreed to all the conditions, including those terms. >>: Remember, it's always easier to ask for forgiveness -- >> Nigel Sharp: Not in the government. >> Daniel Katz: I will say in one of my previous university roles, we had this issue where Open Source software was actually very difficult in general. But if it was something that was in a proposal and that's what you said you were going to do, the university said you have to do it, that's fine.
>>: That makes it easier, okay, because I actually spent six months trying to get something like this through our IP office, even though the award had been agreed on. >> Yan Xu: Okay. Thank you very much. And since our speakers are here tonight to continue the questions and discussions, let's give them a hand. And before we move to the last panel, Eric is going to make some announcements. >>: And before he does that, we will have the group photo after this session. Since we're running late, we'll probably have the panel discussion until, say, 5:15. The dinner is already ready out there. So when we're done here, we'll all go out, take a picture, then you can eat. >> Eric Feigelson: My name is Eric Feigelson, Center for Astrostatistics at Penn State. George has asked me to review for you the organizational changes that are relevant -- >>: Not to review, to mention. >> Eric Feigelson: In three minutes, okay. It's an alphabet soup. I'll start at 2010. The International Statistical Institute is the largest, oh, 150-year-old association for statisticians. It formed an astrostatistics community and now has 150 members. This is now morphing into the International Astrostatistics Association -- this is going to be a joke, there are so many initials here -- probably in the next few weeks or months, which is going to be an independent international organization, a 501(c), to allow the ISI to raise it into something called an associated society. So the statisticians are quite interested in astrostatistics. You should just get the message there. A famous project called the LSST in America, the largest NSF-funded project for the next decade, I think, I'm not sure, formed the ISSC, the Information and Statistical Science Collaboration, in 2010. The leader of this one is Professor Joseph Hilby, retired from Arizona State. The leader of this one is Kirk Borne, who is sitting right over here, and this group has been fairly quiet but is now sort of beginning to talk a lot about how to contribute statistical and informatics expertise to the LSST project. Now, in 2012, that means the last three months, two things have happened. One is the American Astronomical Society has formed a working group in astroinformatics and astrostatistics. And this is now one of a number of working groups that can last for years within the AAS, which is the largest national society in the world of astronomers. This indicates that the astronomical community, at least in the United States, is very interested in astroinformatics and astrostatistics. And then last week, the International Astronomical Union, that's the largest international group of astronomers, formed a working group, I'm sorry for the joke here, of astrostatistics and astroinformatics, which is slightly different than the other working group name. And that shows that the international groups of astronomers are very interested in this. Kirk Borne is the leader of this. [indiscernible], a professor at the University of Washington who will be here tomorrow, I'm told, is the head of this. And their leadership is going to be decided very soon. I may be involved. Actually, I'm sort of involved slightly in all of them. And there is something that I am involved in. This is sort of why I'm interested in this, is that Joe Hilby and I have formed something called the -- let's forget writing it all out -- the astrostatistics and astroinformatics portal, and it's ASAIP.PSU.EDU. PSU is my university; it's just hosting this.
And the reason for promoting this is that this is going to be a vehicle for at least three and possibly all four of these organizations to talk to the public and to talk to themselves in public, and to have resources available both for the expert community and for the innocent community, the thousands of astronomers who don't know very much about informatics and statistics. And also for the communities of computer scientists and statisticians to just sort of browse astronomy and find out what interests them. So I encourage you to -- this website is not finished. In fact, you may go there and feel it's rather incomplete. Well, like it has five forums that sort of don't have any discussion threads yet. Well, that's because it hasn't begun. So I very much encourage you to do two things. One is join one or more of these societies that pleases you. And two, pay attention, especially over the next few months, to the website, become a member of ASAIP, contribute to it, and sort of build the online social network and intellectual community of computer scientists, statisticians and astronomers who have mutual interests in methodology for astronomy. Thank you. >> Yan Xu: So the panelists of the last panel, please. >> Kirk Borne: I'm Kirk Borne. I was just mentioned in reference to the LSST group by Eric. Thank you. So I have been involved in astroinformatics, before it had that name, for a long time, with the virtual observatory since its infancy. >> Masatoshi Ohishi: My name is Masatoshi Ohishi. I'm from Tokyo, National Astronomical Observatory of Japan. I have been involved with the virtual observatory activity in my observatory for the last ten years. And last week, I'd like to say, we had special session number 15, which was associated with the General Assembly. The aim was to discuss not only the current status of various virtual observatories but also how to extract new information through, you know, statistical data analysis, including astroinformatics. So I'm very happy to be here. >> Bob Hanisch: I'm Bob Hanisch from the Space Telescope Science Institute, and I'm the director of the Virtual Astronomical Observatory program in the U.S. I also am the new president of Commission 5, which I've taken over from Masatoshi. Term limits, term limits is what it is. And the working group that Eric just talked about is under Commission 5 in the IAU. So how shall we proceed? I've got a few notes of things that I wanted to use to start things off. >> Yan Xu: Please go ahead. >> Bob Hanisch: So as Masatoshi just alluded, the virtual observatory has been basically a decade in the making. The initial problems we tackled were quite easy. We sort of, you know, skimmed the cream, and we were very optimistic, perhaps overly optimistic, about how easy it would be to carry on from there. Things got more difficult. Achieving national and international consensus takes time. The infrastructure, though, that we conceived and saw evolve over the past decade is now mature, to the point that we are now in some cases doing second and third generation evolution of capabilities. The core capabilities of data discovery, access and interoperability are largely solved problems. And VO-based data access is really now pervasive. Many people in the community do not understand that they are using virtual observatory protocols every day. There are millions of VO-enabled data accesses per month just from the U.S.
via collaborating organizations, the ones that we count, and I'm sure it's another factor of two to ten more than that when you integrate over all the worldwide VO-enabled data resources. So the VO is here now, and people just still are not as aware of it as they should be, and, of course, I think those of us in the VO projects do have to take responsibility for making sure that this dissemination works better in the future. So the challenge now is scaleability. Scaleability has always been part of the plan. We sort of started the VO initiatives in partnership with grid projects. But frankly, astronomers were not interested. They, for the past decade, have said oh, I can just get the data on my workstation, on my laptop, and I'm happy. Don't tell me about the grid. Don't tell me about the cloud. But they have to get interested now, because in the coming decade, the data sets being produced by the observatories of the future, or even now, with, like, LOFAR, are just not possible to bring onto your laptop or your desktop and analyze there. And, of course, astroinformatics and astrostatistics are part and parcel of this next stage. Eric did not mention, for example, his VOStat package, which has now been updated and released by Penn State. There's a nice poster here on CANFAR and Skytree. So we're seeing these astrostatistics and informatics tools being exposed more widely to the research community, and getting take-up of them, I think, is now a challenge that has already come up today in several other contexts. So this whole thing of community engagement is, I think, where we are now. In the VAO, we are participating in summer schools, we are hosting community days. We are going to conferences like the AAS and helping to organize things like the IAU special session to help get the word out that the VO is here, it's ready, and we're really trying to push out capabilities to the community so that they can develop their own VO-enabled tools. We can't write everything for everybody. We've never really intended to do that. But we can provide an environment that makes it easy for people to develop things that build on the VO infrastructure. And, of course, coupling this with education of the community, as Eric, for example, has been doing for many years now about astroinformatics and statistics, is, I think, critical. So now is the time. I think we've got the basics done, and we are ready to move into the age of big data, big information, and using these tools that will allow people to distill the essence of very complicated data products into new science. >> Yan Xu: Okay. >>: Can I ask about the meaning of the title of this session, why from virtual observatory to astroinformatics, and any other combination of words? >> Bob Hanisch: Well, that was George's title. >>: I feel like it was a collection of words. >>: It was the lottery. >> Bob Hanisch: That's what came out. I think the VO and astroinformatics are a partnership, not a transition, although there is a change in emphasis here from discovery and interoperability to now incorporating these tools for informatics and statistics into this framework. So in that sense, there is a from and a to. But I think it's an ongoing partnership. >> Masatoshi Ohishi: I'd like to add some words. The VO concept consists of three categories. One is the registry service, how to -- you know, in order to find data services. Then after finding specific data services, we retrieve data. The next step is computing service.
And my understanding of this computing service includes, you know, statistical data analysis, or astroinformatics and astrostatistics. That's why, you know, VO -- from VO to astroinformatics -- is closely related. >> Kirk Borne: So I just want to say a few words about the end of that title, astroinformatics. More than once today, people have said I don't know what that is, and so I'm just going to take the opportunity, since I'm in the chair right now, to say what I think astroinformatics is and why I think it is a research discipline. For quite a while now, I've been in this area. But the quantum jump in my interest in this field happened back in the summer of 2003, when I heard Jim Gray give that talk on online science, which we heard mentioned this morning. He gave that talk in a number of venues in 2003, 2004. And he talked about X-informatics, where X referred to any discipline. Bioinformatics, geoinformatics -- he didn't actually say astroinformatics in that particular slide, but that transition in my head from bio to geo to astro happened. The a-ha moment for me was that these are independent, in a sense, independent research disciplines within the bio community, within the geo community. They have their own journals. They have their own departments. They have their own conferences. They're not infrastructure. Too often, there's feedback of why are you calling this astroinformatics; you're just talking about computing infrastructure. And it's almost the same sort of self-definition, or self-identification, that the VO community has had, which is: are you just a computing infrastructure project, or are you an astronomy research project? And I think those who have been working on these projects recognize them as astronomy research projects. But in order to make the astronomy research enabled, you need that infrastructure. So when I think about astroinformatics, I think about many pieces, and maybe this is too broad. But for me, it is broad. It's the data access and discovery. It is these registry services. It is what terms you choose to index the data, to tag the data, to edit the data. And those could be done by humans or they could be automated. We heard a little bit this morning about automatic tagging, automatic keyword extraction, whether it's in an abstract of a paper that talks about the dataset or it's actually, you know, a set of keywords that the author provided. But how do you index data to make it discoverable and accessible through things like the VO? So it's that data access, discovery, data management stuff, if you will. There's also data structures. I think of data structures as informatics. So how do you visualize multidimensional data? I mean, you have to put the data in some special structure, perhaps, to enable that. How do you do an N-point correlation function in a thousand dimensions? Again, what is the data structure? Is it a graph? Is it a table? Is it a cube? So these things are part of research: understanding what are the best ways to make the data interactively explorable by the research community to enable research. And then, of course, more than that. So it's those things, but it's more than that to me. It's also the machine learning, or the application of machine learning to large data sets, which is data mining. So frequently, those terms are interchangeable. So I tell my students, the application of machine learning algorithms to big data is called data mining. The application of machine learning to machinery is robotics.
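As a small illustration of that data-mining sense of machine learning, here is a hedged sketch of a decision tree separating sources into two synthetic classes. The feature names, thresholds and labels are invented for the example; no real survey pipeline is implied.

```python
# Toy "machine learning applied to a catalogue" example: classify synthetic
# sources as stars (0) or galaxies (1) from two made-up features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 2000
concentration = np.concatenate([rng.normal(3.0, 0.3, n), rng.normal(2.2, 0.3, n)])
color = np.concatenate([rng.normal(0.4, 0.2, n), rng.normal(0.9, 0.3, n)])
X = np.column_stack([concentration, color])
y = np.array([0] * n + [1] * n)                 # synthetic labels only

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.score(X, y))                          # training accuracy on the synthetic sample
print(clf.predict([[3.1, 0.35], [2.1, 1.0]]))   # likely [0, 1] for these feature values
```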
So you talk about decision trees and neural nets, et cetera; these are the exact same algorithms that you learn in those robotics systems. So developing those algorithms is research, okay. It's not computing infrastructure. It's actually applied mathematical and, in some cases, pure mathematical research. And then, of course, statistics. So statistics, again, there's such a rich wealth of algorithms historically, plus still being developed by that community, and all of that enables discovery from data. So these are all informatics, and specifically applied to astronomy, astroinformatics. And lest I leave one off, seeing Matthew sitting here, semantic astronomy. The whole concept of ontologies, knowledge representation. Because when we're dealing with big data, the real issue is how do we get the data to a point where it's so manageable, I can do something with it? So it's that extraction from data to information to knowledge to get us to that insight. So it's that value chain where we're dealing with petabytes down here at the data level, but how do we go from that to extracted information, finding the patterns, correlations, trends in the data in that information stream, which is now the knowledge, okay. And from that knowledge, actually understand what it means for the universe. And that's the insight and understanding step. So that's quite a reduction. So you can imagine all of the data that goes into a project that's looking for the Hubble constant or the dark energy parameters. At the end of the project, there's one number, which people publish, and then they become famous. But in order to feed into that one number, there's an enormous amount of experimental results. And so how do we do that for other types of things which are maybe not quite as sexy and high profile as dark energy or Hubble expansion or something like that? How does the scientist deal with huge quantities of data to extract that nugget? So for me, in my career, it was always the galaxy merger rate. For me, that was the holy grail of my career. I was going to help find what the galaxy merger rate is. And whether I achieve that or not is probably almost immaterial now in the late age of my career here. But for me, the things that I did were driving towards a single number: the number of mergers per galaxy per Hubble time. Of course, that's an oversimplification of a complex problem. But that's sort of the idea, how do we go from the big data to the big idea, which is what we're seeking in science. And so for me, astroinformatics incorporates all of those pieces, and maybe that's too broad of a definition. But for me, I think that explains why we can have a community sitting around this room where we're talking about data management. We're talking about databases. We're talking about data integration, data warehousing in a previous talk, but also at the same time talking about statistics and, more tomorrow, about data mining algorithms and the applied math that goes into all of this. >>: I think Kirk's basically right, although I think you missed a couple things. >> Kirk Borne: Well, yeah. >>: The library, electronic publishing, and also, I think, means of enhancing collaboration and communication. >> Kirk Borne: I just didn't have it all. >>: So the way I view any science informatics is as a means by which computing technology can and has stimulated some domain science. In our case, astronomy.
And in that sense, it's probably going to be a perishable field, because soon enough, all of astronomy is going to be astroinformatics, or maybe it already is. All of biology will be bioinformatics, in some sense. This will be just a normal way of doing things. But until that happens, until the communities embrace new ways of doing things, we do need these bridge fields. The second role for the science informatics fields is, I think, to talk to each other. To share the methodology and tools or ideas or experiences so that we don't have to reinvent the wheels. And in that sense, I think you're right. It should be broad, and anything that helps us do better science using computing and information technology in some tangible fashion belongs to it. So in that sense, I would say statistics obviously means understanding the data, but it's just one part of the whole picture. I think the only reason why we have working groups of both astroinformatics and astrostatistics is that Eric is interested in both and he was the one who took the trouble to form them. >>: Just a short answer to that, I like to say that informatics deals with the volume and statistics deals with the complexity, the variety. >>: I disagree. I think that data mining deals with complexity. >>: No, I'm just saying that the discovery -- dealing with large data is different than the sort of statistical validation of your results, which can be done on small data. >>: Well, [indiscernible]. Hubble [indiscernible] take a dozen or 20 spectra or photographic plates, look at them, do a bit of thinking, and come up with the expansion of the universe. We now deal with petabytes of data. No human mind can actually deal with that amount of data. So for me, astroinformatics is the process of extracting science from the data, because we've moved beyond the point where you might just look at the data and extract the science. It's actually difficult, those steps, all the way from data, information, knowledge to science. You need sophisticated tools. >>: I think he said the same thing, only more compactly. >>: In the game of alternative definitions, the definition that I've been working with, in terms of talking about a discipline or sub-discipline of astroinformatics, I think could battle with Kirk's, probably could battle with George's and maybe [indiscernible], is one which focuses on informaticians or informaticists, in the sense that there may be a category of people who are perhaps computer scientists who find this a particularly interesting area, or scientists gone to the bad, in terms of getting their kicks more from the technology and making their contribution through the intersection area. So it may be that if there's a discipline of astroinformatics, it's primarily motivated, or defined, by the interests of [indiscernible], like some sort of intersection between those two disciplines. >>: Let's say that I find this very refreshing, because for almost one century, of course -- if I were to ask [indiscernible] to provide an operational definition of geometry, I think you all know that the answer would take a couple of days. In a couple of days, he would say geometry is the set of concepts or knowledge which [indiscernible] over X [indiscernible]. It's not a definition --
They are two completely different things, and it's difficult to say which includes which. In my opinion, astroinformatics includes astrostatistics, because astrostatistics is one thing which is needed to do astroinformatics, but I may be wrong about that. I will discuss it with Eric. But I don't agree that it's not the specific [indiscernible]; what we're discussing is what's happening here, and astroinformatics is actually the knowledge we need in order to solve it. The problem is just that the physics is phasing out. So it is agreed that this incorporation of [indiscernible] networks is everything which we need to solve this problem. >> Yan Xu: Okay. It's been a long day and we're behind schedule, so we'll take the last question and, of course, we'll have the opportunity to continue discussions at lunch time. Dinner time. Alex. >>: Just to return the favor for the quantum computing question, to get you out of your comfort zone: so what would the world be like in this field ten years from now? Just a three-sentence question. >>: The outgoing president of -- >>: Ten years from now, I believe, you know, astroinformatics or astrostatistics, whatever it is, various useful tools will prevail and many instruments will be used without people even seeing them. So that's the way we should go, and I'd like to stress that useful tools and successful use cases, I think these are the keys for our success. >>: I'm hoping in ten years astroinformatics won't exist. It will be totally irrelevant. >> Yan Xu: Well, let's thank all the speakers for very stimulating discussions and presentations, and the panelists that get us set for the next two days. Okay. We'll follow George's instruction for the picture and then let's enjoy dinner. Thank you very much again.