>> Yan Xu: The first keynote speaker will be Alex Szalay, and he will talk about the Genealogy of the SkyServer: From Galaxies to Genes. Thank you, Alex.
>> Alexander Szalay: Thank you very much. So it has been kind of more than ten years now that we have discussed [indiscernible] life, and I thought this is a good time for a bit of a retrospective, and also to see, actually, what the spin-offs and lessons learned are from the ten years of the SkyServer.
So what we see today, so kind of the starting theme, is basically that there is a diversification in scientific computing, where ten years ago it was all supercomputers. Today, we see that one shoe doesn't fit all. And we also see all sorts of evolutionary pressure, often dictated by limited budgets, so that we try to make every dollar count when we try to create a computing environment or decide what to use.
And so in this world, the diversity grows naturally, no matter what, because individual groups want specializations. And we see, for example, that over the last few years, large floating point calculations have moved to GPUs, but where we needed lots of random IOs, people started to plug solid state disks here and there into their systems.
We are learning more about how to write algorithms and build hardware to do high speed stream processing, and then we also see a diversification in terms of software. Where there were kind of [indiscernible] file systems, or we had [indiscernible] databases, now we have column stores, we have NoSQL, we have SciDB, so a whole lot of possibilities are offered.
So at the same time, one can ask, okay, what remains in the middle with all
this diversification on the surface? And how far do similarities go between
different areas of science? And kind of the common denominator is big data, or
data management in general. And one can even define data management, that data
management is what everybody needs and nobody wants to build.
And this is in many ways the reason why we don't have a good off-the-shelf solution to handle this problem, because everybody is just trying to do the bare minimum that their particular project requires, because there is no incentive to do something better. Because that would require a substantial extra investment, unfortunately.
So we are still building our own, unfortunately, over and over. And so when I look back at the ten years of Sloan and the SkyServer, it is clear that, while we have talked up to now about the data life cycle, there is very clearly not just a data life cycle, but there is also a service life cycle.
So basically, we are not just putting out data objects, but we are putting out data objects through inter region services and smart services. Basically, the services were built on a particular software environment and framework and operating system. And some of the web services for Sloan were the first web services done in science.
And since then, the service environment and distributed computing has gone
through several changes in the paradigms. And every once in a while, just as
much as we need to curate the data itself and fix the bugs and so on, we need
to give a face lift, actually, to the services and apply the most up to date
computational paradigms and the frameworks and tool kits in order to keep them
portable.
So with this preamble, I'd like to get back to Sloan, which really changed my life in '92, and little did I know what I was getting into, for better and for worse. So it was overall an amazing journey, but a journey I thought would take only about eight years. Instead, it has now taken 16-plus, and there were some dark nights in between. We were two months away from bankruptcy several times. And somehow, some money always came along and we were able to go on.
And so now Sloan is in its third incarnation. So there is Sloan 1, Sloan 2,
and now we are on Sloan 3. And there are now plans shaping up for Sloan 4.
But the archive kept going, and every year, consistently, we were putting out a data release, starting with the EDR, the early data release, and then there was DR-1, DR-2, et cetera. And the data is public, and has been. And amazingly, we set up this detailed schedule for the data releases, and then the data processing and even the data acquisition was delayed, but we never missed a data release.
In some cases there were like a week or ten days to go before I got the data, and I got the data loaded before the data release happened. So a lot of people don't know that for most of the survey, there was essentially zero proprietary period for the Sloan project. So our staff saw the data at the same time as it was put out for the public.
So we set out to take two and a half terapixels. In the end, we covered about five terapixels, because we now imaged much of the southern sky as well. The plan was to take about ten terabytes of raw data. But now, when we have to wrap up basically Sloan 1 and 2, the Sloan 1 and 2 legacy dataset is about 120 terabytes. We thought that the database would be about half a terabyte. Today, the main SkyServer that everybody's using, the latest, DR-9, is now about 13 terabytes. But with all the ancillary projects and data products, we have more than 35 terabytes up in the public services.
And so it started in '92 and finished in 2008. And what is amazing about this is that, of course, if it had stayed on schedule, we could not have done any of this, because this was enabled by Moore's Law and Kryder's Law. So it just became cheaper every year, basically.
So sometimes it's not so bad to be a little late.
So when we built the SkyServer, again, not too many people know that this was actually for outreach. So the project was violently against using SQL Server. And so basically, it was a sort of [indiscernible] project. So Peter, Annie, Jim and I basically, over a Christmas holiday, took the old database from Objectivity and ported it into SQL Server, and then basically built a simple front end, and then we said, okay, we don't want to change the public face of the project. This is just outreach on the side.
And after a few years, everybody started using this, so we got a lot of help from everyone. Curtis helped to design, basically, the whole website and the individual look of it. And now, in ten years, we are at one billion web hits, a little over one billion web hits. And we recently did a count; we are now at 4 million distinct IP addresses among the users, which is pretty decent.
And what we also see is the emergence of the internet scientist. So these are
people who are not professional astronomers but are also more than just the
casual user who after reading a New York Times article just clicks on an image
on the Sloan website. So there are people who keep coming back.
And today, the [indiscernible] has been doing a yearly evaluation of citations, and a couple of times out of the last two years, Sloan has been the most used astronomy facility. Also, there is a lot of collaborative server-side analysis done.
I will mention it after. So this shows, actually, both the website traffic as
a function of time since 2001 and also the SQL queries. And the SQL queries
are the queries that people are actually typing in. So this is not the service
interface. It's just a pure SQL that's coming in from the outside. So this is
for the SQL traffic. This is the one million mark per month. So that's
actually quite a lot.
So these are the main components. Now, it has grown into a fairly
sophisticated system, so it's more than just a database on the front end. It's
actually a fairly complex environment for how we go from the raw data. So we have transform plug-ins, which convert the different [indiscernible] into a loadable format, and then they create little preload databases, which have the
same schema as the main database, but then we kind of crunch the data there and
scrub the data for all the errors before they can get loaded into the real
database.
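As an illustration of that scrubbing step, here is a minimal sketch of the kind of validation query a preload database can run before the merge; the table and column names (PhotoObj, ra, dec, r) are only loosely modeled on the Sloan schema and are not the actual loader code.

    -- Sketch of a scrubbing pass in a preload database (illustrative names):
    -- collect rows with impossible coordinates or missing magnitudes so they
    -- can be fixed or rejected before they are loaded into the real database.
    SELECT objID, ra, dec, r
    INTO ScrubErrors
    FROM PhotoObj
    WHERE ra NOT BETWEEN 0 AND 360
       OR dec NOT BETWEEN -90 AND 90
       OR r IS NULL;
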
We have a very elaborate schema framework where all the metadata you need, the schema descriptions, everything is held together, and it's automatically parsed multiple ways, like the old WEB language by [indiscernible]. So one parser generated the code and the other generated the TeX documentation.
So we try to follow the same philosophy. We have a parallel loader that keeps track of the loading, so you can follow every dataset and where it is in the pipeline. We wrote a whole bunch of sequences to SQL to provide the [indiscernible] knowledge. Then we have a batch scheduler for the long jobs, and the results of those go into a MyDB, and we are also currently working on a much easier converter, a Dropbox-like interface, for how you can upload your own data into MyDB to join with the main database.
And, of course, we have SQL and web logs. It's now a couple of terabytes. So from day one, we have essentially every mouse click logged. So at some point, this will be very interesting data for science history, on how people have been using it and how the usage patterns changed. We are still looking at profiling, figuring out how to improve the services.
So the MyDB is one of the most characteristic and unique parts of this. So basically, after a while we registered our power users and gave them a user-side database. This was done by Nolan Lee, who was a grad student, and he also spent several summers at Microsoft as an intern.
So the query goes to MyDB.
>>: He was not a grad student. He was a full-time tech.
>> Alexander Szalay: So this can be joined with the source database, and then the results can be materialized into the MyDB. So you can download it in any of the formats you want. The beauty of this is that it still gives people all the power of the relational database tools, but also the flexibility on the servers, while keeping the original database read-only, so you don't have to touch anything inside it.
And it increased the performance of the database by a factor of ten, because the outputs are going over a high speed network to a local high performance server instead of over the worldwide web.
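For readers who have not seen CasJobs, a minimal sketch of such a query follows; the MyDB table MyTargets and the column myLabel are invented for illustration, and the PhotoObj column names follow the SDSS schema only loosely.

    -- CasJobs-style query (sketch): join a user table in MyDB against the
    -- read-only archive and materialize the result back into MyDB.
    SELECT p.objID, p.ra, p.dec, p.r, m.myLabel
    INTO mydb.MyMatches
    FROM mydb.MyTargets AS m
    JOIN PhotoObj AS p
      ON p.objID = m.objID;
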
And so right now, we have about 6,800 registered users using the MyDB,
basically, day in and day out, which is a good fraction of the astronomy
community. Users can collaborate and this basically emerged naturally.
So after a while, people kept the output data tables, and they said okay, can I
share this table with my friends? We are working on a paper together. So we
enabled group sharing. After a while, people said okay, here is the paper. We
published it. Can we actually make this table very visible? So it became also
a publishing platform.
And it's really a very, very powerful paradigm. Data delivery is via a C-sharp
web service, and this is what's kind of the real manifestation of sending the
analysis to the data. So a lot of these simple joins, which would otherwise be quite complicated because they involved millions of lines, were done server side.
So this is the genealogy of the SkyServer. So the two components. So
basically, the basic SkyServer containing the Sloan data, and then this is the
CasJobs MyDB component. And this joint system has already been cloned by the Millennium simulation database, Palomar QUEST, and GALEX. We are now building a genomics archive at Hopkins. The basic engine has been used [indiscernible] for environmental sensing, for radiation oncology, for Galaxy Zoo. We built a turbulence simulation database. Then the whole Edinburgh operation, SuperCOSMOS and UKIDSS, is based on the SkyServer. The Hubble Legacy Archive has now been converted to this framework and to the tools. And then these are all the VO services. This is a much bigger cloud, because we have essentially most of the major astronomical data catalogs [indiscernible] as contexts inside MyDB. So you can actually join here not only with Sloan, but also with GALEX and all the other catalogs outside in astronomy.
And then out of the turbulence database, we actually built another derivative for [indiscernible] hydrodynamics, and now we are building a bunch of cosmology simulations as well, which will derive partly from the Millennium, but also take some of the lessons from the turbulence database.
So you can see these are all different science projects, spanning not just
astronomy and different aspects of astronomy, but also many other sciences.
Of course, the SkyServer itself was derived from the TerraServer. That was Jim's project, going back another five to six years. And basically, the first SkyServer was really actually taking our schema and basically just turning the TerraServer, which was the first Virtual Earth or Google Earth, outside in. Or inside out, sorry.
So Galaxy Zoo is one of the derivatives, where basically we took the Sloan images and, using the mosaic or the little cut-out service, asked the public to visually classify 40 million galaxies. We expected that there would be a few thousand people coming. Instead, in the next few days, we had about 300,000 separate users come in, so we had to install new servers, we had so much traffic.
And there were a bunch of real original discoveries made that have since been sort of [indiscernible], that have been observed by the space telescope [indiscernible] and so on.
>>: So the day you turned Galaxy Zoo on and then suddenly things catch on
fire, all right. And you say we need more servers. Do you just pull out a
checkbook?
>> Alexander Szalay: No. Luckily, we had a couple of boxes sitting with servers that were not yet installed; they were there basically for the [indiscernible], so then immediately we were plugging those in and putting them online. So this was a lucky coincidence, otherwise.
But anyway, certainly getting [indiscernible], being on the BBC nightly news, getting a five-minute segment on the BBC nightly news, that was a big help.
So anyway, a lot of the Sloan services then formed the basis of a bunch of the VO web services. And there will be several talks later on in the conference about the VO, so I would rather just focus on kind of the sky query and how we describe footprints and exposures in the sky. And again, this whole spatial tool kit is now used for Sloan, the HST exposures, and it's at the heart of GALEX, Chandra, Pan-STARRS, and we're also testing it for LSST. We have written it in C++ and plugged it into MySQL, which LSST is currently using for the test bed. And we are about to release the OpenSkyQuery, which will really be able to do cross matches of a hundred million objects against a hundred million in a matter of two or three minutes.
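As a rough illustration of what such a server-side cross match looks like in SQL (a simplified sketch in the spirit of the zone algorithm, not the OpenSkyQuery implementation; SloanObj, GalexObj and the 3 arcsecond radius are assumed for the example):

    -- Candidate pairs are limited to a thin declination band, then filtered by
    -- the exact angular separation. Catalog and column names are illustrative.
    DECLARE @r float = 3.0 / 3600.0;          -- 3 arcsec matching radius, in degrees
    SELECT a.objID AS sloanID, b.objID AS galexID
    FROM SloanObj AS a
    JOIN GalexObj AS b
      ON b.dec BETWEEN a.dec - @r AND a.dec + @r
    WHERE SIN(RADIANS(a.dec)) * SIN(RADIANS(b.dec))
        + COS(RADIANS(a.dec)) * COS(RADIANS(b.dec)) * COS(RADIANS(a.ra - b.ra))
       >= COS(RADIANS(@r));                   -- cos(separation) >= cos(radius)
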
So the spherical library was about 85. So originally, we wrote this with Jim,
and that was not the dot-net version, but it was all computational geometry
written in SQL that was not the task for the everyday programmer. So working
with Jim helped, certainly. And then Thomas rewrote it in C-Sharp.
And now it's an OS-independent realization. We also now have a pure C++ Linux
implementation of this, but it does a very complex -- it's really two packages.
One describes, basically, the geometry on the sky spherical [indiscernible] and
the relationship between them and the other one does basically a very fast
spatial index and spatial searches on the Boolean operation of all those. And
it can compute exact areas.
And debugging this on the Hubble deep field was an amazing experience, because some of the intersections of the hundreds of exposures and visits sometimes created little polygons which were a few milliarcseconds on a side. And in Sloan, we also have polygons which are 130 degrees on a side, and it has to work with both.
Okay. So we also added a bunch of enhancements to be able to do computations within the database, which are also quite generic. So besides all the geometry, the curves, the space-time geometry in the database, we also added multidimensional array data types and CUDA code in user-defined functions. We can generate arbitrary multidimensional arrays, and this is not just on the sky, but inside a 3D simulation. We have incremental PCA over all the galaxy spectra in 15 minutes. We have direct visualization engines which are integrated with the database, and we are now working also on integrating MPI with the database cluster so that an MPI application can converse with a parallel set of database engines.
So this is OncoSpace, the radiation oncology database; just a slide about it. So this is using the SkyServer engine. What it is tracking, basically, is the evolution of tumors after treatment at the radiation oncology lab at Hopkins, and it basically provides a much faster feedback loop for the doctors to see whether a particular patient is on track, basically, to recovery.
So this is a slide on the Tera Base search engine. In genomics, we expect to
see the same transformation as happened in astronomy ten years ago. So kind of
before Sloan, every astronomer who got data from somebody else had to redo the
basic image processing. Nobody trusted each other's data.
And after Sloan, people tried this again, and they realized that after a while,
that the data is good enough. So the objects, the calibrations are good enough
to do actually the statistical queries and the sample selections. And I think
in genomics, the same thing is happening. So right now, everybody is still obsessed with doing the alignments themselves. And once people realize that the
alignment done by a central site is good enough, they will start focusing on
just doing the statistical searches over thousands of individuals to see the
variations and the commonalities across the genomes.
And so then we need a search engine that can do this job. And basically, for the Thousand Genomes Project, it's about 200 terabytes of data and growing. It has more than a trillion short reads. 90 percent is aligned to the reference genome, but ten percent is unaligned, and we already built a prototype where, in a few milliseconds, we can find any of the sequences in this 10 percent unaligned, which is the hard job. If it's aligned, you basically just have to align the query sequence against the reference genome.
But the outputs are also going to go into MyDB. So basically, the same framework applies beautifully.
Life Under Your Feet is soil biology, taking a lot of sensor data and integrating it with life biome data across a bunch of different areas. And the goal is to understand the soil carbon dioxide emission. The soil puts 15 times more carbon dioxide into the circulation than all the human activities. But the human activities can modulate this: what we do in the rain forest and so on. So basically, we are trying to collect data using wireless sensors, and this is the number of samples that we have collected over the years.
The first few projects were started with Jim, and we wrote the first papers
with [indiscernible] and Jim involved. And then you can see that now we are
putting out more and more experiments, and we are basically growing fairly
rapidly. So we are at about 200 million data points right now. And this shows some of our deployments. So this is an NSF site near Baltimore. This is in the Brazilian rain forest, which was then coordinated and done in collaboration with Microsoft.
This is up at Atacama, next to the cosmology telescope at Atacama. So one of
our colleagues carried the suitcase of sensors up there and it was up for
almost a year. And this is a visit in the lab by some people you may know. So
there's Dan and Tony.
And then I'd like to get to simulations. So what is changing today, also, is that the largest simulations in various areas are becoming instruments in their own right, and the largest simulations start to approach petabytes. And we need public access to these largest simulations. But this generates new challenges. How do we simply move the petabytes of data from the computers we ran them on to a site where everybody can access them? How do we interface with it, how do we look at it, how do we analyze it? And this is a transfer: we had ten days to transfer 150 terabytes from Oak Ridge before it was deleted, and we did it.
So this is a database of turbulence, where we basically try to adopt a metaphor where, instead of downloading the simulation files, everything is stored heavily indexed in a SQL Server database, and then we can shoot test particles, virtual sensors, from our laptop into the simulation, like we were playing in the movie Twister, and the sensors report back basically the fluid [indiscernible] pressure, or in cosmology we can think of temperature. So basically, physical quantities. And we are now loading this, or actually this 70 terabyte MHD is already loaded. And we are writing a Nature paper about this.
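Conceptually, the virtual-sensor idea maps onto something like the following SQL sketch; the MySensors table and the table-valued function fGetVelocityAt are hypothetical (the production system exposes this through a web service), but it shows the shape of the interaction: send point coordinates in, get interpolated field values back.

    -- Hypothetical sketch only: interpolate the stored velocity and pressure
    -- fields at each requested (x, y, z, t) "virtual sensor" position.
    SELECT s.sensorID, v.vx, v.vy, v.vz, v.pressure
    FROM MySensors AS s
    CROSS APPLY dbo.fGetVelocityAt(s.x, s.y, s.z, s.t) AS v;
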
And this is a typical daily usage, about 100 million points get served up every
day from this database.
And it's interesting: for the users, it doesn't matter whether it's a 30 terabyte database or a 30 petabyte one. All they see is whether the points are coming back fast enough. And if they don't, they complain.
So these are a whole bunch of papers that were submitted. You can see
[indiscernible]. Now there's a nature paper coming. So this is generating
first class science.
In cosmology simulations, the same thing is happening. But a very profound thing happened in 2005. There was a simulation called the Millennium, run by a group of people in [indiscernible]. And Lemson built a relational database out of the data, and then it suddenly took off like wildfire. So there are now much better, much bigger simulations scattered around.
Still, everybody is using the Millennium because it's easy. It took the SkyServer framework, and simply plugged into the same interface, you can create basically the merger history of galaxies and so on. And you can now generate all sorts of realistic looking simulated observations of the data. And because it's easy to use, there are more than a thousand people using it daily.
And we are now trying to build something at Hopkins with an NSF funded project
to do a petabyte scale simulation, which is a single large simulation about the
Milky Way with about 800 snapshots, and more than a billion particles in the database moving essentially just in their orbits around the galaxy.
And then one of the particular things we are trying to do first is to compute dark matter annihilations interactively. So this is still from the Via Lactea, which is only 400 million particles. This was part of the original Via Lactea paper by Michael Kuhlen. It took him eight hours to compute this annihilation. Done on a GPU database, the time is 24 seconds. So it's more than a factor of a thousand faster. And the trick is that, instead of computing the ray trace, we actually used OpenGL. So we used the stereographic projection, where circles map onto circles and so on, and then we used basically a 20-line OpenGL code to do basically the same rendering but at hardware speeds.
And so next, what we would like to try is that people can change the cross section, so they can play [indiscernible] games with the physics of the dark matter particles. And they can wait 20 seconds to generate an image; they wouldn't wait eight hours.
So how do we visualize a petabyte? And here, the thing is that we need to send the rendering to the data the same way as we sent the analysis to the data before. It's much easier to send a high definition 3D video stream to an iPad than a petabyte of data to be rendered.
So basically, the visualizations are becoming IO limited, so we're trying now to integrate the visualization engines with the database so that the data traffic is going over the backplane of the motherboard. This is just an example for the turbulence. So we have been streaming a little over a gigabyte a second on a relatively small server into a GPU, about 25 billion particles.
But the nice thing about this is that this is done really in real time, on the fly, so this is not rendered frame by frame. So this is a capture of the real video stream.
The long tail. So Dennis already talked quite a bit about this. And so it's not only the big data that we need to be concerned about; there's a lot of data integrated in the tail of the distribution. And is there a lesson to learn from Facebook? In a sense, what is Facebook? It brings many small, seemingly unrelated junk data together into a single cloud and a single location, and some magic happens, basically, because people discover previously unknown associations and links. And is there a science equivalent? What is it and how do we do it?
And there is another lesson from the public services, which is DropBox. DropBox is an interface that nobody has to learn, because it's totally intuitive and it is so crisp. It is so hard to do such a crisp interface, which has nothing extra on it, only what is absolutely needed. But what is needed works like a charm. And there's a lesson there for science and for scientific interfaces.
Okay. So we are trying to do something that takes some of these lessons. So we already have the VO Space, the Virtual Observatory's distributed storage with security added on it, which is visible to all the web services of the Virtual Observatory.
So what if we combine the DropBox-like interface, a MyDB-type relational database to store both the metadata and a lot of the tabular data, and then the VO Space, but make the VO Space completely fault-tolerant and distributed, so a real cloud storage where the data is indestructible?
And basically offer free cloud storage for the VO community, so that you can upload all your images, whether you need them or not. Anyway, here is where you store
it. And then once it's up there, without asking any more information from the user, and no coercive XML forms to fill out, let's just start mining all those headers and look at the context of what files have been placed in the same directory. Every once in a while, ask the user: maybe we see that there is some commonality; what would you call this common tag for all these objects? And this is the most intuitive thing I can imagine.
Anyway, since in FITS files there's a lot of metadata already, without asking anything from the user we can always derive a very rich context.
And let's see how it can be generalized to other areas of science. And this is sort of a rough architecture, where we would have basically a VO Space interface. We would have a DropBox interface, but it would also basically hook up to a database, and we can also have multiple regions for fault tolerance.
You can imagine having one of those instances at Caltech, another one up at CfA, another one in Baltimore, and the data is basically everywhere. Wherever you need the data, it's already there.
Summary. Science is increasingly driven by data. The large data sets are
here, the cheap, off the shelf solutions are still not here. We see a changing
sociology in how the data acquisition is separating from the analysis. And in between those two is basically the data publication and the databases.
What we see is that we are moving from hypothesis driven to data driven
science. There are many deep similarities among disciplines related to data. This is, I think, the main lesson that we learned from the SkyServer. I was amazed at how many of the basic parts of our tool kit have applied, from radiation oncology to environmental science and to genomics as much as to
astronomy.
So I can say with confidence that we can build reusable solutions. There is a problem in the long tail, but we see now that this fourth paradigm of science is emerging. And I think Jim would smile. Thank you very much.
>> Yan Xu: So we have some time for questions.
>>: This isn't a question, but I got to see some of the most weird uses of it,
having gone from the Sloan to Galaxy Zoo. At the planetarium, I'm watching
what they call sky karaoke and they used Galaxy Zoo to fly through the universe
with the lyrics to Major Tom showing up spelled out in individual galaxies that
were matched to letters.
So the most improbable things ever out of the SkyServer. Never would have
guessed it would be Major Tom, spelled out in galaxies on a 200-foot dome.
That's the long tail for you.
>>: George?
>>: So one part of the fourth paradigm that is not really appreciated sufficiently, I think, is the paradigm shift that theory is now expressed as data, as opposed to as formulae. We still have analytic theory, but all these complex phenomena can only be done numerically. And so what you're doing really pushes this to the forefront, that you're observing simulations, you're treating simulations the same as you would measurements, because pretty much [indiscernible].
So what's your sense of the world view change among the theorists facing this transformation?
>> Alexander Szalay: Well, I think it already has happened, or it almost has happened. So when I was in [indiscernible], there was an exchange. People like [indiscernible] and Simon White were talking about an email they got from Richard Ellis, and they were arguing about how to write the SQL query to get the merger tree and the mass function correctly from the Millennium database.
Pundits like this are starting to argue about SQL. You know that something has changed. So otherwise, I think the other change in our world will be that, with the simulations, it is very easy to generate lots of data. We will understand that not all data is equal. So there will be some that we will just need to analyze and keep for two weeks. There will be others that we need to keep for two
months. And then kind of naturally, probably by natural selection, a few simulations emerge, like the Millennium, which will become de facto references for the whole community.
And then those we need to do right, and those we need to keep for years at a time. But right now, we kind of shove all the data together; we think that everything is equal. And this is kind of becoming a barrier.
>>: So survey data, which is primarily what you're delivering at the moment, has evolved over the last 15 years or whatever from being discrete objects defined by a couple of single numbers to discrete objects defined by many numbers, including multiwavelength. That's been the big revolution at the moment.
And I think the direction it's going is that it's not going to be discrete objects, but it's going to be a continuous field across the sky, and
the radio data is going to be a classic example of that. How do we deliver
that kind of data with these kinds of technologies, which are really
[inaudible]?
>> Alexander Szalay: Well, so actually, I don't see a big deal at this point in scaling up, for example, keeping all the Sloan images online, and with enough GPUs around, we could basically give custom, on-the-fly image processing for limited areas. Not for the whole sky.
But if you have something like five square degrees, I don't see an issue with doing something funny to the data and then running the extractor again with some more [indiscernible] parameters. That's perfectly doable. Or actually, you know, to do another thing, have a whole bunch of amateurs uploading a bunch of images basically to the VO Space/DropBox, and then essentially have a custom reprocessing of a good fraction of them which happen to overlap with the Hubble deep field. We just found a big new supernova or something.
So that's completely doable. It's a question of compute cycles and more [indiscernible] with us. Doing that at a hundred petabyte scale is a different matter. But again, if you take little bits of the hundred petabytes, I don't see a problem with it.
>> Yan Xu: Okay. Any other comments, questions? Okay. Thank you very much again, Alex.
>> Phil Bernstein: Hi, I'm a database researcher, and I work at Microsoft.
And I don't do astronomy. I don't actually apply database technology to
science, in particular. I'm mostly an enterprise computing guy working on
problems that are related to big business use of databases.
But Alex and company asked me to talk about data integration technology, which is a part of the database field where, if you're working with data, you're doing a lot of this, and automating it is a fairly important problem. So I'm here to give you a survey of what's done in data integration, but not particularly from a science focus.
So the problem is easy to state. That it's the task of combining information
from multiple sources and presenting a unified view of that data. So the input
is raw data. It can be arriving from instruments. It can be arriving from
users typing stuff in. It's probably heterogeneous in some way. That is to say, different formats, different assumptions about the meaning of the data, perhaps different measurement systems, and from different groups. So it's not like you
can necessarily talk to the people next door to find out what the data means.
It's a little more challenging than that.
And so the question is, what do you have to do to get to the point where you
can actually do the work that you want to do? This is a problem of data
preparation. Everything that you do with the data before you actually get to
solve the problem that interests you. It's hard, it's expensive, and I can
tell you in the enterprise world, it's about half of the work in getting a
database and making it useful.
So as a perspective, the data warehousing field, which is a part of data
integration, is probably about a third of the database market, maybe a little
more. And half of the money that's spent, half of the labor that goes into it
is just in getting the data into the data warehouse, to the point where
somebody can actually run a query.
So if you've been experiencing a lot of pain in this, it's not because you've got your data particularly poorly organized or whatever. It's just the way things are when you work with data.
So the next level of detail: what are the scenarios people run into? Well, one is data translation. You're given data in one format, and you've got to get it into another data format in order to be able to use it. It's not only the format. It's often changing the semantics, combining the data in funny ways.
If you're dealing with a stream of data coming in off of instruments or from
other sites, they might be a stream of XML messages. You might be loading up a
data warehouse, in the sense that your big astronomy databases are, in fact,
effectively data warehouses. Or you might leave the data right where it is.
You might not actually pull it together. You might have a query interface that
maps to the data, wherever it is, and then you run distributed queries to pull
the data together.
Or you may be building some nice portal to get access to the data, and the
data's in various places. You're exposing it in a nice format and giving
people the ability to write queries as well as form-based interfaces just to
retrieve individual records. Or you might be writing a program in Java or
C-Sharp or whatever in order to get access to the data with queries embedded
inside. And in that case, you probably want a wrapper that maps from your
object-oriented point of view from the program into the SQL point of view,
which is the way the data is accessed where it lives.
And so on. Report writers, query designers, form managers, web services.
These all have that same characteristic of having to expose the data
differently from the way it was originally produced and that gap has to be
filled by a pile of work.
What's common about all of these problems is that we're mapping between two
representations of data. And there's generic technology to work with that;
that is, to create mappings and to use them in order to do the translation
that's required.
Now, the space of problems that you can experience here is pretty broad. I'm
only going to focus on a little bit of it so I want to kind of outline the
whole space so you can see at least what I'm not going to be covering and what
other issues might arise.
So in one dimension, there's the question of the precision of the mapping that
you're producing. You know, if you're answering queries in Bing or Google, the
mapping can be pretty approximate. Nobody is expecting a perfect answer
anyway. The important thing is to be able to automate it. On the other hand
if you're trying to draw scientific conclusions from the data, then you
probably have to make sure this mapping is absolutely right, validated, you
know, five times by different people to be really sure. Or in my world, if
you're doing billion dollar money transfers, you also have to be absolutely
precise in the mapping definitions.
Then there's the kind of data that you're trying to integrate. It could simply
be formatted records, just have numbers and strings and there's nothing special
about it. But often, there's a lot more that's special. It can be text, you know,
like paragraphs. It can be geographical information, images, time series. You
probably know more of these than I do. And they all have special kinds of
integration problems associated with them.
Then there's the question of the process by which this integration is
happening. You know if we're days away from an asteroid hitting earth, and you
need to draw some quick conclusions, your problem is time boxed. Taking a year
to integrate this data is not going to -- it's too long, right. Or similarly,
if you're doing emergency management or you're responding to an emergency -- a
military situation where you've got different military groups combined all of a
sudden on the battlefield and you need to tie their data together.
So you have to time box it. I talked about the fully automatic case in the
case of Google and Bing. You might want to incrementally develop it. You're
not even sure how much of this data you want to integrate so you're going to
integrate a little bit of it and kind of see how it goes. And then based on
the integration you've done, you're going to integrate a little more and kind
of pay as you go, if you will, through the integration.
And then there's the carefully engineered problem where you're creating this
data resource, like the Sloan digital sky survey, and you're just going to put
a ton of work into making it as perfect as it can be, because you expect a lot
of people to be using it for a long time.
So most of the solutions I'm talking about here are for precise data
integration of formatted data with a carefully engineered process, which is
probably pretty close to what you mostly work with. But as you can see, there
are many other combinations here that are also very important in other
settings.
All right.
So the problem is you've got these data sources.
You want to
create a mapping to some target. Now, it could be all you've got is the data
sources and you want to generate the target format from the data sources, or
you might also have defined the target data source format ahead of time and now
you want to map between the given sources and the target that's already been
laid out. Could go either way.
So ideally, the mapping should be abstract, because you want it to be short,
easy to understand, so you can convince yourself it's right. It should be easy
to express. The whole point is you don't want to have to do a lot of work to
do this.
To be sure you've got the right one, it has to have clear semantics. That seems obvious, but many of these mapping languages don't have that property. And even when it is a precise language, if it ends up that the mapping is expressed in thousands of lines of imperative program code, then yeah, it has clear semantics, but how sure are you that it's actually doing what you hope it to do?
And then finally, it's got to compile into something that is executable, because the whole point of developing the mapping is to get the data into the format in which you want to use it.
So in terms of abstraction, you know, SQL's better than Java or C-Sharp or
Fortran, but it turns out that even higher level languages are needed and are
commonly used.
So let me give you an example of how this might go. I'm going to pick one that
I'm intimately familiar with. I worked on it for many years. Then I'll try to
give you a feel for how it applies in other settings. This is a problem of
generating an object to relational wrapper.
The scenario is the data's in a relational database. I want to write an object
oriented program to access the data so I've got to somehow expose the data as a
set of classes. And intuitively, it seems like this should be really dead
simple. A table sort of looks like a class in a programming language. Don't you just map them one to one and that's the end of the story? And the
answer is hardly ever. And we'll see why.
So here's a really simple example. Again, from the commercial world, but
you'll all relate to it easily. On the left, I've got three classes. The root class is called person and it has two specializations, one called employee and one called customer. So I've got two kinds of persons.
And let me just give myself a pointer here to work with. And there's
inheritance. So the employee has a department field, but it also inherits ID
and name from person. And then similarly, customer inherits ID and name from
person.
Now, the way this is stored in my database is in three tables. I've got a
human resources table, which just contains the ID and name of people, and an employee table: those persons who are employees also have a row in the employee table with their ID and department, because the department, recall, is the additional information for employees. And then
customers, well, customers are special. I've got a whole separate table for
them. So their information isn't stored in the HR table or the Empl table.
It's in this separate table.
So the mapping, I mean, this is pretty dead simple. The mapping, I'm sure you
can all see what's going on, that the person information is obtained from the
union of all the information in the HR table, plus the two columns of the
client table that are relevant. ID and name. So you take the union of those
two tables.
And then the employee class is simply the join of HR and employee. And the customer class, well, that's equal to the client table. So
really simple. And you may think this is a joke, but this is actually the SQL
you have to generate in order to implement the mapping on the previous page.
And the main reason is -- I see a couple of people nodding their heads, it's
like you've been through this.
So the reason is that when you retrieve one of these people from the database,
you don't know, necessarily, whether it's a customer, an employee, or just a
person who has no specialization. And so you're pulling it -- you've got to
pull the data out of all three tables and then based on what you find, sort out
which class it is you're supposed to instantiate for that particular person.
You know, if the person appears in the HR table and nowhere else, then it's a
person and it's not a customer or an employee. But if it's in both the HR
table and the EMPL table, then it's an employee and you want to instantiate
that.
So what's going on here is a bunch of bookkeeping in order to figure out which
of the objects you're supposed to actually create, all right. And this is
utterly standard SQL and this is actually generated by a particular system that
I work on.
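A condensed version of that mapping, using the table names from the example, looks roughly like the following (the SQL an O/R system actually generates is longer, but the discriminator logic is the same idea):

    -- Sketch of the mapping: persons come from the union of HR and Client,
    -- and a discriminator records which class to instantiate for each row.
    SELECT hr.ID, hr.Name, e.Department,
           CASE WHEN e.ID IS NOT NULL THEN 'Employee' ELSE 'Person' END AS ClassName
    FROM HR AS hr
    LEFT JOIN Empl AS e ON e.ID = hr.ID
    UNION ALL
    SELECT c.ID, c.Name, NULL AS Department, 'Customer' AS ClassName
    FROM Client AS c;
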
So how do you do this? Well, to generalize slightly, basically, there's three
steps in solving mapping problems like this. The first is given the source and
the target, just where are elements related? I mean, never mind the details of
the semantics. Just to know that some elements of the source actually
correspond to some elements of the target. You just draw arrows, basically,
connecting them.
This doesn't have any special semantics, but it's already useful all by itself,
because it tells you what the lineage of the data is, how the target relates to
the source. You may be able to use it for impact analysis. If I change the
source data, which parts of the target are going to break, because they're
related.
The next step is to turn those into mapping constraints, like the formulas on
the previous page, which actually tell you what the semantics of the mapping
is, and that's generally done by partitioning up the correspondences into
groups so that you can figure out which pieces of the connections actually
relate to each other.
And then finally, you want a function. You want a transformation that will
actually translate the data from source to target in some language that your
system supports.
So it's a three-step process from schemas to correspondences. Correspondences
to constraints, and then constraints to transformation. Now, this is -- I
motivated this for the object to relational mapping, but clearly if you were
populating a data warehouse, it's the same thing. That, you know, you're
translating data from one format to another for any reason. It's always these
three steps.
So the first step is just matching up the schemas. And this is a field by itself, schema matching. You know, there are many ways in which you can figure
out or guess how your source database formats map to your target formats. You
know, you expect, of course, the element names to be the same. But they're
often not exactly the same.
They're sort of the same.
Maybe there are synonyms involved. Maybe by looking at the data instances, you
can tell. Maybe by seeing the nesting structure of the elements is the same in
source and target. There are literally hundreds of research papers describing
different techniques that can be applied in different situations just for this one step, and basically they are all embodiments of some intuition that you might have
when you're trying to figure out how to connect up the data.
Ideally, these algorithms are available to you in some kind of a matching tool,
where you've got a source database structure on the left, the target database
structure on the right and you can start drawing lines between them to identify
how they relate. Again, ideally, with some assistance. So in the tool that's
shown here, if you right click on one of these elements here, then what it will
do is pop up a list of lines that connect to possible elements on the other
side that are good candidates for you to look at.
You know, and if the schema's only got ten elements on both sides, this is not
a big thing. But if it's got hundreds of elements on both sides, and in some
cases it can have thousands, then obviously it's a big help to be able to get
the system to assist you in picking good candidates.
Next is getting from those correspondences, those lines into constraints.
Generally, this is a manual process that mostly involves figuring out which
combinations of correspondences are grouped together to define a particular
mapping. You basically are looking at selections and projections of one piece of your source database equal to selections and projections of some piece of the target. You want to avoid join operators if at all possible when you do this; just to kind of figure out that this little rectangle of my source is meant to be this little rectangle of my target is a way to think about it.
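One such constraint, written as a check query that should return no rows if the constraint holds (the schema, table, and column names here are hypothetical):

    -- "Selection/projection of the source equals selection/projection of the
    -- target": every active source customer must appear as a target client.
    SELECT ID, Name
    FROM Source.dbo.Customer
    WHERE Status = 'active'
    EXCEPT
    SELECT ClientID, ClientName
    FROM Target.dbo.Client;
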
And then finally, there's the generation of transformations and you want that
to be totally automated.
So there's a bunch of products that do exactly what I'm describing. The one
I'm most familiar with, which I helped develop, is the ADO.NET Entity Framework. But there's an open source version, targeted primarily at Java, called Hibernate.
Oracle's got one called TopLink. Ruby on Rails, essentially the rails portion
of Ruby on Rails is essentially this, but with a much weaker mapping language
than the other three that I talked about here.
But it's not just for object to relational mappings. It's also for these other scenarios that I listed on that early slide.
There are other problems besides the one I just described that relate to this
whole field. One is schema translation: you've got this source database. It's in a relational database. You want it to be an object oriented structure to make it easy to write a program against it. Do you really have to design this manually, or can somebody just write an algorithm that will do the translation for you?
So there's all kinds of combinations, going from object schemas to SQL schemas.
You wrote the program. Please generate the database structure for me. Or I've
got the database, please generate the program structure for me. Or there's
this XML export format of this database that I need to use, can you please
generate a database structure that I can store this in that's nicely shredded
so that I can write queries.
So there's a whole category of algorithms to do this kind of thing. And it's
not just based on the schema. It's also based on the data. For example, here
I have three tables, a course table, a course details table, and a course
classroom table. And as you can see, they all have an ID field. And the
course details table has the same number of rows as the course table. What
that suggests to me is there's actually just one concept here, and somebody
just cut it down vertically and put half in one place and half in the other place, probably because the course information was more important than the course details information.
By contrast, the course classroom information also has an ID, but it has fewer
rows, but there's still a foreign key relationship here, which means that all
of the ID values in course classroom appear in course. That suggests the
course classroom is actually a specialization of course. You know, it's a
sub-type. So just looking at these very simple pieces of information about the
data itself, you can make intelligent choices as to what the schema ought to
look like when translated, in this case, into an object oriented one.
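The two checks just described are easy to run as queries; the table names follow the example, and the exact column names are assumed.

    -- 1. Equal row counts suggest Course and CourseDetails are a vertical
    --    partition of one concept.
    SELECT (SELECT COUNT(*) FROM Course)        AS courseRows,
           (SELECT COUNT(*) FROM CourseDetails) AS detailRows;
    -- 2. Fewer rows plus full ID containment suggest CourseClassroom is a
    --    specialization (sub-type) of Course.
    SELECT COUNT(*) AS orphanRows
    FROM CourseClassroom AS cc
    WHERE NOT EXISTS (SELECT 1 FROM Course AS c WHERE c.ID = cc.ID);
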
So this is just by way of example. There are lots of these kinds of
algorithms, but they all try to use this kind of reasoning.
So what I'm leading up to here is that there is a small set of operators to manipulate schemas and mappings, and it is really only about eight or ten of them. I showed you how to match two schemas, how to generate constraints from correspondences, how to generate transformations from constraints, how to translate schemas from one data model to another.
There are only about four or five more operators like this: merging schemas, differencing two schemas, composing two mappings, inverting a mapping. These are all familiar mathematical concepts, so it's not surprising that these should be the operators here.
Let me just give you one more example, which is schema evolution. Here, you've gone to the trouble of doing everything I've just said, but now something changes, and the person who did the original mapping, they graduated. They don't work for you anymore. And so now you have to evolve what you've got into a new format. So you've got this new schema that you need to support.
So what do you do? You had a view or a mapping to some user view of it, you
had the database. How do you handle the evolved schema? Well, it's the same
problem, sort of the same problem I described before. You create a mapping
from the old schema to the new one and then you generate a transformation from
it. Pretty much the same thing that I described earlier. Hopefully
automatically in order to migrate the database.
But what about this view that is running on the old schema? You need to have it run on the new schema. Well, that's a composition problem. You basically
need to compose these two mappings in order to get from the evolved schema to
the view. Now, if the mapping from your original schema to the view is going
this way, and the mapping from the evolved schema to the old schema is going
this way, this is function composition, utterly trivial, you know, just plug it
in.
But if the mappings are going in opposite directions, life is extremely hard,
because now you have to invert one of the mappings in order to be able to
compose them.
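In the easy direction, the composition can be done with nothing more than nested views; the schema below is hypothetical, but it shows why that case is trivial.

    -- The user view was defined on the old schema:
    --   CREATE VIEW EmployeeNames AS SELECT ID, Name FROM Person;
    -- After evolution, Person is split into PersonCore and PersonDetail.
    -- Re-expressing the old table as a view over the new schema composes the
    -- two mappings automatically, and EmployeeNames keeps working unchanged.
    CREATE VIEW Person AS
    SELECT c.ID, c.Name, d.Department
    FROM PersonCore AS c
    JOIN PersonDetail AS d ON d.ID = c.ID;
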
And depending on the language, that can be hard or impossible. And so, you know, people write programs to do this without even thinking at the level of mappings and just say, boy, this is really hard. What do I do with this little piece of data in order to get it into the other format? And they don't realize that the problem they're running into is that they're trying to invert a mapping that's not invertible, and they're making bad guesses.
Okay. Let me wrap up with just a few quick slides on another dimension of the
problem, which is tools for manipulating instances. I've been talking entirely at the level of schemas and mappings. What about actually just cleaning up your data and loading it?
In the commercial world, this is called extract, transform and load, and it's
a standard component of any data warehousing tool set. Here are some example
functions it might include. Let me quickly talk about a few of them.
So data profiling is the problem of giving you a summary of the instances of
the data without going blind looking through the instances one by one and
trying to imagine what the entire dataset looks like. The first column of your
table, gee, that's unique, you know. There are no two rows that have the same
value. That's useful information. Or maybe it's not the first column. Maybe
the first and the third together will uniquely identify every row of the table.
Or that every value in this column of table two is actually contained in the first column of table one, and there is obviously a reference going on here. Value ranges, value distributions. The result can be exact if it's exhaustive, or it might just be approximate, based on sampling.
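A few of these profiling checks are one-liners in SQL; the table and column names below are placeholders.

    -- Is col1 unique? The two counts match if it is.
    SELECT COUNT(*) AS totalRows, COUNT(DISTINCT col1) AS distinctCol1
    FROM T1;
    -- Do col1 and col3 together form a key? Zero duplicate groups means yes.
    SELECT COUNT(*) AS duplicateGroups
    FROM (SELECT col1, col3 FROM T1
          GROUP BY col1, col3 HAVING COUNT(*) > 1) AS d;
    -- Value distribution of col2.
    SELECT col2, COUNT(*) AS freq
    FROM T1
    GROUP BY col2
    ORDER BY freq DESC;
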
I don't know how often this comes up in your world. In the commercial world it comes up all the time. People give you data and you've got to kind of
figure out what it is they actually have. Fuzzy look-up is another one. You
start wanting to search the data, but in the beginning, you're not really sure
what the values actually are. You know that there's a value, there must be a
row in there for this product. You don't know how it's spelled, and that's
sort of the whole point, but you want to find the row. So you need to do fuzzy
look-up. And it's more like kind of keyword search against the internet,
rather than doing a SQL query to find a row. But often, fuzzy lookup is not
supported by the database system. So it's a separate mechanism that's needed
here.
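As a very rough sketch of the idea, the built-in SOUNDEX/DIFFERENCE functions in SQL Server can approximate a fuzzy lookup, though real fuzzy-lookup components use far better similarity measures; the Product table and the misspelled search string are made up.

    -- Return rows whose name sounds like, or loosely contains, the search term.
    SELECT ProductID, ProductName
    FROM Product
    WHERE DIFFERENCE(ProductName, 'widgit deluxe') >= 3
       OR ProductName LIKE '%widgit%';
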
Then there's parsing. Some of the fields you're given are actually just a pile
of text, but that's not how you want to store it. You've got names and
addresses, but of course you want to parse that out into a structure that you
can actually work with. And, you know, if you could just give it a few
examples, maybe the system can just figure out the rest for you and develop a
parser that will automatically do that reconstruction.
Then finally, there's duplicate detection. The problem is, of course,
duplicates are hardly ever exact duplicates. They're almost duplicates so you
need to do fuzzy duplicate detection. And everybody's experienced this, because we all get mail solicitations or advertising where three envelopes come from the same place, because the address is a little different.
That's a product category. People sell software to de-duplicate address lists.
So it comes up all the time, and there's generic technology to do this. So you
want to be able to figure that out automatically.
So that's kind of my whirlwind tour of the data integration field. You need to
define mappings. You want them to be high level. You need these -- ideally
you have these powerful operators to manipulate the schemas and the mappings.
And after you develop the mapping and it's doing its job for you, you're going
to have to maintain it as systems evolve, if the data has any long life to it.
And finally, you also need tools to be able to clean the data initially and
then to be able to go back and do that again as well.
So that's all I had prepared here.
>> Yan Xu: Questions, please?
>>: I was interested in your fuzzy lookup stuff. In astronomy, we often get
the problem. Let's say I take a Sloan database. I know [indiscernible] union
on to cross-match that to another database with another wavelength. So
typically, an entry there, that's probably that one, but it could be that one.
Maybe there's no set probability of [indiscernible]. Neither is actually
right. Really, I need to have an output database which contains those
probabilities.
So it's a bit like your fuzzy lookup there. Is any machinery in the pipeline
that might be able to handle that case?
>> Phil Bernstein: I don't know how much we can automate it. I mean, the
details of how you calculate those probabilities and what the tolerance is for
declaring one element being a candidate match for the other one are clearly
very important. But the general area of probabilistic databases is actually
something that a lot of people have been working on in recent years.
Now, the problem is that we love to do things generically in the database
field. Your point of leverage is to come up with some very general-purpose
technology, like SQL, and then it can be used for everything from accounting
to astronomy.
So the same thing is true in this probabilistic database area, where people are
trying to develop generic ways of manipulating rows where they have some
probability of values being correct, some probability of values matching.
Whether that general probabilistic model meets your needs, I couldn't tell you.
And I think that that's -- so I don't know whether there's something in the
pipeline that will actually help you with that, but there is a field, and if
somebody takes an interest in learning about probabilistic databases, you can
see whether it applies. And I'm quite sure people doing that work would love
to hear about an application area that doesn't -- whether it fits or not, to
understand exactly what the mapping is and whether there's something they can
do a little differently in order to meet your needs. I'd be happy to connect
you up to people in particular if that's an interest of yours.
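A toy sketch of the kind of probabilistic cross-match the questioner describes,
keeping every candidate counterpart with a relative probability instead of
forcing a single answer; the Gaussian error model, the one-arcsecond positional
uncertainty and the catalog values are all assumptions, not the method of any
particular survey or database:

    # Toy probabilistic cross-match: for each source in catalog A, keep all
    # nearby candidates in catalog B with relative match probabilities.
    import math

    cat_a = [{"id": "A1", "ra": 150.0010, "dec": 2.2000}]
    cat_b = [
        {"id": "B7", "ra": 150.0012, "dec": 2.2001},
        {"id": "B9", "ra": 150.0019, "dec": 2.2006},
    ]

    SIGMA_ARCSEC = 1.0  # assumed combined positional uncertainty

    def sep_arcsec(s1, s2):
        # Small-angle separation; fine for sub-arcminute offsets.
        dra = (s1["ra"] - s2["ra"]) * math.cos(math.radians(s1["dec"]))
        ddec = s1["dec"] - s2["dec"]
        return math.hypot(dra, ddec) * 3600.0

    def candidates(src, others, max_sep=5.0):
        weights = []
        for o in others:
            d = sep_arcsec(src, o)
            if d <= max_sep:
                weights.append((o["id"], math.exp(-0.5 * (d / SIGMA_ARCSEC) ** 2)))
        total = sum(w for _, w in weights) or 1.0
        return [(oid, w / total) for oid, w in weights]

    for src in cat_a:
        print(src["id"], candidates(src, cat_b))

The output of (source, candidate, probability) triples is the sort of table a
probabilistic database would then let you query directly.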
>>: So the one thing you didn't talk about in the summary here is the tools to
clean data. By and large, data are dirty, particularly from instruments
that sit out there in the field and measure [indiscernible]. So one of the
things is that it's often difficult to automatically identify those kinds of
things. And so it sounds like this probabilistic analysis is something that
would be useful. Is there work going on in this area to -- in other words, is
there a way for me to -- typically, the way I do this is I plot data. And I
look up --
>> Phil Bernstein: The outliers --
>>: I say that doesn't look so good. But that's hard to generalize. In other
words, how do I find similar kinds of things through the whole dataset?
>> Phil Bernstein: Right. So as I said, there are these ETL extract transform
load systems and they often have nice graphical interfaces for developing
pipelines of operators. But figuring out how to actually clean the data
requires some domain expertise about the data.
So the question is, what can the tools do to eliminate what you consider the
tedious part of the problem and allow you to focus on the intellectually
challenging part of the problem? And that's an ongoing iteration. I mean,
every release of tools does better at that. So, you know, clearly,
visualization is part of it.
>>: The question is I'd like to, you know, circle something on the graph and
say find me these things.
>> Phil Bernstein: Ah, you want to get from the visualization back to the
data.
>>: Obviously, if I look at a month's data, I can see ones that are bad. It's
when I'm looking at, you know, 60 years of data across 200 different locations
that I can't do it all visually.
And so the question is, you know, the work on the tools that are needed to let
me somehow tell the computer why I think this particular point is bad and then,
you know, find me things that are like this.
>> Phil Bernstein: It sounds like a very interesting problem. We should talk
about it. It's a -- well, okay, data mining and databases, you know, there's a
fuzzy connection there, and we do both. But it's pattern matching, and there's
a provenance problem in there as well to figure out, because, you know, you're
looking at data in some visual format. Let's look at the exact instances of
that data and then use those as templates, maybe, to find related instances and
so on.
There are probably multiple components to it: you're developing an end-user
scenario,
if you will, for using the data. And then the question is what components are
needed in the pipeline to actually be able to make that happen. It sounds
really interesting. I hope we'll have a chance to talk about it soon.
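One plausible sketch of that workflow, under the assumption that the circled
points and the rest of the data live in the same numeric feature space: treat
the flagged points as templates and pull out everything that lies close to any
of them. The features, the synthetic data and the distance cutoff are all
made up:

    # "Circle a few bad points, then find others like them": use the flagged
    # points as templates and search the full dataset for nearby points in
    # standardized feature space. Data, features and cutoff are invented.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, size=(10_000, 3))     # e.g. temp, flow, voltage
    data[:20] += np.array([6.0, -5.0, 4.0])           # a small bad cluster

    flagged = data[:3]                                # the points the user circled

    # Standardize so no single feature dominates the distance.
    mu, sigma = data.mean(axis=0), data.std(axis=0)
    z = (data - mu) / sigma
    zf = (flagged - mu) / sigma

    # Distance from every point to its nearest flagged template.
    dists = np.linalg.norm(z[:, None, :] - zf[None, :, :], axis=2).min(axis=1)

    CUTOFF = 2.0                                      # illustrative threshold
    similar = np.flatnonzero(dists < CUTOFF)
    print(f"{similar.size} points resemble the flagged ones")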
>>: Where do B to B [indiscernible] issues come in here? Are they just
another mapping, or is there [indiscernible] impact you're describing here have
on the [indiscernible].
>> Phil Bernstein: You're talking about business to business communication.
This is -- this is where the action is in B to B e-commerce. So Amazon, dealing
with heaven knows how many suppliers of products, you know, they define a form.
How do you suppose all those folks get their product into the Amazon catalog so
when you browse it, they all look the same and they're readable?
And the answer, of course, is that Amazon has a target schema. And if you're a
vendor, you've got a catalog, you've got to map from your format into their
format or you've got to get somebody to do it for you.
>>: So in that case, they're the biggest --
>> Phil Bernstein: They get to define, yes.
>>: They get to see it.
>> Phil Bernstein: Right.
>>: But in more peer-to-peer contexts, are there other [indiscernible]?
>> Phil Bernstein: In more peer-to-peer contexts, then it's a business
negotiation as to who's going to do the work that, you know, I don't know if I
have any good example. Mergers and acquisitions are an example sometimes.
Bank A, you know, acquires Bank B and they've got to integrate all the data.
It's clear who the source is and who the target is.
And then they've got to figure out just how much work it's going to be.
>>: [inaudible].
>> Phil Bernstein: Who's going to do the work. Then you worry about making
sure the people in the acquired company are willing to stay. What incentives
are you going to give them, because they know they've got no future unless they
want to move to the other bank's headquarters.
And in some cases it's so expensive, you've got to size the project. In some
cases, the decision might be made to run the two systems in parallel
indefinitely, because it's just too expensive to make them into one system
anytime soon.
So I don't know. I'm sure that -- I'm sure that there's lots of smaller
examples of this. And perhaps research labs collaborating. I've seen
bioinformatics. You get a post-doc at some biolab and they're grabbing data
from three collaborators. Source data formats change. Just willy-nilly,
nobody bothers to tell you. It's just that you bring the data in one day and
it just doesn't run anymore, and you don't know why. So it's the same in
science as in business.
>>: One of your early slides showed a simple map of just spaghetti code. Are
there tools for validating the correctness of the code -- that it does just
what you think it does?
>> Phil Bernstein: Boy, I want to say yes to that one, but I'm not sure I can.
There's some fairly fancy math that goes into doing the compilation. This is
not simply a matter of a developer sitting down and making a best effort to
compile the constraints into a SQL program.
In the case of the Entity Framework, which was the mapping language that I
described, the constraints actually compile into two different sets of view
definitions. One set of view definitions maps the relational format into the
object format so that it can be exposed to the program when it runs a query.
The other views map the object format into the relational format so that
updates that are applied to objects can be propagated down to the relational
format.
Now, a moment's thought will tell you that it would be awfully good if the
composition of those two mappings was the identity. I mean, if you store
something and you bring it back, you hope you're going to get back exactly what
you stored. No more, no less. So there are tests on the compiled view
definitions that you can run in order to make sure that's true. They're not
cheap. This is an offline activity. But you, in principle, you can do that.
And, in fact, we do. And for technical reasons, we have to, because in our
system it's actually possible to define constraints that compile into view
definitions that do not compose to the identity.
I mean, you know, the reason why we do that is because if we were to restrict
the mapping language enough to ensure that never happened, we would end up
making it impossible for you to do some mappings that you really want to have.
So instead, we allow you to express some mappings that we can't handle so that
we can allow mappings that you do want and then during the compilation process,
we figure out whether you produced one of the mappings we know what to do with
or one of the ones we don't know. And we'll tell you which ones. And we do it
through this check.
So it's not exactly a complete correctness test, but it does give you some
additional confidence, because the two mappings are generated independently.
So if they compose to the identity, they're probably right.
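A tiny illustration of that check, with hand-written stand-ins for the two
compiled view sets rather than the actual Entity Framework machinery; the point
is only that a store-then-load round trip should return exactly what was
stored:

    # Round-trip identity check: the query-direction and update-direction
    # mappings are generated (here, written) independently, so composing
    # them and comparing to the identity gives extra confidence.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Person:                      # object format
        person_id: int
        full_name: str

    def to_relational(p):              # object -> relational (update view)
        return {"id": p.person_id, "name": p.full_name}

    def to_object(row):                # relational -> object (query view)
        return Person(person_id=row["id"], full_name=row["name"])

    def roundtrip_is_identity(objects):
        # No more, no less than what was stored.
        return all(to_object(to_relational(p)) == p for p in objects)

    sample = [Person(1, "Ada Lovelace"), Person(2, "Edwin Hubble")]
    print(roundtrip_is_identity(sample))   # True for this toy mapping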
>> Yan Xu: One last question.
>>: So going to a different part of the space, basically, there was
[indiscernible] tables. Would you care to comment about the usefulness?
>> Phil Bernstein: Yes. So I'm involved in that kind of work as well. We
have quite a big project here along the same lines. And so I'm not sure. Are
you talking about web tables or fusion tables? Because there's two different
projects that he's involved in. So web tables is the observation that if you
take the result of a web crawl, look for all of the HTML tables that are there,
and get rid of all the ones that are just HTML tables used for formatting a web
page -- so you just look at the tables that really are tables, that have data
in them -- there's about half a billion of them out there.
Surprising. That doesn't count, you know, the forms interfaces that are
sitting in front of databases. We're just talking about tables, where the data
is actually sitting there on the web page.
That's a lot of information, clearly a lot of it is redundant. There's a lot
of useful information you can get out of combining those tables. You know, if
you take a given topic, you might find a hundred or 200 tables about that topic
and you can kind of merge them and, you know, majority rules. You can do good
things with them.
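A toy version of that majority-rules merge over values extracted from many
tables; the entities and values are invented:

    # Majority-rules merge: the same attribute extracted from many web tables
    # may disagree, so keep the most common value per (entity, attribute).
    from collections import Counter, defaultdict

    extracted = [                       # (entity, attribute, value) triples
        ("M31", "type", "spiral"),
        ("M31", "type", "spiral"),
        ("M31", "type", "elliptical"),  # a noisy source
        ("M87", "type", "elliptical"),
    ]

    votes = defaultdict(Counter)
    for entity, attr, value in extracted:
        votes[(entity, attr)][value] += 1

    merged = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    print(merged)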
That's one project. I think there's going to be a lot more of that. We're
certainly going to do a lot more of that.
The other is really along the lines of a lot of these talks we've heard today,
about data sharing. Allan's group at Google, in particular, is allowing
users to put tables and store tables in the Google infrastructure and then they
will, if it's geographical data, they will put it on a map. There are certain
other simple kinds of value you can get for putting data there. And we're
doing the same thing.
We have something called Azure data market, which is basically a brokerage
house for commercial data. You can buy or rent access to Dow Jones or Reuters
data or what have you. Or get government data for free.
And this is an ongoing project, where there's likely to be, you know, a lot
more data made available that way. And there's a website called SQL Labs,
which has a project there called data explorer. It looks like Excel, but in
fact what it is, it's a way of you putting data into an excel-like format, but
running ETL programs in order to transform it to get it into the format you
want.
So, you know, we're all kind of experimenting in this space to figure out what
large numbers of users really want. I can -- anybody who is really interested,
I'd be happy to give you pointers to the websites. And if you have any
problems that you wish a system like this would solve for you, we're all ears.
Because I think the whole field is trying to figure out just what it is that
databases as a service ought to be offering to customers.
Obviously, you know, we offer SQL as a service, and you can just store your
database in SQL Azure and it's plain old SQL. But what else? And we're all
working hard to figure out what that next step might be.
>> Yan Xu: Okay. Thank you very much. We should move to the next
presentation from NSF.
>> Nigel Sharp: I wasn't at all sure what this was supposed to be. This is
all George's fault. I didn't even agree to do this. But so I'm going to make
maybe three points based upon today's discussion that I think we should
emphasize and I think we should think about how we're going to bring them
about.
The first one is related to the comment that was made that most researchers
aren't interested in becoming their own system managers. They don't want to
know how to run a supercomputer. They don't even usually want to know how to
run the cluster in their department. So they hire a system manager.
I think in the same way we're going to have to see the development of the data
scientist. This has been talked about before, but you won't do the data work
yourself. The researcher isn't particularly interested in that. They want to
get science done. So we're going to see that specialty.
We started doing something of this sort with the supercomputer centers when
they were introduced. There were people there whose job was to help you use
them. And that's what we need. We need that interface. And we need that to
be a valued career path for the sort of people who are interested in doing it.
From the NSF perspective, the way you create a career path is quite simple.
Universities like money. So you create the mechanism that funds people doing
that, and that creates the career path.
We can't create jobs. You give a grant to a university, that doesn't give them
any incentive to create an entire position, because that's a decades-long
commitment by the university. But if you have a mechanism that provides career
capability, then you can introduce that, and that's something I think we need
to think more about. The expert who understands the data aspects but can talk
to the researcher in their own language.
Because many of the conversations that you're going to have, you talk at cross
purposes or, indeed, complete misunderstanding using terms that other people
don't understand. So the interface person, I think, is so important.
I can't think of the easiest way to put it. I think my function really at this
sort of meeting is to answer questions rather than to talk. I didn't write a
presentation because, you know, I don't know what you're interested in hearing.
I think in many cases, it's easiest if you ask me questions, rather than that.
What's the panel coming up next?
Who's on that?
>> Yan Xu: [inaudible].
>> Nigel Sharp: I wasn't sure who you were planning to put on it. I've got
the -- is it here? I'm sorry. All right. Okay. So the other one was sparked
by something I thought was missing this morning. We were talking about models.
A lot has been said about data archiving, data curation, long-term support of
data.
So there are two things related to that.
The first, of course, is the purpose of storing data is not to store data.
It's to enable new research to be done. It's to enable validation of the
existing work that was done with those data, but it's also to enable people
with clever ideas to do new things with it. So we need to think about, you
know, the fact is we're going to throw data away.
Data can be funded. It has to be funded. It takes power, it takes resources. But
the model that wasn't mentioned this morning, I just want to throw out. I
expect other people have thought of it already. Of course, it's libraries.
We've had libraries for decades, centuries even. Libraries have a funding
model. Some of them are by subscription. Some of them are pay per use. Some
of them are supported as part of overhead by universities. There are
specialized collections. There are general purpose and so on.
So I don't think we should dismiss that simply because libraries existed to
curate large collections of paper. I don't think we should dismiss that model
and possible funding model.
Also, there's a vast amount of funding on the internet based on advertising.
Many sites that deliver things to people purely in order to get words in front
of eyes. That's the advertising model. I don't think we should discount that
possibility, either, in terms of supporting these things.
The problem from our perspective is NSF runs on a very simple model of starting
out research. If you go to the website for the NSF, you'll see the subtext
underneath National Science Foundation says "where discoveries begin." It
doesn't say where discoveries are preserved, curated and held for all time.
So we're very good at starting things. We're very interested in getting
things going. But we're not terribly interested in long-term sustainability.
We tend to figure that if it's a good idea, it will attract its own long-term
support.
So those models, I think, we need to worry about. Nevertheless, if you end up
with something like a subscription service, then you can, in fact, request
money on NSF grants to pay your fees to that. It's not an unfunded mandate.
It's a multiply funded mandate. It's not an always-funded mandate.
What I think of as a great strength of the NSF, a lot of people think of as a
weakness. One of the things that is a strength of the NSF is that everything that
goes on in the NSF is judged by the community that's asking for money. That
is, the review process is handled by the same community that is requesting,
typically, reviews are done often by people who got funding in the previous
cycle.
And so what is a standard in each community is applied to the next one. This
is one of the challenges we've faced when we introduced that data management
plan requirement was, you can't make a mandate for every field of science and
engineering in the country that says you will make all your data available to
anyone for free. You can't do that, because there are whole fields of research
that the NSF supports where that would actually stop the research altogether.
In areas with commercial value in the geosciences, for example, researchers
have made agreements with companies. You can't then say, just because I got an
NSF grant for part of this work, that those data are going to be public.
There are legal issues in the social sciences concerning personally
identifiable information. So you can't make a mandate. And so what we came up
with was intended to be a very -- the other message, the soundbite was the
policy is you have to have a policy. The idea was to make people think about
data management.
We've always wanted people to share data. Not necessarily for free, but we've
always wanted them to. We wanted to make people think about it. I've talked
to a bunch of university groups and one of the things that surprised me after
the policy came out was the number of universities where the computer section
was saying, you know, let us help you with the NSF mandate to archive your
data, because we didn't write it that way. Certain parts of the NSF have added
their own restrictions on top of it. Some of them, I think, are unworkable, and
I think a number of communities should be protesting.
So that was an unintended consequence. The other thing I should say about
that, of course, is it's a work in progress. If you don't like it, protest to
somebody, because it will be rewritten. The first version of any policy never,
never stands forever.
Anyway, I've talked enough. Dan?
>>: Not bad for a guy who had nothing to say.
>> Nigel Sharp: How long have you known me, George?
>>: Do you care to speculate what the portfolio review might mean for
astroinformatics, roughly speaking?
>> Nigel Sharp: Well, I don't really know what astroinformatics is. I have a
number of opinions from a number of people. No, the portfolio review is -- for
those of you who don't know, and I expect most of you do -- the Division of
Astronomical Sciences is in this position: the national [indiscernible] survey
predicted that our budget would grow faster
than it has. In fact, in fiscal '12, we are $45 million below where the
[indiscernible] survey said we would be.
This is particularly annoying to me personally, because I gave the talk to the
Decadal survey committee, explaining our budget projection, which they threw
out on the grounds it was too pessimistic to make a plan. In fact, the final
budget is even more pessimistic than the one I gave them. It's particularly
irritating that they didn't listen to me.
So we held this portfolio review to see where money should be spent and
shouldn't be spent. At this point, it's too difficult to predict, because we
have to do an implementation plan that is, we have to do a response to the
review or to the report. And the response will condition what happens, and the
response is tied up with enormous amounts of politics, including the fact that
we can't give any details because the fiscal '14 budget request is completely
embargoed from anybody being allowed to discuss whether we're going to be in
business or not.
So it's complicated. I will say -- have we got alcohol this evening? Buy me a
beer, and I'll say something else when it's not being recorded.
>> Daniel Katz: All right. So I'm Dan Katz in the Office of
Cyberinfrastructure. I guess I should say with a caveat that this is what the
Office of Cyberinfrastructure has been and will be until at least September
30th, and then it's not completely clear what will happen after that. But I
think this will continue at least according to the memos that I've seen, so
hopefully this is reasonable.
So cyberinfrastructure, just to mention for people that aren't aware, computer
systems, data storage systems, instruments, repositories, visualization and
people linked together by software and high performance networks. So I think
this is the best definition that I'm familiar with for cyberinfrastructure,
because it has the people involved and it also talks about the software and the
networks as the things that really link things together.
So software here is part of the infrastructure, developed by individuals and
groups in this context, developed for a purpose and used by a community. So
this isn't general software. This is software that's part of an
infrastructure.
I'm going to skip this for a time. So what I think we're trying to do at NSF
in terms of software is, as I said before, to create and maintain a software
ecosystem that provides capabilities that advance and accelerate scientific
inquiry at new complexity and scale.
And I guess I would say somewhat differently than what Nigel said. In OCI, we
feel like really what we're doing is not doing research, but is providing the
infrastructure that people can use to do research.
So that actually is a little bit of a difference between our part of NSF and
most of the rest of NSF. So in order to do this, we need to have research that
goes into this. So there's foundational research in computer science and
mathematics and astronomy that develops the algorithms and develops the tools
and data analysis routines that then become part of the software
infrastructure.
And by using that, we actually get science out in the end. Ideally,
transformative, interdisciplinary collaborative science and engineering. But
in many cases, domain-specific, incremental. We'd like more of one than the
other, but we're happy, really, with science coming out.
>>: I'm sorry, not incremental. The opposite of transformative is
foundational.
>> Daniel Katz: Okay. Well, we do incremental too.
>>: Read the policy document.
>>: Those papers are incremental.
>> Daniel Katz: In any case, the other two pieces that go along with this is
the fact that we want to try to transform practice through policies. And
policies end up being a big challenge, as many people know. So ideally, we
would like to create incentives to challenge the academic culture, to promote
open dissemination and use, reproducibility and trust, sustainability, policies
for governance of community software, policies for citation of software, and
use of software, stewardship and attribution of software authorship. So these
are really the hard things that we have difficulty doing, but we feel like
really are needed to be done in order to have this ecosystem that really lives
onward over time.
And then finally, we need to use the software in the ecosystem to train
students, to train the next generation, to train the people working in the
field today. And we need to make sure that the people that are coming up both
know how to use the software as well as know how to write software in the
future, or else we'll be stuck where we are.
So this is, I would say that this is really the view of OCI in terms of where
software is as an infrastructure, but you could actually use this same picture
and substitute data or substitute hardware or a bunch of other things. This is
intended to be fairly broad.
So the goals of OCI, in terms of software, and I've kind of said some of this,
is that we want to encourage and contribute to the development and deployment
of comprehensive, integrated, sustainable and secure cyberinfrastructure at the
national and international scales. And this is including, again, software as a
piece of this.
We want to have an effective cyberinfrastructure impact that has clearly
defined benefits across multiple research disciplines, so there are a number of
things in software that probably don't fit into OCI. They may fit into
astronomy. They may fit into other disciplines. We want to focus on the
things that are impacting multiple disciplines.
We want to promote the transition from basic research to effective practice.
Again, the basic research probably is not our part of it. But moving things
that have been proven in research into the infrastructure is, and we want to
build on existing and upcoming investments as well as CI investments that are
coming from other units so that we're really kind of putting things together,
not doing a bunch of individual pieces.
So it's worth mentioning how we are working in NSF in terms of some of the
software programs, and particularly the SI-2 program and the CDSNE program that
are coming up. And in the SI-2 program, in particular, we have a cross-NSF
software working group that has members from all directorates. We talk about
solicitations ahead of time, determine which directorates in which divisions
are going to participate in each solicitation, discuss and participate jointly
in the review process, and then work together to fund worthy proposals.
So I'm kind of going through this just from the point of view that if you're
writing a proposal, you might want to think of who's going to be looking at the
proposal and how decisions are going to be made. And the fact that your
proposal is not going to go just to OCI or just to astronomy may actually make
some impact in how you write things. There might be people in other fields,
also, that are looking at this and trying to say, is there some piece of the
software that is going to be reusable in my discipline also.
So if we put this together, particularly in the SI-2 program, which is the
sustainable software program, but in software in general, we think of things in
multiple classes. So we think of software elements as kind of these small
pieces that are written by one or two PIs. But again, that are sustainable
elements of cyberinfrastructure, not research activities.
We think of multiple elements being integrated into frameworks or other
integrated activities. They don't have to be frameworks. They can be some
other collection. We think of, then, in the future, software institutes that
will look at the elements and the frameworks and try to build whatever
community things are needed around them.
So these might be policies. These might be workshops leading to new
requirements that then get fed back into the software developers themselves.
There's a variety of things. There could be activities looking at new
publication mechanisms, new access policies, a few different things.
So as we go kind of across the spectrum, we get larger and larger communities,
ideally. So the community that we would want to see for a software element
that we would view as a successful one may be fairly small, but it needs to be
well defined. If we're looking at an institute, it needs to have a fairly
large community that's well defined.
And finally, in all these pieces, we really expect that these are going to be
reusable. So the software that's being generated is not being generated just
for one science question or even one discipline, ideally. It's something
that's going to be useful across a number of disciplines.
If you're interested in the SI-2 solicitations in general, there's a couple of
them listed on the bottom. The first one was the elements and the frameworks.
The second one was the institutes. These have both passed, but there likely
will be more equivalent solicitations coming in the future.
The one that's on the second line from the bottom is a partnership between the
U.S. and the U.K. in chemistry, specifically. We had
some interest from EPSRC in the U.K. and managed to figure out some place that
we could work together. We're certainly interested in doing this with other
countries as well and other funding agencies.
And I think in the future versions of the solicitations, we will probably be
much more explicit about international partnerships and the fact that we're
looking for them where we can find funding agencies on the other side that will
fund the other part of the partnership, because NSF won't fund things outside
of the U.S. in general, but we're very happy to look at projects where we can
work with another agency and fund a larger project.
And then finally, I just wanted to mention that, again, for people that are in
the U.S., in the GPG, there are opportunities for supplements and EAGERs that
are always open. So EAGERS are exploratory, kind of small projects. I don't
know, Nigel probably will be very unhappy if I say this, but I would say that
many program officers are really eager to hear about ideas that people have for
extending their current projects and for new exploratory projects that come in
outside of the formal solicitation mechanism.
>> Nigel Sharp: Why would I be unhappy about that? I agree with that.
>> Daniel Katz: Okay, good.
>> Nigel Sharp: The Early-concept Grants for Exploratory Research -- the whole
idea is, you know, testing key ideas before you write a full proposal. Very,
very valuable.
>>: How do those get funded?
>> Nigel Sharp: Well, the problem with an EAGER, of course, is since normally
people don't submit until they've discussed it, they generally have close to
100 percent success rate, because they don't get submitted until the discussion
has reached the point where, you know, it's going to get a good reception.
>> Daniel Katz: I think there's a fair number of these, and at least before I
came to NSF, when I was just at a university, I really wasn't very much aware
of these as opportunities. So I guess I just wanted to make sure that people
are aware that program officers in general are interested in hearing about new
ideas that we can try to encourage to lead into a future proposal.
>>: [inaudible].
>> Daniel Katz: There are 300,000 at maximum?
>>: It depends on the division. Engineering ones tend to be more, because
engineering is more expensive. The math ones are cheap.
>> Daniel Katz: I think the GPG says 300,000 maximum.
>> Nigel Sharp: The maximum, but that's because it has to meet all -- it also
says at levels commensurate with normal awards within the program. So it
varies. I wrote that language too.
>>: You've got a good memory.
>>: He wrote it.
>> Daniel Katz: So I just want to throw out some open software questions that
we have that are worth thinking about. So one is that software that's intended
to be used as infrastructure has the challenge that, unlike in business, where
when you get more customers you get more revenue, if you have more users for
open software, it often ends up actually being more work for the developers.
And there's no real -- there's nothing that really comes back that
benefits you.
So there's a question about how we actually address that. Taking something
that's a research piece of software that proves a concept and turning it into
something that's really reusable is often as much work as the first part or
maybe a multiple of the amount of work of the first part. And so we need to
figure out how to really do that in a cost effective way.
What NSF can really do to make these things easier is, I think, an open
question. So the simple answer of giving more money is probably not
the best one. But if there are systematic things that can be done that help
these situations by trying to group people together, group projects or looking
for common issues, common solutions, we're certainly interested in that.
Another question is how we should measure the impact of software. I think this
is a large question that we have. If you have software that you're offering
for people to download, the easiest thing that you can do is to count the
number of people that download it. That's also probably the least effective
thing that you can do.
We'd much rather know how many people are using it. And beyond that, we'd
rather know what they're using it for. If people are really using it for
science, if they're using it for education, if they're using it just to test
it. I think those things make kind of a big difference.
We have a fixed budget in some sense. I mean, not necessarily year to year
fixed. But within a year, it's fixed and we have to figure out what fraction
of the funds should be supporting existing projects, keeping things
maintainable, keeping things as infrastructure versus putting money into new
things that are going to become the infrastructure in the next few years.
I think one of the questions that comes up here is, in particular, for software
that's intended to be downloaded and built versus software that's intended to
be used as a service, I think there are probably fairly substantial differences
between how we look at that being used and what the rules are between what we
want to encourage and what we want to discourage.
And I don't think we really have a good sense, yet, of the software that's a
service. I think we have a pretty good idea of software that's downloaded, how
to encourage different people to do things with it and how to measure it a
little bit and some of those things. But the service piece is a little bit
more fuzzy.
If we're bringing in new pieces of software into the infrastructure, then we
have to eventually stop supporting other pieces, again with a fixed budget so I
don't know, at this point, how we make the decision that something has reached
the end of its life and we're not going to support it anymore.
Finally -- almost finally -- encouraging reuse and discouraging duplication is,
I think, quite a challenge, because the piece of software that somebody writes
maybe is not 100 percent ideal for somebody else, but it's sometimes a lot
cheaper for us if they use that software than if they develop something else.
So the question of kind of what the right balance there is, I think, is a good
question and how we make the original software more reusable so there's less
duplication fits into that.
And then finally, again, following what Nigel was talking about, we want to
create and support career paths for software developers as well as for people
working in data curation and data issues. And this is also something where we
can't really do this by ourselves, right. As I think that Nigel was exactly
right. Giving money in grants helps, but it needs to be really a long-term
commitment and that has to come from the places where people are hired, the
universities and the laboratories.
And so I think that's all that I was going to say.
>> Yan Xu: Questions?
>> Daniel Katz: I'll also say that I'm happy to have general questions now and
I'll be around this evening and tomorrow as well for anything, anybody that
wants to talk about anything specifically.
>>: In terms of when to stop supporting software and general maintainability,
isn't there a parallel with, like, instrumentation? Can that serve as a model?
>> Daniel Katz: I don't know. I guess there's clearly life cycle issues in
almost any kind of infrastructure and any kind of research. But I'm not really
quite sure what the parallel is there. Like how far it is.
>>: If you support the creation of the instrument, at what point does it
become something that you don't want to support anymore.
>> Daniel Katz: Right. So I think there's multiple issues with software. And
one thing is that at least the feeling in a workshop last week in Chicago is
that community codes have a, kind of a natural lifetime of maybe 20 years or so
before the underlying computer architecture changes so much that we need to
write something new.
I mean, so that's one way of looking at it is that way. Another way is when
somebody develops something in the community, starts moving to that next thing,
then maybe we stop the first thing. But it's -- I don't know. We don't
necessarily want to be making decisions on who's winning and who's losing. We
want to try to figure out the way that the community makes these decisions and
we fund the things that the community needs in order to do the science. And I
don't know that it's really clear for us exactly how to do that yet.
>>: There are interesting intellectual property issues that go along with this.
If I have a grant to do science from NSF and I write software as a result of
that, I, or my university, own the copyright to that software. I can do what I
want in terms of it. If I'm funded from an outside project where the
deliverable is software, then the university owns it. Then the university owns
the copyright, at least in the University of California. Are there things you
do to encourage, in those circumstances, that that software, in fact, gets
released under a BSD license so that other people can use it?
>> Daniel Katz: Yeah, so in the --
>>: Because our IP office is impossible.
>> Daniel Katz: So if you write a proposal into the SI-2 program, you write
how you're going to make the software usable. And so if you, as a PI, write
that you're going to release it under GPL or whatever license you pick, the
university then will either agree with that or not. But if the proposal goes
in, then we assume that we're going to work that out.
I would say that OCI, in particular, unlike some parts of NSF, is not really
strict on Open Source or on any particular license. We are fairly comfortable
with other models. If you can make an argument that Open Source is not the
right model to make your software reusable and it's going to be more reusable
by having some licensing fee or working with some company, we're certainly
willing to listen to that. And if that's what the peer reviewers agree with as
well, this is the way the software's really going to get used, then that's
fine.
>> Nigel Sharp: I think that goes to the point I made that it's reviewed by
the community that's going to use it. I mean, if you say, I mean, you can say
anything. If you say we're going to commercialize this in a spin-off company
and the review process says yeah, we're fine with that. That's the best way to
make this available to the community, fine.
We've had cases where the reviewers say, you know, we don't think you should
fund this because this code is not going to be Open Source. So I think that's
one of the strengths. You say something, and your community says yay or nay.
>>: I guess I'm worried about if I were to do something like this, of getting
it through IP office in time to meet any deadline.
>> Nigel Sharp: But your office legally agrees to the conditions of a grant
before they draw dollar one from that award. So if they get an award and they
take a dollar out of it, they've agreed to all the conditions, including those
terms.
>>: Remember, it's always easier to ask for forgiveness --
>> Nigel Sharp: Not in the government.
>> Daniel Katz: I will say in one of my previous university roles, we had this
issue where Open Source software was actually very difficult in general. But
if it was something that was in a proposal and that's what you said you were
going to do, the university said you have to do it, that's fine.
>>: That makes it easier, okay, because I actually spent six months trying to
get something like this through our IP office, even though the award had been
agreed on.
>> Yan Xu: Okay. Thank you very much. And since our speakers are here tonight
to continue the questions and discussions, let's give them a hand. And before
we move to the last panel, Eric is going to make some announcements.
>>: And before he does that, we will have the group photo after this session.
Since we're running late, we'll probably have the panel discussion run until,
say, 5:15. The dinner is already ready out there. So when we're done here, we'll
all go out, take a picture, then you can eat.
>> Eric Feigelson: My name is Eric Feigelson, Center For Astrostatistics at
Penn State. George has asked me to review for you the organizational changes
that are relevant --
>>: Not to review, to mention.
>> Eric Feigelson: In three minutes, okay. It's an alphabet soup. I'll start
at 2010. The International Statistical Institute is the largest, oh,
150-year-old association for statisticians. It formed an astrostatistics
community and now has 150 members. This is now morphing into the International
Astrostatistics Association -- this is going to be a joke, there are so many
initials here -- probably in the next few weeks or months, which is going to be
an independent international organization, a 501(c), to allow the ISI to raise
it into something called an associated society.
So the statisticians are quite interested in astrostatistics. You should just
get the message there.
A famous project called the LSST in America, the largest NSF-funded project for
the next decade, I think, I'm not sure, formed the ISSC, the Information and
Statistical Science Collaboration in 2010. The leader of this one is Professor
Joseph Hilby, retired from Arizona State. The leader of this one is Kirk
Borne, who is sitting right over here, and this group has been fairly quiet but
is now sort of beginning to talk a lot about how to contribute statistical and
informatics expertise into the LSST project.
Now, in 2012, that means the last three months, two things have happened. One
is the American Astronomical Society has formed a working group in
astroinformatics and astrostatistics. And this is now one of a number of
working groups that can last for years within the AAS, which is the largest
national society in the world of astronomers.
This indicates that the astronomical community, at least in the United States,
is very interested in astroinformatics and astrostatistics.
And then last week, the International Astronomical Union, that's the largest
international group of astronomers, formed a working group, I'm sorry for the
joke here, of astrostatistics and astroinformatics, which is slightly different
than the other working group name. And that shows that the international
groups of astronomers are very interested in this.
Kirk Borne is the leader of this. [indiscernible], a professor at the
University of Washington who will be here tomorrow, I'm told, is the head of
this. And their leadership is going to be decided very soon. I may be
involved. Actually, I'm sort of involved slightly in all of them.
And there is something that I am involved in. This is sort of why I'm
interested in this, is that I am sort of, Joe Hilby and I have formed something
called the -- let's forget writing it all out. The astrostatistics and
astroinformatics portal, and it's ASAIP.PSU.EDU -- PSU is my university, it's
just hosting this. And the reason for promoting this is that this is going to
be a vehicle for at least three and possibly all four of these organizations to
talk to the public and to talk to themselves in public and to have resources
available both for the expert community and for the innocent community, the
thousands of astronomers who don't know very much about informatics and
statistics. And also for the communities of computer scientists and
statisticians to just sort of browse astronomy and find out what interests
them.
So I encourage you to -- this website is not finished. In fact, you may go
there and feel it's rather incomplete. Well, like it has five forums that sort
of don't have any discussion threads yet. Well, that's because it hasn't
begun. So I very much encourage you to do two things. One is join one or more
of these societies that pleases you. And two, pay attention, especially over
the next few months, to the website, become a member of ASAIP, to contribute to
it and sort of build the online social network intellectual community of
computer scientists, statisticians and astronomers who have mutual interests in
methodology for astronomy. Thank you.
>> Yan Xu: So the panelists of the last panel, please.
>> Kirk Borne: I'm Kirk Borne. I was just mentioned in reference to the LSST
group by Eric. Thank you. So I have been involved in astroinformatics, before
it had that name, for a long time, with the virtual observatory since its
infancy.
>> Masatoshi Ohishi: My name is Masatoshi Ohishi. I'm from Tokyo, National
Astronomical Observatory of Japan. I have been involved with the virtual
observatory activity in my observatory for the last ten years. And last week,
I'd like to say, we had special session number 15, which was associated with
the General Assembly. The aim was to discuss not only the current status of
various virtual observatories but also how to extract new information through,
you know, statistical data analysis, including astroinformatics. So I'm very
happy to be
here.
>> Bob Hanisch: I'm Bob Hanisch from the Space Telescope Science Institute,
and I'm the director of the Virtual Astronomical Observatory Program in the
U.S. I also am the new president of commission five that I've taken over from
Masatoshi. Term limits, term limits is what it is.
And the working group that Eric just talked about is under commission five in
the IAU. So how shall we proceed? I've got a few notes of things I wanted to
say to start things off.
>> Yan Xu: Please go ahead.
>> Bob Hanisch: So as Masatoshi just alluded, the virtual observatory has been
basically a decade in the making. The initial problems we tackled were quite
easy. We sort of, you know, skimmed the cream, and we were very optimistic,
perhaps overly optimistic, about how easy it would be to carry on from there.
Things got more difficult. Achieving national and international consensus
takes time. The infrastructure, though, that we conceived and saw evolve over
the past decade is now mature, to the point that we are now in some cases doing
second tier and third generation evolution of capabilities.
The core capabilities of data discovery, access and interoperability are
largely solved problems. And VO-based data access is really now pervasive.
Many people in the community do not understand that they are using virtual
observatory protocols every day. There are millions of VO-enabled data
accesses per month just from the U.S. via collaborating organizations, the ones
that we count, and I'm sure it's another factor of two to ten more than that
when you integrate all over the worldwide VO-enabled data resources.
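Much of that pervasive access rides on protocols as simple as the IVOA Simple
Cone Search, which is just an HTTP GET with RA, DEC and SR parameters in
degrees returning a VOTable; the service URL below is a placeholder, not a real
endpoint:

    # Simple Cone Search sketch: one HTTP request against a (placeholder)
    # SCS service URL; the response is a VOTable document.
    import requests

    SCS_URL = "https://example.org/scs"          # placeholder service URL

    params = {"RA": 180.0, "DEC": 2.5, "SR": 0.05}
    response = requests.get(SCS_URL, params=params, timeout=30)
    response.raise_for_status()

    votable_xml = response.text                  # parse e.g. with astropy.io.votable
    print(votable_xml[:200])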
So the VO is here now and people just still are not as aware of it as they
should be and, of course, I think those of us in the VO projects do have to
take responsibility for making sure that this dissemination works better in the
future.
So the challenge now is scaleability. Scaleability has always been part of the
plan. We sort of started the VO initiatives in partnership with grid projects.
But frankly, astronomers were not interested. They, for the past decade, have
said oh, I can just get the data on my workstation, on my laptop, and I'm
happy. Don't tell me about the grid. Don't tell me about the cloud.
But they have to get interested now, because in the coming decade, the data
sets being produced by the observatories of the future, or even now, with,
like, LOFAR, are just not possible to bring onto your laptop or your desktop
and analyze there.
And, of course, astroinformatics and astrostatistics are part and parcel of
this next stage. Eric did not mention, for example, his VOStat package, which
has now been updated and released by Penn State. There's a nice poster here on
CANFAR and Skytree. So we're seeing these astrostatistics and informatics
tools being exposed more widely to the research community, and getting uptake of
them, I think, is now a challenge that has already come up today in several
other contexts.
So this whole thing of community engagement is, I think, where we are now. In
the VAO, we are participating in summer schools, we are hosting community days.
We are going to conferences like the AAS and helping to organize things like
the IAU special session to help get the word out that the VO is here, it's
ready, and we're really trying to push out capabilities to the community so
that they can develop their own VO-enabled tools. We can't write everything
for everybody. We've never really intended to do that. But we can provide an
environment that makes it easy for people to develop things that build on the
VO infrastructure.
And, of course, coupling this with education of the community as Eric, for
example, has been doing for many years now about astroinformatics and
statistics is, I think, critical.
So now is the time. I think we've got the basics done, and we are ready to
move into the age of big data, big information and using these tools that will
allow people to distill the essence of very complicated data products into new
science.
>> Yan Xu: Okay.
>>: Can I ask about the meaning of the title of this session, why from virtual
observatory to astroinformatics and any other combination of words?
>> Bob Hanisch: Well, that was George's title.
>>: I feel like it was a collection of words.
>>: It was the lottery.
>> Bob Hanisch: That's what came out.
I think the VO and astroinformatics are a partnership, not a
transition, although there is a change in emphasis here from discovery and
interoperability to now incorporating these tools for informatics and
statistics into this framework.
So in that sense, there is a from and a to. But I think it's an ongoing
partnership.
>> Masatoshi Ohishi: I'd like to add some words. The VO concept consists of
three categories. One is registry service, how to -- you know, in order to
find data services. Then after finding specific data services, we retrieve
data.
The next step is computing service. And my understanding of this computing
service includes, you know, statistical data analysis or astroinformatics and
astrostatistics. That's why, you know, VO, from VO to astroinformatics is
closely related.
>> Kirk Borne: So I just want to say a few words about the end of that title,
astroinformatics. More than once today, people have said I don't know what
that is, and so I'm just going to take the opportunity, since I'm in the chair
right now, to say what I think astroinformatics is and why I think it is a
research discipline.
For quite a while now, I've been in this area. But the sort of quantum jump in
my interest in this field happened back in the summer of 2003, when I heard Jim
Gray give that talk on online science, which we heard mentioned this morning.
He has given that talk in a number of venues in 2003,
2004.
And he talked about X-informatics, where X referred to any discipline.
Bioinformatics, geoinformatics, and he didn't actually say astroinformatics in
that particular slide, but that transition in my head from bio to geo to astro
happened, the a-ha moment for me was these are independent, in a sense
independent research disciplines within the bio community, within the Geo
community. They have their own journals. They have their own departments.
They have their own conferences.
They're not infrastructure. Too often, there's feedback that why are you
calling this astroinformatics. You're just talking about computing
infrastructure. And it's almost the same sort of self-definition, or
self-identification, question that the VO community has had, which is: are you
just a computing infrastructure project, or are you an astronomy research
project. And I think those who have been working on these projects recognize
them as astronomy research projects. But in order to make the astronomy
research enabled, you need that infrastructure.
So when I think about astroinformatics, I think about many pieces and maybe
this is too broad. But for me, it is broad. It's the data access and
discovery. It is these registry services. It is what terms do you choose to
index the data, to tag the data, to edit the data. And those could be done by
humans or they could be automated.
We heard a little bit this morning about automatic tagging, automatic keyword
extraction, whether it's in an abstract of a paper that talks about the dataset
or if it's actually, you know, a set of keywords that the author provided.
But how do you index data to make it discoverable and accessible through things
like the VO? So it's that data access, discovery, data management stuff, if
you will.
There's also data structures. I think of data structures as informatics. So
how do you visualize multidimensional data? I mean, you have to put the data
in some special structure, perhaps, to enable that. How do you do an N-point
correlation function in a thousand dimensions? Again, what is the data
structure? Is it a graph? Is it a table? Is it a cube? So these things are
part of research. Understand what are the best ways to make the data
interactively explorable by the research community to enable research.
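As one concrete example of how the data structure matters, a minimal two-point
pair-counting sketch with a k-d tree, which avoids comparing all N^2 pairs
directly; the random points stand in for a real catalog:

    # Pair counting with a k-d tree: how many pairs lie within each radius.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(1)
    points = rng.uniform(0.0, 1.0, size=(50_000, 2))   # toy 2-D positions

    tree = cKDTree(points)
    radii = np.array([0.001, 0.002, 0.005, 0.01])

    # count_neighbors on the tree against itself counts ordered pairs and
    # self-pairs, so subtract N and halve to get distinct pairs.
    counts = tree.count_neighbors(tree, radii)
    distinct_pairs = (counts - len(points)) // 2
    for r, n in zip(radii, distinct_pairs):
        print(f"r <= {r}: {n} pairs")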
And then, of course, more than that, so it's those things, but it's more than
that to me. It's also the machine learning, or the application of machine
learning to large data sets, which is data mining. So frequently, those terms
are interchangeable. So I tell my students, the application of machine learning
algorithms to big data is called data mining. The application of machine learning to
machinery is robotics.
So you talk about decision trees and neural nets, et cetera, these are the
exact same algorithms that you learn in those robotics systems.
So developing those algorithms is research, okay. It's not computing
infrastructure. It's actually applied mathematical and, in effect, in some
cases, pure mathematical research.
And then, of course, statistics. So statistics, again, there's such a rich
wealth of algorithms historically, plus still being developed by that
community, all of that enables discovery from data. So these are all
informatics and specifically applied to astronomy, astroinformatics.
And lest I leave one off, seeing Matthew sitting here, semantic astronomy. The
whole concept of ontologies, knowledge representation. Because when we're
dealing with big data, the real issue is how do we get the data to a point
where it's so manageable, I can do something with it? So it's that extraction
from data to information to knowledge to get us to that insight.
So it's that value chain where we're dealing with petabytes down here at the
data level, but how do we go from that to extracted information from that,
finding the patterns, correlations, trends in the data in that information
stream, which is now the knowledge, okay. And from that knowledge, actually
understand what it means for the universe. And that's the insight and
understanding step.
So that's quite a reduction. So you can imagine all of the data that goes into
a project that's looking for the Hubble constant or the dark energy parameters.
At the end of the project, there's one number, which people publish, and then
they become famous. But in order to feed into that one number, there's an
enormous amount of experimental results.
And so how do we do that for other types of things which are maybe not quite as
sexy and high profile as dark energy or Hubble expansion or something like
that. But how does the scientist deal with huge quantities of data to extract
that nugget.
So for me, in my career, it was always the galaxy merger rate. For me, that
was the holy grail of my career. I was going to help find what the galaxy
merger rate is. And whether I achieved that or not is probably almost
immaterial now at this late stage of my career here. But for me, the things that I
did were driving towards a single number: the number of mergers per galaxy per
Hubble time. Of course, that's an oversimplification of a complex problem.
But that's sort of the idea, how do we go from the big data to the big idea,
which is what we're seeking in science. And so for me, astroinformatics
incorporates all of those pieces, and maybe that's too broad of a definition.
But for me, I think that explains why we can have a community sitting around
this room where we're talking about data management. We're talking about
databases. We're talking about data integration, data warehousing in a
previous talk, but also at the same time talking about statistics and more
tomorrow about data mining algorithms and the applied math that goes into all
of this.
>>: I think Kirk's basically right, although I think you missed a couple
things.
>> Kirk Borne: Well, yeah.
>>: Libraries, electronic publishing, and also, I think, means of enhancing
collaboration and communication.
>> Kirk Borne: I just didn't have it all.
>>: So the way I view any science informatics is as the means by which computing
technology can stimulate and has stimulated some domain science. In our case, astronomy.
And in that sense, it's probably going to be a perishable field, because soon
enough, all of astronomy is going to be astroinformatics, or maybe it already is.
All of biology will be bioinformatics, in some sense. This will be just the
normal way of doing things.
But until that happens, until the communities embrace new ways of doing things, we do
need these bridge fields.
The second role for the science informatics fields is, I think, to talk to each
other. To share the methodology and tools or ideas or experiences so that we
don't have to reinvent the wheel. And in that sense, I think you're right.
It should be broad, and anything that helps us do better science using computing
and information technology in some tangible fashion belongs to it.
So in that sense, I would say statistics obviously means understanding the
data, but it's just one part of the whole picture. I think the only reason why
we have working groups for both astroinformatics and astrostatistics is that Eric
is interested in both and he was the one who took the trouble to form them.
>>: Just a short answer to that, I like to say that the informatics deals with
the volume and statistics deals with the complexity, the variety.
>>: I disagree. I think that data mining deals with complexity.
>>: No, I'm just saying that the discovery -- dealing with large data is
different than the sort of statistical validation of your results, which can be
on small data.
>>: Well, [indiscernible]. Hubble [indiscernible] take a dozen or 20 spectra
or photographic plates, look at them, do a bit of thinking, and come up with the
expansion of the universe. We now deal with petabytes of data. No human mind
can actually deal with that amount of data. So for me, astroinformatics is the
process of extracting science from the data, because we've moved beyond the
point where you might just look at the data and extract the science. Those steps
are actually difficult, all the way from data to information to knowledge
to science. You need sophisticated tools.
>>: I think he said the same thing, only more compactly.
>>: In the game of alternative definitions, the definition that I've been
working with, in terms of talking about a discipline or sub-discipline
of astroinformatics, I think could compete with Kirk's, probably could compete with
George's and maybe [indiscernible]. It focuses on informaticians or
informaticists, in the sense that there may be a category of people who
are perhaps computer scientists who find this a
particularly interesting area, or scientists gone to the bad, in terms of
getting their kicks more from the technology and making their contribution
through the intersection area.
So it may be that if there's a discipline of astroinformatics, it's primarily
motivated by, is defined by, people whose interest lies in [indiscernible],
some sort of intersection between those two disciplines.
>>: Let me say that I find this very refreshing, because for almost a
century, of course, one could ask [indiscernible] to provide an operational
definition of geometry, and I think you all know that the answer would take
a couple of days. In a couple of days, they would say geometry is the set of
knowledge which [indiscernible] over X [indiscernible]. It's not a
definition --
But in any case, even if the definition is a familiar thing, at least we find a
[indiscernible] definition of [indiscernible]. For instance, I had the same
feeling as George. Astrostatistics and astroinformatics work together, but they
are two completely different things, and it's difficult to say which includes
which.
In my opinion, astroinformatics includes astrostatistics, because
astrostatistics is one thing which is needed to do astroinformatics, but I may
be wrong about that. I will discuss it with Eric.
But I don't agree; it's not the specific [indiscernible]. What we're discussing
is what's happening here, and astroinformatics is actually the knowledge we need
in order to serve that. The problem is just that the physics is phasing out. So it is
agreed that this incorporation of [indiscernible] networks is
everything which we need to solve this problem.
>> Yan Xu: Okay. It's been a long day and we're behind schedule, so we'll take
the last question and, of course, we'll have the opportunity to continue
discussions at lunch time. Dinner time. Alex.
>>: Just to return the favor for the quantum computing question, to get you
out of your comfort zone, what would the world be like in this field ten
years from now? Just a three-sentence question.
>>: The outgoing president of --
>>: Ten years from now, I believe, you know, astroinformatics or
astrostatistics, whatever it is, various useful tools will have prevailed and
many instruments will be used without our even seeing them. So that's the way we
should go, and I'd like to stress that useful tools and successful use cases, I
think, are the keys to our success.
>>: I'm hoping in ten years astroinformatics won't exist. It will be totally
irrelevant.
>> Yan Xu: Well, let's thank all the speakers for very stimulating discussions
and presentations, and the panelists, who got us set for the next two days.
Okay. We'll follow George's instructions for the picture and then let's enjoy
dinner. Thank you very much again.