>> Ashish Mahabal: Okay. So I'm going to talk about connecting Bayesian networks and semantics, how we can try to use expert knowledge and lay it over various networks so that one can gradually get more out of them, and build them so that they become more useful. And this is the work of many, many people, many of whom you know: George, Matthew, Ciro, and so on.

But there's also a student, Alex Ball, an undergrad who was working with me over the summer, and he put a lot of work into the Bayesian networks that I'm going to talk about.

So you of course already know about all these big surveys that are happening and that we are getting lots and lots of data. And we will be getting even more data. The main thing is that the number of columns will be increasing, not just the number of rows. And that is where a lot of the interesting part will come in, because these columns, these parameters or variables, are going to be very different from each other: not just flux measurements in different filters and so on, but all kinds of different measurements, connectivities and so on.

And so that is where we'd like to use many of these methods. So consider CRTS. So far it has discovered more than 6,000 transients. And there are various kinds of transients: supernovae, cataclysmic variables, blazars, and so on.

And associated with each of these transients is a different set of parameters. So not the same kind of parameters are useful for looking for these different transients.

But the [inaudible] are what typically have been used a lot. What we can look at is that there are many discovery parameters: magnitudes and delta magnitudes, at least if you have a couple of measurements, and the filter itself. And there are many, many contextual parameters. What is the distance to the nearest star? What is the color of that star? What is the magnitude of that star? Then, what is the nearest galaxy?

And that is a more dicey question, because there can be more than one galaxy, the radius of the galaxy comes into the picture, and so on. So we'll be looking at more details of that particular thing.

Then the flux of the nearest radio source, galactic latitude and so on. All of these are contextual bits that we can get from archives that already exist. Then if you do any follow-up, what are the colors that you get from follow-up? That becomes an interesting set of parameters that feeds back and forth between follow-up and classification.

Any prior classifications that may exist can be reused. And then of course characteristics from the lightcurves; I'm listing only a few here. Matthew has been putting a service together which has several already, and more are going to go in.

So you can have simple things like amplitude, just the difference between the minimum and maximum, or half of it; or the median buffer range percentage, where you find the median and see how many points fall outside a certain range of it; or just things like standard deviation and Stetson K and so on.
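A minimal sketch of how such lightcurve statistics might be computed, assuming an array of magnitudes (and optionally per-point errors). The exact conventions -- the 10 percent buffer in the median buffer range percentage, the use of the sample standard deviation when no errors are given -- are my assumptions, not necessarily what the actual characterization service uses.

```python
import numpy as np

def amplitude(mag):
    """Half the range between the faintest and brightest measurements."""
    return 0.5 * (np.max(mag) - np.min(mag))

def median_buffer_range_percentage(mag, frac=0.1):
    """Fraction of points lying within +/- frac of the full range around the median
    (one common convention; the complementary 'outside' fraction is 1 minus this)."""
    med = np.median(mag)
    buf = frac * (np.max(mag) - np.min(mag))
    return np.mean(np.abs(mag - med) < buf)

def stetson_k(mag, err=None):
    """Stetson K: mean absolute deviation over RMS of the error-normalized
    residuals; roughly 0.798 for a Gaussian lightcurve."""
    mag = np.asarray(mag, dtype=float)
    n = len(mag)
    if err is None:
        err = np.full(n, np.std(mag, ddof=1))   # assumption: no per-point errors given
    delta = np.sqrt(n / (n - 1.0)) * (mag - np.mean(mag)) / err
    return np.mean(np.abs(delta)) / np.sqrt(np.mean(delta ** 2))

# toy usage on a made-up lightcurve
mags = np.array([17.2, 17.4, 17.3, 16.9, 17.5, 17.1])
print(amplitude(mags), np.std(mags, ddof=1),
      median_buffer_range_percentage(mags), stetson_k(mags))
```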

And we are also developing something called a prior outburst statistic, which tries to take [inaudible] and figure out how many outbursts that particular object had, of what intensity, and so on. And that will be very useful in many cases.

The reason I show these specifically here is that many of these parameters are missing for most of the transients. So we have lots and lots of columns, and even more rows, but many of the entries are completely missing. That missing information is something that we need to live with. And Bayesian networks, or Bayesian methodology in general, are very good at dealing with that, because missing parameters are okay there.

And, in fact, what is said is that you can build your network from your data itself. So you can find where the connectivity should be. But this is true in principle; if you want to actually do it in practice, that seems almost impossible just because of the complexity that is involved. So I'll be coming to that later.

So we are interested in classifying all types of transients. Matthew showed this earlier.

This has some additions that we have done: not just one supernova type but several subtypes that may be possible, and not just one AGN class but many AGN subtypes that come in there. We are interested in all of these, and we need to be able to find parameters that can tease apart the different classes here.

So we have been using some things like GPR, which is good for certain specific cases, and in this case it's the lightcurves that have been used. I won't be going into details of that. Or the dm/dt method that was also alluded to before by Jeff and Matthew and Ciro.

What we do is take a single lightcurve, take all the dm/dt pairs for it, and then try to compare that with models that have been built for different classes using many, many lightcurves for each class. And then, quantitatively, again using Bayesian methods, we determine which class it is likely to fall in.
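A rough sketch of that idea, under my own assumptions: all pairwise (delta-t, delta-m) values of one lightcurve are binned into a 2-D histogram, and compared with per-class template histograms (built beforehand from many lightcurves) via a simple multinomial log-likelihood plus a class prior. The actual binning and scoring used in the work described here may well differ.

```python
import numpy as np

def dmdt_image(t, m, dt_bins, dm_bins):
    """Counts of all pairwise (delta-t, delta-m) values of one lightcurve."""
    t = np.asarray(t, float); m = np.asarray(m, float)
    i, j = np.triu_indices(len(t), k=1)          # all pairs with i < j
    dt = np.abs(t[j] - t[i])
    dm = m[j] - m[i]
    hist, _, _ = np.histogram2d(dt, dm, bins=[dt_bins, dm_bins])
    return hist

def log_likelihood(obs_counts, template_probs, eps=1e-6):
    """Multinomial log-likelihood (up to a constant) of observed dm/dt bin counts
    under a normalized class template."""
    return np.sum(obs_counts * np.log(template_probs + eps))

# hypothetical usage, with made-up bin edges, templates and priors:
# obs = dmdt_image(t, m, dt_bins, dm_bins)
# best = max(templates, key=lambda c: np.log(priors[c]) + log_likelihood(obs, templates[c]))
```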

So for some broad classes, like supernova versus non-supernova, we can easily get to 98 percent. But as we go down the tree, that decreases. And what is important then is trying to determine which parameters are going to be useful in trying to discriminate better there, and whether we can include those.

This is the combiner picture Ciro showed earlier, where we have different kinds of classifiers that come in and we should have a combiner which gives a final classification. Again, there are many, many difficulties with that, and I'll not be going into those details right now. But that is something that's going to be very useful, especially when you consider that some methods are based on lightcurves whereas many other methods are based on other parameters that you are going to be using.

So each method is going to take only a few parameters and do something from them, something intelligent, and then we would want to combine all those bits into a single one anyway. Curt mentioned this earlier about characterization, and it's something that we have been looking into. So if you consider a supernova -- and again this is a very coarse example in the sense that it's not completely true, and that caveat has been mentioned here -- but still, if you are talking about large delta-mags at small delta-times, it's still true.

So a supernova typically goes up and goes down. And so when it's going up, this is how it's increasing, and when it's going down, this is how it's decreasing. But if you have only three points and this is what you see, then it's very unlikely to be a supernova. So that's something that you can quickly say based on just three points. And that is what I would say characterization is. And one needs to get into more and more characterizations for different kinds of objects and try to build them together for many things. So in the case of a supernova, that's sufficient, but not for all other classes, of course.

Most surveys that look for supernovae do so using image subtraction: they typically target large galaxies and subtract an archival image from the latest image so that the source stands out. Catalina doesn't originally do that: the main data comes in and the catalogs are used, and so on. But we have also been running the Supernova Hunt, where galaxies actually are subtracted and we look for supernovae like that, and many have been found.

What happens there, mainly, is that the galaxies that are targeted are the large, big, known, brighter ones. But we have been finding many, many supernovae using the catalog domain where the supernova hosts are dwarf galaxies, and it's not easy to find those. So even there, going into the catalog domain helps a lot.

So, Bayesian networks. We have been using a naive Bayesian network for some time now, where what we make use of are follow-up colors from the Palomar 60-inch telescope and some incidental parameters like the galactic latitude and other parameters like radio, for instance.

So we have been using these small number of input parameters and about six different classes, AGN, supernovae and so on.

So what happens in this case? This is a multinomial model. What that means is that you have got all the six different classes sitting there, and there are probabilities, priors, that come out of these colors and the other parameters that you have. And you can see exactly what the probability of each output class is likely to be.

Again, it comes from your priors. So the better your priors are, the better your output is going to be. What is important is that every classification gets fed back into the original, and the classification keeps on improving. And it is very fuzzy in the sense that you can define your dependencies here, but the dependencies again will have to come originally from you.

And in the case of the naive network, you are not really defining interdependencies between different parameters; all of them connect to the single class node. And for the moment that is not implemented, which would have made it actually non-naive.

Here you've got two layers. But that is the final aim that one is going after.

So here what happens is that you can use the naive Bayesian network when you can say that the components of your input vector are completely independent of each other. And of course that independence is not completely true. For instance, the g minus r and r minus i colors are related to each other. And that is the kind of information that needs to go in to build a higher-level network which will perform even better.
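To make the naive assumption concrete, here is a minimal sketch of a naive Bayes classifier. The actual network described in the talk is multinomial over discretized parameters; this Gaussian version is a simplification, but it shows the two properties being discussed: every feature connects only to the class node, and missing parameters (NaN) are simply skipped rather than breaking the classification.

```python
import numpy as np

class NaiveBayes:
    """Minimal Gaussian naive Bayes that tolerates missing (NaN) features."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.mu_, self.sigma_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)          # class prior from training set
            self.mu_[c] = np.nanmean(Xc, axis=0)
            self.sigma_[c] = np.nanstd(Xc, axis=0) + 1e-9
        return self

    def predict_proba(self, x):
        logp = {}
        for c in self.classes_:
            ll = np.log(self.priors_[c])
            for k, v in enumerate(x):
                if np.isnan(v):          # missing parameter: contributes nothing
                    continue
                mu, s = self.mu_[c][k], self.sigma_[c][k]
                ll += -0.5 * np.log(2 * np.pi * s ** 2) - (v - mu) ** 2 / (2 * s ** 2)
            logp[c] = ll
        m = max(logp.values())
        norm = sum(np.exp(v - m) for v in logp.values())
        return {c: np.exp(v - m) / norm for c, v in logp.items()}
```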

So within the Bayesian networks actually there are many possibilities that one can try.

The naive Bayesian network works quite well overall, because you don't have to impose on it any additional information, part of which may be wrong in some cases.

Then there are the tree augmented networks. You start with the naive network and then let the algorithm figure out, based on the various probabilities going into it, the dependencies that may exist: whether some of those nodes need to be connected, and in which order. And that forms a new network for you.
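One standard way to build such augmenting edges (due to Friedman et al.) is to weight each feature pair by its conditional mutual information given the class and keep a maximum-weight spanning tree; whether the tool used here follows exactly this recipe is an assumption on my part. A sketch, assuming already-discretized features:

```python
import numpy as np

def cond_mutual_info(xi, xj, c):
    """Empirical I(Xi; Xj | C) for discretized variables."""
    mi = 0.0
    for cv in np.unique(c):
        sel = c == cv
        pc = sel.mean()
        xi_c, xj_c = xi[sel], xj[sel]
        for a in np.unique(xi_c):
            for b in np.unique(xj_c):
                pab = np.mean((xi_c == a) & (xj_c == b))
                pa, pb = np.mean(xi_c == a), np.mean(xj_c == b)
                if pab > 0:
                    mi += pc * pab * np.log(pab / (pa * pb))
    return mi

def max_spanning_tree(weights):
    """Kruskal's algorithm; returns the edges of the maximum-weight spanning tree."""
    d = weights.shape[0]
    parent = list(range(d))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    edges = sorted(((weights[i, j], i, j) for i in range(d) for j in range(i + 1, d)),
                   reverse=True)
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

def tan_edges(X, y):
    """Undirected augmenting edges for a tree augmented naive Bayes network.
    (A full TAN would then orient the tree from a root and keep the class as a
    parent of every feature.)"""
    d = X.shape[1]
    w = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            w[i, j] = cond_mutual_info(X[:, i], X[:, j], y)
    return max_spanning_tree(w)
```

Note that the tree is chosen purely on statistical correlation, which is exactly why it can connect physically unrelated nodes, as described next.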

So what we have found when we tried that is that many times the wrong connections are added. So one has to be careful if one is using the augmented network.

And then you can construct a network where you use semantics and expert knowledge to connect specific nodes. And that of course is going to be better. But when you are dealing with hundreds of parameters that may not be possible. So what will be needed is to try to build it piecewise: start with one kind of transient, see what parameters need to be connected for that, start with another one, see what parameters need to be connected for that, and then connect them in a hierarchical network.

And there, of course, what one can do is have several naive networks, one for each class, and then put them together into a single one, where you put constraints at the top level so that you say: okay, you have all these naive networks and only one of them is likely to be the true winner, so let's try to pick it.

And ideally the network should really be fully learned from the data. But as I said, that's almost impossible, because if you have five different parameters, then the total space of networks in which you have to look is about 30,000, which is feasible. But by the time you reach 10, it's 10 to the 18. So there's no way that you can explore that entire space fully to come up with the best network automatically from the data.
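The quoted figures correspond to the number of possible directed acyclic graphs on 5 and 10 nodes, which can be checked with Robinson's recurrence (a quick sketch; it counts all labelled DAGs, before any acyclicity-respecting search heuristics are applied):

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def num_dags(n):
    """Robinson's recurrence for the number of labelled DAGs on n nodes."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print(num_dags(5))    # 29281, i.e. the "about 30,000" quoted above
print(num_dags(10))   # 4175098976430598143, i.e. roughly 4 x 10^18
```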

So here is an example of what a naive network will do if you give it about 20 different parameters. This is just a zoom-in of the network that you see here. All nodes are connected; each node is a variable, and each has a prior that comes from the transients that we have given it as input. All of them are connected to the class node, where the class is one of these roughly a dozen different classes. And what is important to see here is this.

So you get completeness of about 80 to 90 percent for three classes, and not so good for a couple of other classes. But remember that this is a completely naive network. Nothing else has been told to it. All that it has is, for each of these parameters, for these several sets of transients, the behavior of those parameters for those classes.

And the network learns by itself. And the hope, the aim, is that we can do far, far better than that.

But when it comes to the entire network, it can do interesting things like this. So in this naive network, class was connected to only three parameters -- the third is not shown here -- and then the naive network [inaudible]. But then we asked it to make a tree augmented network with that, to let it figure out whether there should be links put in which it thinks improve performance.

So it did build something. And the nearest star of course has nothing to do with the nearest galaxy, because those are completely different beasts. But it did connect them.

So one would have to be careful with something like that.

So this is what one can come up with with tree augmented networks. This is the performance of a tree augmented network that was built completely blindly, starting from the naive network. And what we notice here, compared to the earlier one, is that for some classes there was an improvement in the performance, but for some other classes not so. So caveats again there.

So this is what we would like to get to finally. Here is class one and the naive network related to class one; the features are node one, node two, and some of the nodes for class two. Some of those features may be common, but not all features go in for the Nth class; some other parameters go in. So all these different naive networks get connected, and finally a constraint determines which of those classes is going to [inaudible]. So that's what we would like to get to.

So as part of that, what we did is build a supernova versus non-supernova classifier using a very small number of parameters. The parameters that we decided to use were distance to the nearest star and galaxy proximity -- here we used the normalized galaxy distance -- and both of these can come from archives, so long as the field has been observed by some survey before, say the Sloan survey. And from the archival lightcurve we try to come up with a single characteristic, which is based on the peaks that one sees between the start of the lightcurve and the end of the lightcurve. So we do not use any of the other lightcurve features, do not use the magnitude of the transient or anything else. Just these three parameters, and we see what we can get from there.

For each of the three parameters, of course, there are interesting caveats to look at, and I'll go into details of some of those. Like proximity to a galaxy, right? Sure, there is a galaxy nearby. But what happens if there is another galaxy nearby? In some cases we show a transient where there were two galaxies. So which of those galaxies are you going to use? This small area has been zoomed in here. In the original picture, if you don't look close enough, you might think that this is the galaxy, maybe associated with this possible supernova. But this shows that, oh, there is a galaxy that's much closer that we nearly missed here. So maybe that is the one. So that is where the radius needs to come into the picture.

And then the other thing that happens is that if you are asking a catalog to give you the nearest galaxy, the catalog itself was made at, say, three sigma or five sigma. It's not going to have all the galaxies. So it may be important to go back to the original pixels, maybe run Source Extractor, get that information and do additional stuff with it.

Then if there is a coincident star, that will tell you that it's definitely not a supernova if there is a star right there. So there can be [inaudible] inputs that go in of that nature. On the other hand, if there is a radio source that coincides with it, then it's unlikely to be a supernova -- but if it is a supernova, it is an interesting thing. So one has to be a bit careful again with input parameters of that kind. In this particular case we are not using the radio flux or radio proximity as a parameter at all.

Here is another example of a supernova which could belong to either of those two galaxies that you see here. And this is the kind of schematic that we build from archives. We started using the De Vaucouleurs radius, but the De Vaucouleurs parameters from Sloan are not great. That fit is for elliptical galaxies, and we found that there are many, many galaxies which are very elliptical given the ellipticities in the catalog, so we shifted to using the Petrosian radius.

Some of those looked like this, in fact, and they clearly did not make sense; or some of those radii seemed to overlap, and clearly in the picture they did not. So it is the Petrosian radius that we started using, and it seemed to be much better in this case.
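A minimal sketch of what a normalized galaxy distance of this kind could look like, assuming candidate host positions and Petrosian radii (in arcsec) pulled from an SDSS-like catalog; the function and column names are hypothetical, not the actual pipeline's.

```python
import numpy as np
from astropy.coordinates import SkyCoord
import astropy.units as u

def normalized_galaxy_distance(transient_ra, transient_dec, gal_ra, gal_dec, petro_rad):
    """Angular separation to each candidate host divided by its Petrosian radius.
    Taking the minimum means a big galaxy a bit farther away can still 'win'
    over a small galaxy that happens to be slightly closer on the sky."""
    t = SkyCoord(transient_ra * u.deg, transient_dec * u.deg)
    g = SkyCoord(np.atleast_1d(gal_ra) * u.deg, np.atleast_1d(gal_dec) * u.deg)
    sep_arcsec = t.separation(g).arcsec
    norm = sep_arcsec / np.asarray(petro_rad)      # petro_rad assumed in arcsec
    return norm.min(), int(norm.argmin())          # best value and index of that host
```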

Then one has to be careful when a potential supernova is in a cluster, because near the center, near your potential transient, you don't see any galaxy. But there may be several galaxies around, all of them with known [inaudible] which are close to each other. So potentially there is a much fainter galaxy close to your transient, and it may be a supernova belonging to that galaxy. So that has to come in.

And all that again comes from semantics or expert knowledge that can go in.

So this is the corresponding schematic picture of the previous image, where there's a galaxy cluster.

And so what one can come up with is a prior of this nature, where one asks oneself: given all the supernovae that you know about with almost certainty, and you look at the distance to the nearest galaxy normalized by the size of that galaxy, how are those distributed? And that is where you can apply a cutoff, if you want one. But then again, probabilistically, the Bayesian network can do it for you, depending on how you have defined it.

But if you do want a hard cutoff, to go to a rule-based decision tree, you could also use something like that.

So the proximity statistic that I mentioned -- the peak statistic, rather -- this is how we define it. Consider a CV: this is where, say, it was detected, because there was a big jump, and then you look at the archival data and find that there were many peaks there. On the other hand, a supernova host galaxy may have been detected many times, but this is where the supernova was detected, and there was no big outburst before.

So what we do is take the 80 percent faintest points in the lightcurve, determine their median, then define some sigma and ask how many points above that one sigma or two sigma level are seen in the lightcurve. Find the earliest such point, find the latest such point, and then do an additional step like connecting all the nearby points, so as not to miss other points that are close to those peaks: so, for instance, all these points get treated as a single peak, all these points get treated as a single peak, and so on. And then we ask ourselves: what is the ratio of the delta-time from the first peak to the last peak, divided by the total length of the lightcurve?

So the delta-t defined by first peak to last peak, divided by the delta-t defined by the length of the lightcurve that we have. In the case of a supernova we expect that number to be small; in the case of other objects like CVs, we expect that number to be much larger.

And then of course the significance threshold: we will want to vary it and see how best that comes out.
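A sketch of my reading of that recipe; the merging window (here a gap in days) and the exact baseline conventions are my assumptions, and would be among the things one varies along with the significance threshold.

```python
import numpy as np

def peak_statistic(t, mag, n_sigma=2.0, gap_days=30.0):
    """Prior-outburst / peak statistic: baseline from the 80 percent faintest
    points, flag epochs at least n_sigma brighter than that baseline, merge
    flagged epochs closer than gap_days into peaks, and return the time span
    from first peak to last peak divided by the full lightcurve length."""
    t = np.asarray(t, float); mag = np.asarray(mag, float)
    order = np.argsort(mag)[::-1]                     # faintest (largest mag) first
    faint = mag[order[: int(0.8 * len(mag))]]
    med, sig = np.median(faint), np.std(faint) + 1e-9
    bright = np.sort(t[mag < med - n_sigma * sig])    # brighter than the baseline
    if len(bright) == 0:
        return 0.0, 0
    peaks = [[bright[0]]]
    for ti in bright[1:]:
        if ti - peaks[-1][-1] < gap_days:
            peaks[-1].append(ti)                      # same peak
        else:
            peaks.append([ti])                        # new peak
    span = peaks[-1][-1] - peaks[0][0]                # first peak to last peak
    return span / (t.max() - t.min()), len(peaks)
```

With this kind of definition a CV with many historical outbursts gives a ratio near one, while a single supernova outburst gives a ratio near zero, which is the behavior the boxplots below are meant to show.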

So this is a set of boxplots for different kinds of transients. For CVs, this is the ratio of the length of the lightcurve spanned by the peaks to the entire lightcurve, and the median is around 0.75, whereas for supernovae it's around 0.1 something. So there's a drastic difference there.

Of course there are some individual points that go all the way down here, and for CVs there are certain points here. And that will get determined by the significance level that you use for the cutoff. One can experiment with that and do better, or with the connectivity and so on, one can do better than that.

And there's a whole variety here. For instance, the only difference between what we call blazars and AGN is whether there is a radio source nearby; otherwise they are similar. And you can see that their boxplots are similar. So that's a heartening thing to see.

Here is what the N-sigma looks like. For each of those classes, if one asks how many sigma above the median one can find points at all, then this shows that. For CVs the median is around 3.75, and so it is for supernovae. But there are classes where it can go fairly high, and for a few classes, like the Mira variables, it tends to be very low. So you can use a combination of these two parameters to define your statistic and improve the classification that you are going to get.

So using just these three parameters -- the normalized minimum galaxy distance, the minimum star distance, and the prior outburst statistic as I just described it -- we get 80 to 90 percent completeness. And note that not all three parameters were necessarily present for each object, so even this data set was not complete; there were missing values in there. And still we get 80 to 90 percent completeness.

And then we also tried it with the three parameters complete, that is, only taking the rows where all three parameters were present, and there is a slight improvement there. And when we tried using only two parameters complete, it of course went down, indicating that the third parameter, even when it was not present in all cases, did help us improve things.

So this is using the three parameters, and this is just for supernova versus non-supernova. This is the type of thing that we would like to use for multiple sets of parameters; these are just the CDFs shown for that particular set. We used about 900 non-supernovae and 600 supernovae; class one here is the non-supernovae, class two is the supernovae. So consider the minimum galaxy distance; on the X axis you have distance in arcseconds. Clearly, as expected, for non-supernovae the nearest galaxy can be farther off, but for supernovae it tends to be much closer to where your transient is.

And the opposite is true for the minimum star distance. For non-supernovae the nearest star distance seems to be peaked at zero, whereas when it is a supernova, the nearest star is generally farther out, and there is actually a large spread there. So that seems all consistent.

And so we are now coming up with a tool just for visual verification of these kinds of things. Because when one is trying to put semantics or expert knowledge on this, it's important to be able to verify some of these things. So early on that's going to be fairly useful.

Here are a couple of diagrams based on just this classifier -- normalized distance -- showing how the supernovae separate from non-supernovae. Here is another way to look at it, where what you have used is the normalized distance to the nearest galaxy. So they seem to separate reasonably -- not completely, but reasonably well. And by adding a couple of parameters, this can be improved further.

So of course the larger picture is that all these small bits are going to go back into something like a portfolio for a transient. It's going to have a growing list of parameters that can be used as input to more and more classifiers, and as more become available they go right back in.

And much of the additional follow-up that happens from various telescopes, including the Gaia network, again goes back into those bits and can get incorporated into the Bayesian networks as well as the other networks.

This is just a schematic showing how this black box incorporates all the classifiers that we have been talking about, and whatever probabilities we get, the ones that are confirmed one way or the other can get fed back into the priors through, say, active learning and so on.

Another related thing that we have been talking about is using a similar Bayesian network to determine which follow-up instrument, which telescope, to use when there are multiple ones that are possible. And that can again be determined in a similarly Bayesian way by looking at the different classes that can be observed. What this shows is the different classes.

So the classification is ambiguous to start with. If you do the observation with this telescope, then it crystallizes into a single class, whereas if you do it through this other telescope, it still remains ambiguous. So if there is such a decision to be made, then clearly observing it with the first telescope will be useful. So that's what one can do.
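One way to formalize that choice -- my own sketch, not necessarily the scheme used here -- is to pick the instrument whose observation is expected to shrink the posterior class ambiguity the most, i.e. minimize the expected posterior entropy given each instrument's outcome likelihoods.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def expected_posterior_entropy(prior, likelihood):
    """prior: p(class), shape (C,).  likelihood: p(outcome | class) for one
    instrument, shape (O, C).  Returns the outcome-averaged entropy of the
    Bayesian posterior, i.e. how ambiguous the classification is expected
    to remain after observing with that instrument."""
    prior = np.asarray(prior, float)
    like = np.asarray(likelihood, float)
    p_outcome = like @ prior                       # p(o) = sum_c p(o|c) p(c)
    h = 0.0
    for o in range(like.shape[0]):
        if p_outcome[o] == 0:
            continue
        post = like[o] * prior / p_outcome[o]      # Bayes' rule
        h += p_outcome[o] * entropy(post)
    return h

# hypothetical usage: choose the telescope with the lowest expected entropy
# best = min(telescopes, key=lambda name: expected_posterior_entropy(prior, likes[name]))
```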

But not too much has been done on this yet. The reason why I'm showing this is that if you remember the original Bayesian network, the class sat at the top, as if everything was feeding back into class. But if we reverse the question and ask ourselves: okay, I don't know the class of the object, but knowing that only one percent of transients are interesting, and those interesting transients are either one of these classes or something that I do not know, where in the Bayesian network should I be getting more information? So, again, that is how this particular diagram can get linked to the Bayesian network for classification.

So I think I'm going to stop with the summary and see if I can answer any questions. Essentially we have been moving towards bunching the sets of inputs that we have to build classifiers of different types, and we hope to go further. Some of the semantics tools that could be useful are perhaps topic maps, or anything that deals with triplets like Matthew mentioned before, because those are the ones that will allow us to say what the different dependencies are. The dependencies are the edges in Bayesian networks. So if we know dependencies, then we can create edges in the Bayesian network and we can define a more useful structure. And your concrete ideas will be welcome. Thank you.

[applause].

>>: We have time for a few questions for Ashish, if anyone has any questions for Ashish? Eric.

>>: You just mentioned that -- in my research, knowledge organizing, I've recently been using [inaudible] been using Bayesian classifiers, not really networks, and they work fine. My experience is that you need to know a great deal about the prior distributions for every class [inaudible]. But if you have that [inaudible] so it's just as easy as you said. And it works, in our opinion.

My problem -- I wonder if you have an answer; it's probably true for any classifier. When you're finally at the end and you have to decide between, say, class two and class five, you need some sort of a guillotine-type criterion to just make a decision. And I was wondering, do you have any way of making [inaudible] that guillotine rather than arbitrarily, which is what we do?

>> Ashish Mahabal: No, I think you're right. First of all, for the first part of your comment: naive networks work well, but given that we know a little bit more than what naive networks can use, the hope is that we can do better than those. But I've looked at the literature and there are not enough indications that there are good methods to do that. That is why I thought that maybe this piecemeal thing will help, where we know a little bit, we implant that, and see whether it improves things.

>>: It's more the final decision part rather than the computation part?

>> Ashish Mahabal: Yes. So I don't think we have anything specific there. The guillotine will be just -- so with other things we have been working on something like if it reaches 90 percent then yes, something like if it reaches --

>>: [inaudible].

>> Ashish Mahabal: Yes. So 18 percent, not just that, 51 percent and so on. But what's important is to be able to also leave it there so that when something new comes in how that improves. But if you are to make a decision now, then yes [inaudible].

>>: So building on that, what is wrong with just keeping the probabilities there and having probabilities --

>>: [inaudible] decision to live --

>> Ashish Mahabal: Yeah.

>>: [inaudible] either in your case go for telescope or not [inaudible] in my case to start doing some [inaudible] I'm tired of doing statistics [inaudible].

>>: But I think what we will see, especially as we move into the LSST era, is that dealing with probabilistic classifications -- this object is 80 percent [inaudible], 20 percent W -- will be much more familiar to us. And then that sort of fuzziness, if you want to call it that, becomes more -- well, it will become more natural, and we will develop the techniques and move forward in that way.

>>: Sooner or later -- I mean, a lot of what we're doing is scientific infrastructure so that people can say these are variables, [inaudible] stars, supernovae of Type I or whatever it is, and actually get a sample for study. And sooner or later decisions have to be made. And I don't think there's any way around it.

What we actually did was, in addition to having training sets for the different facets, we ended up having a training set of sort of a [inaudible] that had been studied for 50 years. We basically tuned our final decision levels to work on that -- to train on [inaudible] global trends. And we just hope that it works on the other [inaudible].

>>: [inaudible] a question. You mentioned [inaudible] and that very naturally keys into stuff that people [inaudible]. But the question I wanted to ask was earlier on you showed a problem with the TAN approach generating spurious links.

>> Ashish Mahabal: Right.

>>: Are those just spurious, just --

>> Ashish Mahabal: No, not on this.

>>: [inaudible].

>> Ashish Mahabal: In that particular case, because we knew that those should not be connected and the network connected them, we knew that it was spurious. But there may be cases or times when we don't know whether things are spurious, because within the network terminology there are various things like moralizing a graph. So if you have inputs to a particular node coming from two different ones, and you go from a directed network to an undirected network, then among the connections that it makes, some of those may be the ones that get tried first by the tree augmented network.

So there's no way for me to know about every connection, whether how good that is and so on.

>>: [inaudible] technical question [inaudible]. In that case [inaudible] spurious, is it statistically spurious or is it [inaudible] thing you ought to worry about?

>> Ashish Mahabal: So I don't think it was statistically spurious. It was part of a larger network, so I didn't go through trying to cut out each of those pairs and see if that is the one that improved things a lot or made them worse. But we know from the meaning that those should not be connected, so only in that sense do I know it's not correct. But statistically it may have been relevant.

But we shouldn't still go after those.

>>: Okay.

>> Ashish Mahabal: One thing that I did not go into are issues related to continuous variables and discrete variables. So right now we have been discretizing everything.

And binning is a huge issue. There are some methods that are said to be optimally binned, but when we look at those bins, they don't really seem to be optimal. And I don't have a solution for that either. But that's something more that needs to be considered.
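For context, a small sketch of two off-the-shelf discretization choices one might compare when feeding continuous parameters into a multinomial network node; which rules were actually tried in the work above is not stated, so this is only an illustration of the kind of "optimal" binning being discussed.

```python
import numpy as np

x = np.random.default_rng(0).lognormal(size=500)     # a skewed, made-up variable

fd_edges = np.histogram_bin_edges(x, bins='fd')       # Freedman-Diaconis rule
eq_edges = np.quantile(x, np.linspace(0, 1, 11))      # equal-frequency, 10 bins

# integer codes for use as discrete states of a Bayesian network node
fd_codes = np.digitize(x, fd_edges[1:-1])
eq_codes = np.digitize(x, eq_edges[1:-1])
print(len(fd_edges) - 1, np.bincount(eq_codes))
```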

>>: So any more questions for Ashish?

>>: Just a comment. Sometimes I make crazy statements in my lectures, like saying that the data becomes the model of a complex system. And the simplest example I can give is the human genome: that once we fully map the genome we will know all the links in that network, so to speak, and what things link to other things, and what things express themselves in certain cells or lead to certain diseases or certain susceptibilities.

And so what we're aiming for, whether it's astronomy or the world financial system or the genomic sciences, is a complete data set -- not the Swiss cheese model as you showed it, where we have the gaps -- but a sufficient knowledge or data network that we can then build the knowledge network from, and actually be able to answer pretty much any question. Like, you know, are starbursts related to mergers, are AGN related to triggered collision events in pairs of galaxies, or whatever, so that the knowledge is embedded in the data. The data becomes the model that answers your questions, not the subjective model where you parameterize the universe; you actually have it encoded in all of the data that you have.

And so one of the definitions I like to give my students for data mining is: data mining is what transforms knowledge, which we're seeking, from the data representation to a rule representation.

>> Ashish Mahabal: Yeah, but --

>>: And so finding.

>> Ashish Mahabal: You have --

>>: Finding those connections is what maybe can help us.

>> Ashish Mahabal: Right. If you have data sets that are too huge, sometimes just throwing a lot of computation at them is not going to be good enough, so mining the model may not be easy. And that is where trying to see some of those connections may be useful.

And you can always do a subset of data and then use that as model.

>>: [inaudible] a complete [inaudible].

>> Ashish Mahabal: Correct.

>>: [inaudible] ask any questions and get an answer. But the big data is getting us closer to having a complete network of knowledge.

>> Ashish Mahabal: Well, something from which a complete network could in principle be --

>>: Right, in principle.

>> Ashish Mahabal: Right.

>>: So I guess the general theme here is for the three of us to discuss or answer questions about incorporating domain knowledge into astronomy, into astroinformatics. And I think, as I said when I was going through the "why" part of my talk, there are essentially three separate areas where we look to include domain knowledge.

We look to include it for doing sort of smart application stuff, which is probably more about data access. We look to do it for the actual data mining, to see if we can do something new and interesting that way. And then the third one, I guess, would be sort of data management, for doing consistency checking; in that way it's more about making sure that we know what we have and what we understand.

And as Norman said, we're fairly well advanced on a lot of the things now to start making some very good inroads into those. I think what I would be interested in, as an initial starting point, is hearing what potentially we could foresee having done in a year's time, based on the technologies that are there now, if we had, let's say, decent funding for a year or a year's length of project. So what could we achieve in a year based on what we have now?

>>: Well, I think that -- last week I was visiting ADS to talk about this unified astronomical [inaudible], the Thesaurus. And the sort of size or pot of money we should be looking for from somebody is about an FTE for a year, or half an FTE for a year. And I think with that you could take things up quite naturally, produce a fairly neat product Thesaurus and a fairly plausible management process for that. So that's just the order of money required for a tidying-up process there, for a productizing process.

>>: So that would be a Thesaurus of --

>>: A Thesaurus which is intended to be of use to -- well, the partners in that are two journal publishers, AIP and IOP, and the AAS, which I think is the owner of it, and ADS as another user of it. So the sort of application they want to use it for, which I think is really what you're asking, is using it as a skeleton, a backbone on which to add lightweight bits of extra navigation, more sophisticated navigation, which would help them create good applications, but also demonstrate that this is the Thesaurus folks will be using and this is what you can [inaudible] these applications from. So the things you can imagine, which people wouldn't think of until there was this Thesaurus waiting to be used, are things like, for example, embedding these structured tags in articles, paragraph by paragraph: this paragraph is about this, this paragraph is about that.

I'm not sure who you get or whose arm you twist to do that. Perhaps the author could be persuaded to do that if there was something in it for them in terms of the visibility of their article.

But these are longer term things which will just happen, I believe. As people have an itch they want to scratch and realize that these people are using that Thesaurus, I'll use that one.

>>: So this would give them I want to mark up my body of knowledge, I want to -- I want to mark up my -- I want to semantically tag my X?

>>: Yes.

>>: This is the Thesaurus that is community approved and you should use it?

>>: Yes. One I should be using, or should make sure to tie into. Because one of the most widely used Thesauri in this area of technology is the Dublin Core metadata set. The Dublin Core metadata set is a set of originally I think 13 terms -- now it's about 30 -- for things like author, title, publication date and so on.

And it's very simple, brutally simple. But it's used all over the web [inaudible] visibly, because whatever Thesaurus or ontology people use for their storage, they make sure to say: this bit's like Dublin Core author, and this bit's like Dublin Core title. So they can do all the precise work they need using their own Thesaurus or some other Thesaurus, but still be linked into the rest of the world, because of the explicit link between their private stuff and the rest of the world.

And so the way that the unified astronomy Thesaurus might develop is, if people say we need more in this part of the tree, we need more here, this part needs reorganizing, it might be changed that way. You could also manage it as being a substrate or something.

I understand how you -- you do build trees on substrates, yes. The trees in the -- forget that. It's a framework, a [inaudible] skeleton in which other things might well find root and spread the threads of astronomy. I could go on.

>>: Sebastien, same question.

>>: What was the question?

>>: If you had a year's funding --

>>: Oh, yeah. I wish we had a year's funding. We made an application for funding for this smart portal that was rejected. So we're still trying to do it, but slower. If I had some funding, I would hire someone to work on this model.

I think something that we are missing is bridging the gap between, you know, data discovery or resource discovery and actually querying these resources. Take a small example. Even if you are able to locate the right catalog, or the image survey which is relevant, do you expect the user to know all about the protocols to query these resources? I'm not sure we already have a mechanism so that all the ugly mechanics get hidden from the user, all the parameters you can tweak to make the queries in the data services. Well, I don't think people will want to learn [inaudible] of doing [inaudible] simple spectrum access or -- to make a parallel with the SDSS, I think SDSS is a great success.

But the first thing they showed was probably not the SQL interface. People would not go and make queries. You need -- you know, you need to go there step by step, have simple template, queries, and then you can understand how it's done and then move on to something more complex.

>>: So you'd like to see some sort of smart data access?

>>: Yeah, probably. Yeah. And something to explain to astronomers: okay, what is this parameter? Even currently, you know, registering something in the registry -- I know there are so many thoughts on the VO side about making the form to describe a resource more intuitive, because nobody will go and read the full VO resource specification just to register one resource in the VO. So --

>>: [inaudible].

>>: Yeah. So the web form has to be intuitive. And when you see "subject", colon, text field -- what do I put in there? Subject. So you should have examples or suggestions, or a mechanism which uses semantics to guide the astronomer, or even the PI of a small project, to say, oh, yeah, that's what they expect: the output list of keywords to derive from the thesaurus, things like that.

I think this is where semantics can help.

>>: Following up on that, I think that [inaudible] the UAT work was morally funded, in the sense that we put in a bid for about an FTE to [inaudible] in the UK. And I'm not sure how many [inaudible] but we came third -- we were ranked third, with four to be funded.

And then at the last moment the funder had a little bit of a crisis, raided the piggy bank and halved the pot, which was very aggravating.

So I regard that as a win, an unfunded win.

>>: A Scottish win?

>>: Yes. But in terms of things I would like to do -- there are things that could be done, but I don't want to do them, in a sense. That long scenario that I described, with Matilda drifting from place to place, that isn't one project. That's a lot of different people with different tools just saying, oh, allow this little extra bit of functionality. And the fact that it's a small extra bit of functionality is the point. So this Thesaurus is a big lever for other applications to start doing things together. So there isn't a grand plan here. I think there ought to be a grand plan.

>>: So I think the thing I would like to see is semantic data mining used far more prevalently, as it is by [inaudible]. And there's a particular technique they use called pathway analysis. So you mark up your data using semantic tags, and then you do a big inference on it. And you suddenly discover that such and such a -- you know, I'm trying to think of what the example is. They discovered that there's a particular gene sequence which codes for zinc expression in some sort of sleeping sickness virus, which they hadn't previously realized.

The information was actually all out there in different sections, but it was only when you put all the information knowledge together and did the logical inference on it that, oh, that's useful and can affect blah, blah, blah.

It would be interesting to try to do the same exercise for a finite data set in astronomy. You know, pick something like the NGC catalog and try to tag it up with as much information about it as possible, across all the wavelengths and all of this, and then do an inference on that and see what comes up.

Because that will also give us some idea of what it is that we are missing in terms of knowledge or information or stuff that we would want to be encoding, to be able to include it into the sort of stuff that I was showing and some of the stuff that Ashish was showing, to be able to have these much smarter systems -- for this is how you get the expert knowledge in, in terms of data mining.

>>: And I think that that is -- all of that, plus [inaudible] the Semantic Web is a block of reasoning technologies which are [inaudible], but the Semantic Web bit of it is that this is infiltrated through all these technologies with the idea that they are webby. They are the web -- the web is in their genes. So it's all of the knowledge, taking things from people who aren't working together a priori but who are articulating things in a way which is shareable. It can be mashed up together.

So the idea of a triple store -- one of the things about an RDF triple store is that you just throw everything into it. You can't do that with an RDBMS because the schemas don't match. With a triple store, [inaudible] some of it might be useless, but you can run reasoners over it and program those reasoners with ontologies. So it doesn't have to be limited to one -- you start off with one data set, but you can find things that no one would find if they won't --

>>: I mean, it's more the idea of having a single data set because that's just a manageable --

>>: We understand the provenance of the original data as opposed to all the other stuff that's attached to it.

[inaudible] had a comment or a question.

>>: Yeah. I wanted to ask something that I already forgot but now I want to ask something different.

So how hard is it to connect a simbad into the linked data?

>> Ashish Mahabal: It's not -- simbad is not that big. It's about four million individual objects, with at most a few tens of measurements for each object. And so it's a manageable triple store.

>>: A hundred million triples. That's small.

>>: So that's --

>>: You can do it automatically, right? The only thing you need is to build a model and connect it, right?

>>: Yeah.

>>: That shouldn't be that hard, right?

>> Ashish Mahabal: No.

>>: It's not a 1 FTE project.

>>: [inaudible].

>> Ashish Mahabal: No one made some experiments with -- it was not all simbad but a fraction of --

>>: I think that [inaudible] I last saw this thing running about six months ago at least, so

I don't -- keep on talking.

>>: Does anyone else have --

>>: [inaudible] that you can achieve in order to get astronomical cal into the --

>> Ashish Mahabal: Maybe. I know -- I read once about the [inaudible] if it's too small, if you make it too small, it won't work. There's not enough information in there --

>>: Yeah, we won't get connected or --

>> Ashish Mahabal: You can't make it work. If it's -- if it's too large, might fail. So --

>>: Okay.

>>: [inaudible].

>>: So you're talking about creating a triple store from the [inaudible]. A colleague of mine, Ed [inaudible], two years ago came up with this weird idea called the astronomy brain. And what the astronomy brain was was basically a collection of, you know, triples, simple statements of astronomical knowledge, and he was going to create this, harvest this type of information from [inaudible]. Specifically, what he said was that most research papers are written such that the introduction of the paper and the conclusion of the paper are maybe factoid-type statements.

We know that quasars have black holes at the center. We've discovered that black holes give feedback on star formation [inaudible]. In between the introduction and conclusions of the paper are all of the data and analysis and stuff like this. He said, let's just do text mining on the introduction and summary sections of [inaudible] and just harvest these triples, these simple statements -- which first set the stage for the people's research which comes in the middle, and then the conclusion section where they state statements of fact about what they discovered about the classes of astronomical objects -- and build that RDF knowledge base for astronomy, or at least attempt to start building one just from that source.

>>: So how far did he get with it?

>>: Not as far as to write a [inaudible]; not as far as to submit to NSF. [laughter].

>> Ashish Mahabal: I suspect it went the usual way of --

>>: [inaudible] all the greatest proposals.

>>: All the greatest proposals, yes.

>>: [inaudible].

>>: No, but that -- I mean -- again, that sort of exercise is something that you could imagine taking on: you know, do a search for papers on white dwarfs, say, and there will be a finite amount of information. And you do the mining and put it together, and then that gives you a starting data set to play around with.

>>: And the point is [inaudible] to ignore the analysis and data set [inaudible].

>>: Yeah.

>>: Go over these sections where these sort of general knowledge statements are usually made in research papers.

>>: Even so, this is an enormously grandiose project. I mean, we don't like it happening but --

>>: So you have funding in Australia to do this [laughter].

>>: But what I was going to say is, I mean, I have probably things like that and it's nice to think big, it's nice to think long term. But actually you ought to be able to do stuff now and --

>>: [inaudible] suggestion pick one specific --

>>: Yeah. That's right. And do it twice. Do a little bit, you know, go a bit further.

>>: Yeah.

>>: I mean, it's actually very close to where we started with the ATels, you know. We took 2,000 ATels and asked, with just natural language processing, what can we extract from them. And the student who did that two years ago now works for Microsoft [inaudible].

>>: [inaudible].

>>: Yeah. But it's -- it's the -- that sort of thing might be manageable. Maybe best my student -- summer student next year. [laughter].

>>: And I've impressed myself.

>>: This is not a service. This is not endorsed by simbad. This is a -- an experiment.

So the point here is that you can -- [inaudible] service then the claim would be that -- well, you can look up -- can you read that at all? Okay. I know what. Can I make this bigger? Okay. Oh, yes.

>>: You might make the frame bigger and then --

>>: That's a good --

>>: [inaudible] nobody can read it.

>>: True.

>>: There's a little green thing that --

>>: Oh, yeah. I forgot that. Okay.

>>: If you -- the assertion is that this URL here HTTPWWW -- HTTP colon, slash, slash with URL [inaudible] slash ID slash and a number, and a peak number, if you retrieve that in -- by dereferencing that URL [inaudible] to another page.

On the other page you get HTML back. And I can show you that, because I can link to there. And that in turn redirects to an HTML page that shows you stuff -- it's not very prettily formatted, and it's not perhaps the most exciting thing you could get back from simbad, but it's an HTML page.

If, however, you retrieve this URL and say -- and say when you're retrieving it, no, I want RDF back -- you say what format of data you want to come back, and what comes back is not HTML but RDF. And I can show you that. Accept: text/turtle. There. And yes, what comes back is a pointer to --

>>: You need a space.

>>: [inaudible].

>>: You need a space after the H.

>>: And the data that comes back is RDF. The same URL retrieves HTML for humans and RDF for computers. And the point is that if I go forward -- where is the pointer? There it is. These links are links to [inaudible] what type of object this is. There are links to citations of this object.

So it's basically the sort of information that you would find if you went to simbad.
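What is being demonstrated is standard HTTP content negotiation over a linked-data identifier. A minimal sketch of the client side, using a placeholder URL since the real endpoint shown in the demo is not legible in the transcript, and with the RDF serialization (Turtle here) also assumed:

```python
import requests

# Hypothetical identifier URL of the kind demonstrated; the real endpoint differs.
url = "http://example.org/simbad-ld/id/12345"

html = requests.get(url, headers={"Accept": "text/html"})    # page for humans
rdf = requests.get(url, headers={"Accept": "text/turtle"})   # triples for machines

print(html.headers.get("Content-Type"))
print(rdf.headers.get("Content-Type"))
print(rdf.text[:500])   # RDF description: object type, identifiers, citations
```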

>>: So you have [inaudible] for everything?

>>: Yes. This would generate -- effectively generate URI for every object in simbad.

>>: [inaudible].

>>: Yes. So it would be possible to regard -- oh, yes -- the URL up there as the name for the string M31. That's the URL which names that galaxy. And if you use that anywhere in any application, it would unambiguously refer to that.

>>: Thus providing an end point for that?

>>: What --

>>: An end point?

>>: This is the end point.

>>: Okay. So you can connect it straight to the --

>>: Yes.

>>: Behind the scenes this makes a simple query to simbad. So I store nothing here.

It's just a layer on top. But this could plug straight into anything else in the [inaudible].

>>: So it's done, right?

>>: Yes. Yes. Essentially.

>>: Is there anything missing there?

>>: Yes, the information that comes back is just the identifier, the string M31, the types -- the simbad types that attach to this object -- and the references. And there's more -- there's much more in simbad --

>>: [inaudible] all the measurements and so on.

>>: Yes, it's not the full content of simbad.

>>: So this shows that you can do this. It's far from complete. But you could start -- if this were a different URL, and if it were blessed by CDS, then you could start using that URL at the top there as the name. And later on you could just add more information as it comes by, because nothing constrains what comes back. There's no contract about what this gives back. It just gives back some stuff about that object. Just like [inaudible] with the object.

>>: But the advantage of this is then, insofar as the story about Matilda goes, you know, the Wikipedia stuff could link to this, and then you've got the link from Wikipedia to astronomical data -- to astronomical information, or an astronomically sanctioned information domain and knowledge --

>>: And if you have another server that knows what AGN means in this context then it can go off and find links to other -- other sources.

And the -- go back. Yeah. All right. The other point was that -- I forgot where I was.

>>: Yeah. [inaudible].

>>: Okay. So any final -- final --

>>: Going back to the portal discussion, I think that Sebastien's [inaudible] semantic search engine [inaudible] reminded me of a project I started to work on a dozen years ago, back in [inaudible] discussions with some [inaudible] who were trying to do what we didn't call, but should have called, semantic search. And the use case was the following. They said [inaudible]: what was the temperature in Indonesia yesterday, which is a very natural question [inaudible]. That question can be parsed to determine that the person is looking for a physical parameter, temperature, so that it has something in the back, like a UCD, that says here's the parameter I'm searching for; it would then query a registry and find all databases that have [inaudible] temperature for Singapore. And it recognizes Singapore as a geographical location, so it uses the equivalent of the name resolver that we have in astronomy, a geographical name server, GNS, that can resolve that into a specific location on earth. And then the last part of it, the temperature in Singapore yesterday: the semantic part of it recognizes yesterday as a time variable whose value is equal to today's date minus one.

And then you can essentially formulate a query, knowing what parameters you want, what geographic regions, what time ranges; the registry tells you what databases are available to answer such a query, and it is formed independent of the user seeing it [inaudible]. The answers return much [inaudible] just by a few terms of what you're looking for, [inaudible] understanding the semantic meaning of [inaudible] like yesterday being today's date minus one.

So it's a similar scenario. It probably might be like find me two quasar with galaxy radii in multiple galaxies. And it can understand that set of words to find catalogs in simbad that can answer that question. Is that somewhere close to where you're going?

>> Ashish Mahabal: Yeah. Maybe not the last question about the quasars within -- you know, if the query becomes very complex, it becomes more and more difficult to interpret exactly what the person wants. But I think the important part is to, yeah, point the astronomer to places, services, or datasets which are relevant to the query. Then even if the detailed query mechanism has to be tweaked by astronomers, that's okay, I think; astronomers can still go and make the detailed query.

But point them in the right direction. Say, okay, there are 10,000 catalogs in [inaudible]. If you look for this and this and this, maybe you get 20 catalogs which are really relevant and will answer your query. And do this digging.

>>: [inaudible] one of the earlier reasons for the UCD was to allow [inaudible] searching for catalogs that had [inaudible] and galaxies, no matter what they called that table --

>> Ashish Mahabal: Yeah, yeah, yeah.

>>: They understand the concept of [inaudible] in which --

>> Ashish Mahabal: You can -- you can go to the point for the most simple queries where you actually provide the results. But if I look at what astronomers do, if you provide really the results, you must tell exactly what you've done to get this result, otherwise they won't trust you. They don't trust you. So you must -- the very important part in the smart portal is to say oh, this is what you asked, this is what I understood, this is how I translated it, this got translated and this is the result I got when query this data set.

If you are missing the middle parts, the astronomer will probably say, wow, how was this done, or, I don't trust this.
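One way to picture such a result-with-provenance is a small record that carries the interpretation trail alongside the answer. A minimal sketch in Python, with every field name and value hypothetical (the query, catalog, and numbers are made up for illustration):

    from dataclasses import dataclass, field

    @dataclass
    class PortalAnswer:
        """A query result that carries its own interpretation trail."""
        asked: str                 # the question as the astronomer typed it
        understood: str            # the portal's interpretation, in plain words
        translated_query: str      # e.g. the ADQL/SQL that was actually sent
        dataset: str               # which catalog or service was queried
        result: object             # the value(s) returned
        caveats: list = field(default_factory=list)  # anything the user should double-check

    answer = PortalAnswer(
        asked="How bright is the nearest star to this transient?",
        understood="magnitude of the closest catalogued star within 5 arcsec",
        translated_query="SELECT TOP 1 mag FROM stars WHERE dist < 5 ORDER BY dist",
        dataset="hypothetical_star_catalog",
        result=17.2,
        caveats=["'nearest' limited to a 5 arcsec search radius"],
    )
    print(answer.understood, "->", answer.result)

The design choice is simply that the intermediate steps travel with the answer, so the astronomer can see what was understood and translated before deciding whether to trust the result.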

>>: But provenance is -- I mean, there's a large chunk of Semantic Web technology all about trust and provenance, exactly for this reason: you know, who's making this assertion? I might not want to know, but I like the capability to be able to find out, because it could be relevant for something I'm going to do based on a particular assertion of fact or statement of fact.

I think last --

>>: [inaudible] following up with [inaudible] what was the temperature yesterday in Indonesia. It says that the interpretation is: the temperature is the temperature for Indonesia, at the center of Indonesia; yesterday is yesterday. The result is 66 to 93 Fahrenheit according to the weather station.

>>: [inaudible] for Rhodesia.

>>: And I think that the problem with modern technologies [inaudible] this problem, in the sense that the problem of how you turn what the user typed into something formal is a hard problem. And there are people who work very hard on that. But it's quite a separate concern what you do with it once you've done that.

So getting from the world of strings into the world of logic is hard by itself. Once you're there, a completely different set of technologies allows you to go off and do certain things.

>>: What we're hearing is from a [inaudible] do you want to know the answer or do you want to see the catalog that will enable you to find it?

>> Ashish Mahabal: Yeah, exactly, so --

>>: The latter is the [inaudible] rather than see the source material [inaudible].

>>: What contains the red chips and --

>>: Well, I think that's the key. My sense is that the more knowledge we build in the background, so that we know we can answer a variety of questions in the background, the sooner we can start to make the GUIs sensible and intuitive. It's great to talk about typing in that question. But where exactly are we going to mine the answer?

>>: Exactly. So it's marking up the smartness: this is how you interpret that, this is the information that contains it, or this would contain that body of information, or this is where you would do it. Because I can phrase the question a hundred different ways, but there's one answer. And the trick is in the semantic layer, which figures out what the meaning is.

>>: Sometimes there are multiple answers.

>>: That's true.

>>: One is the -- one is the magnitude of the [inaudible].

>>: Yeah.

>>: [inaudible].

>>: I think Sebastien already answered those [inaudible], so we just need more metadata in the registry so we know which resources to go to to answer these questions.

>>: Marked up in the right way.

>>: Of course.

>>: Okay.

>>: [inaudible].

>>: Thank you very much, everyone, for this afternoon. I believe it's now the final session of George and whoever else is up there.

>>: [inaudible].

>>: Well, all good things come to an end. Well, actually not the end. But thank you all for participating. I think it's been very lively, very unexpected, actually. And I'd like to use the opportunity to thank our hosts, Yan in particular, and also Robin and many others.

[applause].

>>: So, open floor for general discussion. But a reminder: what's coming out of this will be two or three white papers of sorts. Remember we asked for the 20 scenarios. And that is a really useful way of focusing attention on what has to be actually developed.

So some of you have contributed already; others, please do. And we will draft a white paper after that, not a long one. Everybody will have a chance to comment on or contribute to it. Carl will put it on astro-ph so that the general community is aware that all this is going on and what they might expect. And we'll probably do something similar tomorrow at a session connecting education -- computational science education -- with research. Essentially that boils down to: what kind of curriculum do we need to design for astroinformatics so we can train students for science in the 21st century?

And the wonderful, vigorous discussion on novel ways of scientific publishing hopefully will continue on the Wiki that Ray has set up. And hopefully something similar will come out of that, a position paper or a set of ideas or something like that. But let's see. And please continue the dialog and discussion on the Facebook page, because it's really easy to do.

So at this point I don't have anything else particular in mind, so we'll just open the floor for anybody who wants to express some closing thoughts, ideas, or questions.

>>: Yes. Definitely. So I guess I just want to bring out that this year it happened to be at Microsoft Research, therefore we had more computer scientists presenting than in the past two years. I want to hear from you whether or not those talks were useful, and what are the topics, from a computing perspective, that you [inaudible]; perhaps next year we can bring different people, or the same speakers on different topics. That will help me recruit our researchers in this building to join you next year.

>>: I just want to say that I'm going to use the epitome stuff. That guy here was fantastic.

>>: Wonderful. Great to hear that.

>>: Please.

>>: Actually I have a lot of relationships --

>>: A little louder, please.

>>: Excuse me?

>>: Louder.

>>: Okay. I have --

>>: So, as a PhD student -- I am in my second year, and in France I have many acquaintances who are astrophysicists -- I could say that visualization is very, very important, and that there is a big lack of visualization tools for astrophysicists in France and, I have discovered, here. And I was thinking about this today, and I spoke with Steve, the guy from Microsoft Research.

>>: [inaudible] yeah. Uh-huh.

>>: It's a pure computer science approach: how to go from models to visualizations. I would mention a small example, another example, which is called GMF, the Graphical Modeling Framework. It's a framework on Eclipse. The user makes his models -- his UML models -- and the platform generates a graphical editor for him.

And today I was thinking of such an API or such a plug-in to go from the models to the visualization. In other terms, the user -- an astrophysicist or a biology scientist -- designs his own tools: that means the data access, the data types, the shape of his figures. And then the platform generates his visuals for him.
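This is not GMF itself, but the idea of generating visuals from a user-written model can be sketched in a few lines. A minimal, hypothetical Python analogy (the figure_model dictionary, its keys, and the render function are all invented for illustration; it assumes matplotlib is available):

    import matplotlib.pyplot as plt

    # A hypothetical, user-written "model" of a figure: data access, data types, shape.
    figure_model = {
        "source": {"x": [0.1, 0.5, 1.2, 2.0], "y": [19.1, 18.7, 18.9, 19.4]},
        "kind": "scatter",                       # the shape of the figure
        "labels": {"x": "phase", "y": "magnitude"},
        "invert_y": True,                        # magnitudes are plotted with the axis flipped
    }

    def render(model):
        """Generate a visual from the declarative model instead of hand-written plotting code."""
        fig, ax = plt.subplots()
        data = model["source"]
        if model["kind"] == "scatter":
            ax.scatter(data["x"], data["y"])
        elif model["kind"] == "line":
            ax.plot(data["x"], data["y"])
        ax.set_xlabel(model["labels"]["x"])
        ax.set_ylabel(model["labels"]["y"])
        if model.get("invert_y"):
            ax.invert_yaxis()
        plt.show()

    render(figure_model)

The model is the only thing the scientist writes; the platform (here, the toy render function) turns it into the visual, which is the separation the speaker is describing.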

I don't know if the guy from Microsoft is willing to [inaudible].

>>: You missed [inaudible] presentation? Were you here when Curtis Wong presented?

>>: Actually I had a seven-hour flight delay, so I don't know.

>>: Yeah. So I was thinking, if you had seen that one, you would have seen a different way of presenting data, visualizing data.

>>: I'm not talking about different ways of presenting data, but customization of the [inaudible]; in other terms, the user designs his own visual.

I got the idea from another platform for text editors and graphical editors, and now I have the idea of going from modeling to visuals.

>>: Okay. I know we have a room full of highly opinionated people. So let's hear some.

[inaudible] [laughter].

>>: Yes, please.

>>: So these conferences are great for talking amongst ourselves and having great drinks after dinner and so forth. But don't you think that we ought to be going to our astronomy conferences, our topical conferences, and infiltrating in order to get the word out?

Because there are 40 or 50 people here who are soundly convinced of everything that we've been talking about. But there are 5,000 people in the US practicing what they call astronomical research, and there are maybe 50 of those who are using these techniques and technologies. So this is marketing at some level. Microsoft knows how to market things. But astronomers don't know how to market things very well.

So how can we approach that problem?

>>: Well, I think you're right. We need to do more of that. It's not like we've been sitting on our tushes, though -- the VO has been proselytizing at the AAS meetings for quite some time now.

But again, I think it basically boils down to having results to show. And so that's what we really need to get engaged in: have some smart young people [inaudible] generate some cool new science. That's when people start paying attention.

We can be telling them all we want how great this is going to be, but without real science results that come out of it... And in some sense astroinformatics was meant to facilitate that -- going beyond the narrow circle of experts who are engaged in the virtual observatory to engage a much broader part of the community. Some of them we get. We need to get many more.

>>: [inaudible] I take advantage of my position. I think that one additional thing which would be very useful is these position papers. If they can come out -- especially if they can start the discussion.

So for instance, we can leave the Wiki open after the publication to see what the reaction is and what the suggestions from the community are, so as to enlarge the community of people who are proposing new paths. I mean, this is just a suggestion.

>>: Well, part of my motivation for starting the Facebook page is to try to use social media and see if that will help.

>>: And then the 20 questions, or 20 user scenarios, are our way to try to guide it a little bit, so we don't go totally crazy with the content. If we can actually come up with, you know, 20 very focused scenarios that call for new ways of computing or new ways of applying computing technologies to astronomical research, I think that's a great success, and we can then disseminate our effort through those scenarios.

>>: About promulgating, and education and training: I agree. I think we need full semester-long curricula in astroinformatics and astrostatistics. There are a couple of textbooks now in astrostatistics, and I think that -- I actually don't know if the new book by [inaudible] and Alex and others is a textbook. I think it is. So it would be very useful to have a data mining textbook.

Another level is to condense it down to, let's say, a five-hour tutorial that can be presented on the Sundays before AAS meetings. So I could imagine an astrostatistics tutorial and an astroinformatics tutorial every six months, drawing, let's just imagine, 50 participants out of 1,000 -- this is only five percent, but if you keep on doing it and you really get five percent every time, it will add up a lot.

Now, my personal opinion is that those should be focused on something more -- it has to be attractive. And I think five hours of lectures on mathematics is not attractive to astronomers.

What I think would be attractive would be practical training on a well-defined package.

And an example would be a day or a week for data mining, where someone could say, okay --

>>: [inaudible].

>>: Which one?

>>: [inaudible].

>>: Hold on. I was going to say a day or a week for data mining, and R for statistics. But, in fact, R can do all of it, if you really want to. R is so large you can give endless tutorials; it just goes on. We give week-long tutorials and have hardly begun. So we also give four-hour versions. I just gave one in China that was four hours long. And so I think these have to be packaged well. I think Bob is right, we have to essentially learn how to market. There's statistics.com or something that has little tiny courses -- that markets statistics courses. And I suspect the computer scientists have them too.

Historically there was a group of geostatisticians from the 1960s and '70s, from France, who wrote a book and then went out and gave, I don't know, I'm just guessing, 1,000 tutorials on how to do [inaudible] and other techniques advantageous to the mining industry in geology. And the net result is that there's now a field of geostatistics that is 20 years ahead of ours. They went out into the world and, over and over again, gave training seminars.

So I think training and curriculum, short-term training and long-term [inaudible].

>>: I taught physics to geology students. The result was not very good. [laughter].

>>: Let me just mention that the NVO summer schools, I thought, were spectacularly successful, very useful. And, in fact, the book that Matthew and collaborators edited, summarizing those lectures, is the closest thing we have now to practical astroinformatics.

Although we're working to get more. So --

>>: Yeah. Another mechanism I think [inaudible] shared. So you can't just teach and then walk away. You need to give them homework, and this homework should have some motivation, some incentive. So what we did that was so successful in China is that after we trained the teachers from the different schools, we encouraged them to go back and run a competition using WorldWide Telescope to create some storytelling kind of thing. And then we gave awards for it.

And that way there is some follow-up. And the next year we continue on, and we get more people interested in learning. So that's how the momentum keeps going. We could perhaps, you know, learn from that success.

>>: [inaudible] telescope?

>>: It is.

>>: But WorldWide Telescope is, excuse me, sexy.

>>: I know. That's eye candy. So I use --

>>: We're not [inaudible].

>>: [inaudible] as I can be, being I teach something like old data. It's a particle, right?

It's very hard to communicate. But if you start with something attractive -- I agree with you, you have to have something that, you know, grabs people's attention, and then use it as a kind of springboard to the other things that you wanted to teach them. There's a method to it.

>>: Yes, I fully agree with Eric and the previous speaker that we need to encourage young, talented students, and I just want to tell you that I've been discussing this issue with Chris Smith, who is the head of [inaudible] in Chile. And so that's how we joined forces and convinced ourselves to start the astroinformatics initiative in the applied math center.

And so Guillermo has been an outstanding first PhD student and researcher in that program.

We are now in the second version of an astroinformatics course at the University of Chile, and we've drawn 15 students in both cases. It's just a start, but we are encouraged by the experience, and in this meeting we're getting many ideas of things that should be incorporated in that curriculum. We'll be very glad to share that experience. And furthermore, I'd like to make an invitation, because next year in August we are hoping to have our third [inaudible] on massive data in astronomy. That has also drawn a few people from the mining industry and the biosciences who are facing massive data.

And before that, we want to have a workshop with American students and Chilean students, so I hope that some of you can participate in that and that we can encourage students from all our different institutions to come. That's exciting because, visiting La Serena, you have the opportunity to see the site where the LSST is going to be built at the end of this decade, and also to visit the [inaudible] and have some fun. And following your suggestion, yes, we do need to have some marketing, and part of the marketing is having meetings where we have fun, where we enjoy our activities and our lives. So it's an open invitation to visit Chile in August of next year. I'll be posting more precise information later.

>>: Questions.

>>: [inaudible].

>>: Right.

>>: We're asking something which will be discussed tomorrow in the [inaudible].

>>: [inaudible] new modes of scientific publishing. Why don't we start, from next year, to have next-generation proceedings for the astroinformatics conference?

>>: Well, we sort of do. Even at the first one of these, in 2010, the proceedings consisted of webcasts in many different formats, plus PDFs of the slides, so you can watch the speakers, you can look at the slides, and they are there.

>>: It looks old-fashioned and not next-generation at all, because the webcasts are not tied in any way -- they are not searchable, and the PDFs are not searchable. So I would think of something really more new generation, next generation.

>>: Like --

>>: Like, for example --

>>: [inaudible].

>>: Like, for example, digitally stored texts that can be commented line by line, with social networking features, and live graphs and plots, and clips instead of the whole eight-hour webcasts -- clips attached to all the texts, with semantic annotations that would allow me to search across all the contributions for this year, the past year, and --

>>: Sure. Give me a hundred thousand dollars and I'll hire people to do it.

>>: Well, are you volunteering to --

>>: I'm sorry?

>>: Are you volunteering to edit text? This is a serious question. Or would you like to?

>>: To what?

[brief talking over].

>>: If my boss agrees. [laughter].

>>: [inaudible] for this.

>>: [inaudible] true believers [inaudible].

>>: Speaking of that, do we know if people ever come back and visit those?

>>: I haven't kept statistics but I know that some people mentioned to me --

>>: They do?

>>: I can't tell you how many. They're high-quality webcasts. I know people who [inaudible].

>>: Some webcasts we use for the courses in the [inaudible].

>>: So [inaudible] comment from the perspective of a PhD student. I'm finishing up this year. But just a few comments from me. First of all, you know, as a PhD student, most of the data I work with is proprietary, and so I can't share it with people. And that's not my choice. Whatever my [inaudible] might be, the survey I work on is proprietary and --

>>: You're working on the wrong survey. [laughter].

>>: And the other thing to do with that, or with me being a student, is that, you know, my goal is to graduate, and I have to do science for that, for the most part, and I'm on the fringes of this sort of astroinformatics. As a student, I'm working on science. And I don't have the time to, let's say, put together a huge package like you mentioned, of software and that sort of thing. I have a lot of my own software I've written, but I can't package it, I can't document it or really put it together into something that other people can easily use.

So I think having, you know, students or post-docs do this sort of thing is not going to be trivial, unless you pay them to do that specifically. And for post-docs that works; not so much for students.

>>: But are you ever concerned about whether you have come up with the best or most efficient software?

>>: Mine may not be the most efficient, but it is sufficient, in the sense that I know that it's working, and I also know that I don't have to go searching for some other software.

And this is the other thing I was actually going to raise is that general purpose software is great except for the fact that you have to cram your data format into that software and make it work with that software, which may or may not do everything you need to do.

>>: Right.

>>: And especially for large surveys, you know, putting it into a very general format is not always the easiest thing to do, for many reasons. So that's another problem.

>>: Well, that's just because some surveys don't use standard formats. But I think that situation is going to get a lot better. Now, when I was in grad school, almost everything I learned, I learned because I wanted to do research on some subject, so I had to learn about it and the tools and everything -- not because it was in the grad school curriculum. And it seems to me that one way to help that kind of learning is to present materials in an accessible and proper form online, and, in fact, [inaudible] I have myself been developing this online, modular, customizable astroinformatics curriculum. So the student just needs to learn about [inaudible], and they don't have to take the whole damn class, right? And have a link to the software that's easy to use.

>>: Other questions?

>>: Let me make a noncontroversial comment.

>>: Right.

>>: That's impossible.

>>: So everybody talks about LSST as the second coming, because that's when all the science gets to be done. I think this is dangerous, because there is a heck of a lot of exciting science and data going on right now, and I think all of the science that's proposed for LSST will be done before it sees first light.

Another aspect of this issue is that, the way the funding is going, [inaudible] and astroinformatics and God knows what else will be wholly owned subsidiaries of the LSST corporation. All of US astronomy might be, for all I know. This is not a healthy situation.

I'm not saying anything against LSST. I'm just saying, people, there is so much other good stuff here, now. Don't just wait for LSST. Let's get prepared for LSST.

>>: [inaudible].

>>: Well, that's going to be the next one. Third coming.

>>: The 25, 30, and 40 meter telescopes and, you know, the Webb telescope, et cetera, are all heavily competing with LSST for --

>>: [inaudible].

>>: [inaudible] everyone talks about LSST as the second coming. Not everyone talks about it as the second coming. There are lots of other things going on.

>>: Exactly.

>>: Like dark energy surveys starting right now, you know. I mean, [inaudible] is producing data soon. I mean, they're already getting commissioning data. [inaudible] there are lots of other things going on. I mean, LSST is a popular one to pick on just because it's widely hyped. But not everyone talks about it as the second coming. Speaking as a man whose job it is to get it funded.

>>: I might have exaggerated slightly. [laughter].

>>: Let me [inaudible].

>>: [inaudible]. [laughter].

>>: So [inaudible] discovery, and I [inaudible] which I didn't talk about, [inaudible] which is that about four people previously had discovered it without realizing it. So with hindsight, people saw it in their data. And so we've been talking about the great discoveries connected to that [inaudible] whatever. A lot of those discoveries are actually there in the data right now. If you had the right tools, you could actually find them.

So I agree with George -- hopefully LSST will make new discoveries. But some of those discoveries that will be made with LSST are actually in the data already, if only we knew how to look for them and --

>>: [inaudible].

>>: We know that [inaudible] if he'd paid more attention to his own results. We know all this stuff. [inaudible] people create their [inaudible].

>>: Okay. But nevertheless, if somebody here wants to get a Nobel Prize, they can look at the existing data with the right tools -- immersive visualization, whatever -- and they will find stuff that [inaudible].

>>: Well, we've [inaudible].

[brief talking over].

>>: [inaudible] exactly the same.

>>: I think there is another [inaudible] problem there, which is that a lot of astrophysicists who come in from physics don't know enough astronomy, and so they think they've discovered something that was actually well known.

>>: [inaudible].

>>: Don't even get me started. I wasted five years of my life trying to correct errors of particle physicists.

>>: Come work for us.

>>: Do I look like another generation -- [inaudible].

>>: I need company.

>>: In northern Virginia and see you.

>>: No, I think one of the things you said -- several people said -- is the key to this. I mean, you've got to demonstrate to the, you know, poor astronomer shuffling between his [inaudible] telescope, trying to pin down the parameters of RS [inaudible] stars or something, why this helps him.

If I go to an article that's got tags throughout it, it's got one benefit. If there's a term in there I don't recognize, there's a tag I could follow along and get that term explained to me.

If I read a piece of paper and I don't understand a word in it, I've got to go to a dictionary. But if I've got tagged text, it's there; it takes me [inaudible]. So that's an obvious thing you can say to people -- that that sort of semantic attachment is a benefit.

But what's it going to do for this guy trying to get better [inaudible] or whatever?

>>: I think people respond to somebody else's success, or they'll recognize they need to do something different. Like, people used to publish papers with a couple hundred RR Lyrae, right? We just submitted one with a couple tens of thousands of RR Lyrae. And so once you start doing that, when you get, you know, category-killer projects like Sloan, you'd better adapt. Right? What would it mean to go take a hundred redshifts of galaxies when there are a million over there in the archive, right?

So I think people will just be motivated by their own research needs. And they won't necessarily learn, but their students will.

>>: I just think demonstrating a benefit to them doesn't help as much as demonstrating [inaudible] by somebody who actually used the tools.

>>: Uh-huh. That. Yeah.

>>: [inaudible].

>>: Yeah.

>>: Other questions? Comments?

>>: We're fading out. So one last thing: [inaudible] I would like to thank Ray and Aaron and Tara Murphy -- who doesn't know yet that she has volunteered -- for agreeing to organize next year's version. So see you all down under. All right. Thank you.

[applause]
