>> : Okay, so thanks for coming back. We've swapped the order of Kirk Borne and Alex Gray, so now it's my great pleasure to introduce Alex Gray, who's going to tell us about scalable data mining. >> Alexander Gray: All right. Thanks, George, and the organizers for having me here. So my affiliation is a little different now. For those of you -- it's the usual suspects plus some others. For those of you who know me: I'm still at Georgia Tech, but spending most of my time now at Skytree, which I think of as a new company -- not that new in a sense; it's been developing product for three years -- which is scalable machine learning software. So a lot of what I'll talk about -- all of what I'll talk about and will ever talk about -- is implemented in Skytree or will be. And so that's what I've decided is the final delivery form for all the fast algorithms and machine learning methods that I cherry-pick the best of, from my research lab and the other labs: they will be in Skytree. So why is that interesting to you? Of course I have a special sweet spot for astronomy. It's how I grew up; it was the domain area that I worked in. I'm not an astronomer; I'm maybe the one guy in here who's not an astronomer. Who else would describe him or herself as a non-astronomer? Okay. All right. But I'm a computer scientist/machine learning person, or statistician, so I'm an oddball in the sense that, over the last 20 years, I'm probably the person who's been hanging around astronomers the longest among those I keep seeing today. And then within machine learning I'm an oddball because I'm the guy who's been worrying for a long time about big data. And of course now it's a hot topic; in the last two years a lot of other people have started to think maybe this is important. But for whatever reason, the software we have today to do this stuff is the same. It's the same crap that we've had for 20 years since I got started; it's the same general framework. You have a library written in some language. You do need many different tools because there's no one tool that does everything. But you have some thin command-line interface and you can do some plots. Okay? That's MATLAB, R, Weka, SAS, SPSS -- both commercial and open source stuff. There's really nothing still. There are some attempts that are pretty weak still -- and if you ask me really weak, like not-usable weak. In the open source world it's great. But --. So what we've decided to do is take the best in the research world that we know of, in algorithms and machine learning methodology, and cook it into real professional enterprise-grade software that both a big company with a lot of money and/or a science institute with almost no money can use for their high-value, high-complexity problems. So, anyway, that's all I'll say about that. But if you're interested or intrigued: we need people to work for us who are smart, including astronomers, and we're also making it basically as free as we can make it for astronomers. Okay? So if your institute or group is interested, let me know. So as for your statistics problems, here are some common ones that I'll talk about. The first four, these basically boil down to -- So I realize that a lot of what people end up being interested in after my usual talks are the steps that I skip over, which are: well, what's the best way to translate my problem -- first, my science problem -- to a machine learning problem? And then once I've done that, which type of machine learning problem?
Once I've done that, which method should I use? So I'm going to talk a little bit about that, in a bit more of a tutorial way. Many of you here are already experts in that, but for those of you who might get something out of it, I'll talk a little bit about that. And then at the end I'll come to these last three. So the first four really boil down to -- there's usually a straightforward translation from the problem to a machine learning formulation, but then the issue is computational, and that's what I'll say a little bit about, in a slightly different way than I usually do. And then the last three are basically statistical issues where the standard machine learning methods out of the usual software or textbooks don't quite do the trick; you need the fancy stuff, something fancier. So, well, what are the different kinds of tasks? Basic queries; estimating a density; classification; regression; dimension reduction; clustering; and two-sample testing and matching, comparing two data sets. Okay, I think most of you kind of know what these are. And this is the usual starting point for my talk: if you look at these, and you look at the best ones, they are often N-squared or N-cubed. And I will come back to that. That's a killer. N is the number of rows in your data table, so N-squared is a crushing growth in computation; it means you can't do millions, let alone billions, of objects. And then of course that's just for one run of a method with one setting of a parameter. You actually want to try many, many, many settings of parameters to get the best model and to get error bars -- which we never get to, because of computational cost. We would like to do the whole thing ten thousand times to do a bootstrap or jackknife. Okay, so we never even get that far, because we are crushed just trying to do one run of a sophisticated method. And these are just textbook methods, and they're already difficult. Okay, and these are non-textbook methods. These are more state-of-the-art methods -- these are just examples. But there are many, many of these; the fancy stuff is the stuff you make up to deal with specific problems, like measurement errors or whatever. These are even more expensive, typically. Okay? You want higher sophistication -- you even see an N-to-the-fourth here -- because higher statistical sophistication means higher computational cost. There's almost no way around it. So that's the computational problem, which I will come back to. But for now -- this is probably the most common kind of question that I get, which is: well, I translated my problem to a density estimation problem, let's say. Density estimation is basically like -- a histogram is a density estimator. It's non-parametric: it can fit any shape of distribution. It's one-dimensional, and it's kind of crude because it's choppy. What's the fancier version of that? Well, it's kernel density estimation. Many of you know what that is. It's smoother. It works in general dimension. It just fits the shape of your distribution, and then you can probe that shape and go, "Where is it low? At this point, is it low? Is it a low-density point, an outlier? Is it a high-density point, a common point?" and so on. So of course this is fundamental; you do this everywhere in astronomy. So what's the best way? There's kernel density estimation. There's also a mixture of Gaussians. Those are probably the two that I would suggest for this problem.
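[A minimal sketch of the kind of kernel density estimation described above, using scikit-learn rather than Skytree's implementation; the data array and the bandwidth value are made-up placeholders.]

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))      # stand-in for your data table

    # One smoothing parameter: the bandwidth, the scale of the little
    # kernel function thrown on each point. Gaussian is a common default.
    kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)
    log_density = kde.score_samples(X)  # log of the estimated density

    # Low-density points are outlier candidates; high-density, common points.
    outliers = X[log_density < np.quantile(log_density, 0.01)]

[Naively, each evaluation pass like this is an all-pairs, N-squared computation; the fast tree-based algorithms discussed later in the talk are what make it usable at scale.]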
What are the tradeoffs? Kernel density estimation is your most accurate overall method. It's free -- why is that? -- because it's free of strong distributional assumptions. It doesn't have a lot of parameters; it has one parameter that controls the scale of the little kernel function that you throw on each point. But there's a killer: the reason it isn't used everywhere is that it's expensive, unless you have a fast algorithm, which I'll come to, of course -- I work a lot on that. Okay? But if you don't have the fast algorithm -- and the real fast algorithm, the best one, is unfortunately pretty complicated -- so you really want it packaged up in some reliable software. Mixture of Gaussians: tried and true, and pretty efficient. You use something called the EM algorithm. It's pretty efficient once you have a fixed K, where K is the number of Gaussians that you fit. Of course the problem then is: how many Gaussians do I fit? I kind of want to make it more and more non-parametric, able to fit arbitrary shapes that aren't necessarily really just big Gaussians. So I can throw in more and more Gaussians, but then how many, and how do I search that space of Gaussian parameters and number of Gaussians? That's what makes it fiddly, unfortunately -- meaning you have to try a bunch of stuff. And the other thing making it fiddly is that you get local minima in the optimization. It's not a global optimizer; nobody knows how to do a global optimizer for it that just works. It's a non-convex problem. But it has another advantage: it offers a way to impute. Imputation means filling in, guessing at missing values. So your data table has a whole bunch of question marks in it: well, this measurement was not taken; we don't know the, whatever, ellipticity of this object; no one recorded that. Question mark. If you have a whole bunch of holes in your data table, then the usual algorithm for this, the EM algorithm, can fill those in; it can guess them for you. Whether that's good or bad -- it's better than nothing, because if you don't fill them in you can't use any machine learning method; machine learning methods generally assume that everything is filled in. But it's still kind of an open problem, basically, because you are putting in data that isn't really there. So ideally your learning method would learn its model simply ignoring those question marks, so they don't affect it; it only uses the non-question-marks to obtain the model. Okay? That is possible in certain methods, in certain situations, and that's preferable in my opinion. So I don't think missing value imputation is the end of the road -- it's not totally satisfying. Nonetheless, if you have a problem right now and you have missing values, this is a good thing. It's decent. Okay? So those are some tradeoffs there.
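[A minimal sketch of the mixture-of-Gaussians tradeoffs just described, using scikit-learn's EM implementation; the data, the range of K, and the number of restarts are made-up placeholders.]

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2.0, 1.0, size=(500, 2)),
                   rng.normal(3.0, 0.5, size=(500, 2))])

    # The "fiddly" part: try a range of K, restart EM several times per K
    # (n_init) to hedge against local minima, and score each fitted model.
    models = [GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
              for k in range(1, 8)]
    best = min(models, key=lambda m: m.bic(X))   # lower BIC is better
    print(best.n_components)

[BIC is one common way to search over the number of Gaussians; cross-validated likelihood is another. This sketch does not show EM-based imputation, which needs an EM variant that treats the question marks as latent variables.]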
Classification is the other one I'll spend time on. I won't touch on all the problems -- regression and all -- unless you want to ask me about them later, for lack of time. But there's a whole zoo of methods, especially for classification. In a way it's the hallmark machine learning problem. It's the one that the most effort has been put on. It's the one where it's easiest to understand what it's doing. It applies all over the place, and it gives good results. It's the basis for most of the big application results that have come out of machine learning, you could say. And Google uses machine learning to show you ads -- which ads you might like, based on your query terms. You know, the handwriting recognition that's in the U.S. Postal System: 97% of all mail is done by machine learning. All these big things really are classification under the hood. Okay, but that's why there are so many of them; there are so many different ways to sit around and dream up a new kind of classifier. So let me navigate that. Naïve Bayes is probably your simplest. You could teach this to a kindergartner, as long as they know what a Gaussian is. It's simple and, therefore, it's basically instantaneous as far as speed as well. It's also instantaneous to implement. Those are both good things. But it's unlikely to ever give you the highest accuracy; it's a very simple model. Logistic regression is one level up in non-simplicity, but still pretty simple, pretty fast. It's a linear classifier. The perceptron is a different way to do a linear classifier -- you may have heard of that from the eighties -- also unlikely to give you the highest accuracy, because it's a linear model: the decision boundary that defines, you know, the boundary between the cloud of Class A and the cloud of Class B has to be perfectly linear for this to be a great model. So, decision trees. Now we're starting to get into some powerful methods. This is non-parametric, which means you can prove -- actually only in a shaky way, but roughly it's still true -- that you can fit any distribution, any decision boundary, with this kind of thing. Unfortunately it's also a little crude. It's kind of choppy: in the sense of a histogram, it chops things up into hyper-rectangles. But it has a lot of great properties. It gives you medium-level accuracy, a lot better than these other two generally. And you can interpret the output as rules. One of the earliest machine learning things in astronomy, which I was semi-involved with, was at JPL: a star-galaxy classifier done with decision trees. And back then they were looking at the rules to interpret what the properties of objects are that make them more star-like or galaxy-like. Okay, and it's somewhat fast -- not instantaneous but kind of fast: N log N time to learn, where N is the number of rows. You can easily do mixed discrete and continuous attributes; in other words, some of them are numerical, some of them are A-or-B-type variables. It has a way to do missing values -- in a nice way, actually, that doesn't impute them. It doesn't guess at the missing values; it just ignores them and only uses the non-missing values to make its model. And automatically, at the same time as learning the model, it decides which features it can ignore, the ones that actually weren't useful for making the model -- which is the other big thing you always want to know. Okay, you have a model: what were the features that actually predict whether something's a star or a galaxy, or a quasar or non-quasar? So, a lot of nice properties, but still not the highest accuracy generally. We do a lot of these bake-offs in machine learning -- it's part of the culture, which I like, empirical bake-offs -- and it never really is the winner, but it has these other nice properties that are very practical. Okay?
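[A minimal decision-tree sketch with scikit-learn on a made-up star/galaxy-style table. Note that scikit-learn's tree doesn't have every property listed above -- classic C4.5-style trees handle missing values and mixed attribute types more directly -- but the fast fit and the automatic feature relevance carry over.]

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 4))                 # e.g. colors, concentration
    y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # toy star/galaxy labels

    tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
    print(tree.feature_importances_)   # which features actually mattered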
Then there are random forests. People who were champions of decision trees didn't like that other methods were pounding decision trees in accuracy, and so they found ways to boost the accuracy of decision trees in a way that keeps most of the properties that are good. But you lose at least one. What you do is basically take a whole bunch of decision trees and average them together. So you have to learn hundreds, ideally thousands, of decision trees. They're all a little different -- there are a couple of ways you make sure that they are a little different from each other -- and then you average them. So the model is essentially a linear combination of many decision trees, generally with equal weights. But it's a huge model, of course, and so you can't interpret it anymore. It retains the other good aspects of decision trees, though, and the accuracy now is in that set of methods that can give you the highest accuracy. Okay? I'll just say I won't pick a winner as far as accuracy, because there's never one winner across all problems. It is problem-dependent. There's even a theorem about that, the no-free-lunch theorem, which says there's no winner across all problems; it depends on your data. But the methods in the winning set are all non-parametric, meaning you can prove that they can fit any decision function. Okay? And this is one of them, random forests. Neural networks: there was a time when all of machine learning was neural networks -- it was synonymous with neural networks for a couple of decades. Now it's been reborn as something called deep learning, but it's really the same thing. It has always been pooh-poohed. It got pooh-poohed big time in the mid-nineties because something called support vector machines, which I'll come to next, displaced it in popularity. Because with support vector machines you can learn in a convex way -- it has a convex objective function, so you have a good optimizer for it -- whereas the problem with neural nets is that they're fiddly. It's yet another one of these non-convex objective functions, and so you're always talking about local minima. It has thousands of parameters, usually. And there are all sorts of things, so you end up kind of -- but if you work at it, you can get state of the art. In fact, for a couple of problems neural nets are the best approach. And that's always the case with any method: if you're dedicated enough and dogmatic enough about that method, you can make it the best method for your problem, because you've put more energy into it than anybody else. And there are some important problems where neural nets -- deep learning -- are the biggest thing. Relevant to this community: for some image problems, image classification problems, deep learning is the best thing right now. Okay? Speed-wise, slow-ish to train but fast at prediction time -- so still medium, kind of like decision trees on the [inaudible] front. Nearest neighbor: a very simple method, but among that set that gives you the highest accuracies. And my favorite for certain astronomy problems is, you could say, an elaboration of nearest neighbor: kernel discriminant analysis. I like it because it is in that set that can give you the highest accuracies. It has just a few parameters. It's expensive unless you have a fast algorithm; if you do, it's okay. And it has interpretable probabilistic semantics, which people like Bayesians -- and other people who want to think in terms of probability distributions -- like and feel comfortable with in terms of interpretation. And it gives you an accurate estimate of the probability of being in class one or two, which many of the other methods don't focus on, and so they don't give you good estimates.
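[A minimal sketch of kernel discriminant analysis in the form described above: one kernel density estimate per class, combined through Bayes' rule to get class probabilities. Assumes scikit-learn; the bandwidth is a made-up placeholder, and in practice a fast tree-based KDE is what makes this affordable. For comparison, the stock random-forest averaging described earlier is just RandomForestClassifier(n_estimators=500).fit(X, y).]

    import numpy as np
    from sklearn.neighbors import KernelDensity

    def kda_fit(X, y, bandwidth=0.5):
        classes = np.unique(y)
        priors = np.array([np.mean(y == c) for c in classes])
        kdes = [KernelDensity(bandwidth=bandwidth).fit(X[y == c])
                for c in classes]
        return classes, priors, kdes

    def kda_predict_proba(X_new, classes, priors, kdes):
        # log p(x | class) + log prior, normalized across classes
        log_post = np.column_stack([k.score_samples(X_new) for k in kdes])
        log_post += np.log(priors)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(log_post)
        return p / p.sum(axis=1, keepdims=True)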
Okay, and then the support vector machine. Anecdotally, across all problems -- if you had to pick one winner averaged over many, many problems, what's the most common winner? -- it is the non-linear support vector machine. Usually by just a little bit, but if you had to pick one winner, this would be it. And unfortunately this is the biggest open problem of my lab. My lab aims to scale up all machine learning methods; this is the one that I can't really do yet. This is the thorn in my side. No one can do this yet, basically, and make it scalable. If you could, it would be the one to use in terms of pure accuracy. Okay? So if you look at this list and you focus on the non-parametric ones, the high-accuracy ones, the ones in red: well, why are they expensive? Why are they N-squared's and N-cubed's? Many of you know this if you know my research: it's because they involve all-pairs computations. They're comparing each point with each other point, computing a distance or similarity or kernel between them. Okay? Recently there was this National Academies report on massive data -- it's finally become a topic of national interest -- and I wrote the chapter there looking at the deep computational bottlenecks in the analysis of massive data. And these problems tend to be of a kind whose category I made up, "Generalized N-body Problems": things involving pairwise comparisons. For those of you who don't know this -- and I usually don't say it this way, but this is my provocative claim -- anything to do with pairwise comparisons, I believe we know how to take from N-squared to order N. We have done this, I think, for all of the common things that you see in statistics. Okay? For example: for every point, find its K nearest neighbors. That's N-squared naively; we can do it in order N. This is all provable now. It took me many years. We have now proved this in the computer science sense of worst-case complexity, which you may not care about, but computer scientists always told me, "Ah, it's just a conjecture. Prove it." So we finally proved it. And it is a little non-intuitive -- which is why it took me a while -- why it is that you can do this in order N time. Okay? But my claim is: if it looks like you're looking at something with pairwise comparisons and you have to do them all, you probably don't need to. And this includes things like friends-of-friends. It includes N-point, where it's even worse: N-point correlations are pairs in the two-point case, or triples or quadruples in the higher-order cases. Okay? We did the first fast algorithm for general N-point that's exact a long time ago, in 2000. But now people smarter than me, like my student Bill, who I'm going to promote in this talk, have done the largest-scale three-point correlation to date published -- and there's even faster now. And actually, in practice, when you're doing N-point calculations using a random set, and you're doing many, many re-samples or a jackknife, and you're doing it for many, many scales -- if it's three-point, it's many different triangle settings -- it turns out you can do all of them at the same time, in less time than doing all of those things individually. And you can get many orders of magnitude of speed-up by doing that. That's another paper that Bill did with Andy Connolly.
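[A minimal illustration of escaping the naive all-pairs cost with a space-partitioning tree -- in the spirit of, though much simpler than, the dual-tree algorithms referred to above. Assumes scikit-learn.]

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 3))

    # Brute force would compute on the order of 10^10 pairwise distances;
    # the kd-tree prunes whole regions and answers all queries far faster.
    nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(X)
    dist, idx = nn.kneighbors(X)   # K nearest neighbors for every point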
Okay. So real quick, statistical issues that come up. Measurement errors: a big issue, because some objects are close and some are far away, so they have different measurement errors. So we're working on how to do some of these fancy non-parametric things that are based on kernel density estimation in a way that accounts for measurement errors; we have results that are about to come out. Two more slides -- three more slides. And then, everything is now time domain -- all the next-generation surveys -- and so you would like to be able to do all of this stuff not just on static objects but where each object is actually a time series. The problem with that is that time series are variable in length and so on; they're not all the same shape. But there are ways to do machine learning -- where the standard paradigm is that everything is the same shape, a fixed-length vector -- on funny objects that are not the same shape at all, like graphs, sentences and time series. So we have some ways to do that. And then finally, even though we can run certain sensors on a large scale all over the sky, there are still other, more expensive ones that we can't run all over the sky on everything. So the question arises: what are the best objects to measure? This is sometimes called active learning. If I only have a fixed budget of objects to measure, to obtain some underlying function, then what should they be? That, it turns out, is a natural concept to want to formalize, surprisingly. It's taken a long time, but very recently people have done this in a rigorous way, where you can prove that by not choosing everything -- and not choosing randomly, but in a deliberate way -- you still get some guarantee on the error. Right? Because if you only choose some of them in a funny way, you could just learn the wrong function. But there's a way to do it, and we've done one way of doing that.
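[A minimal active-learning sketch: plain uncertainty sampling, which is one common strategy rather than the specific method with error guarantees alluded to above. Any classifier with class probabilities works; logistic regression is just a placeholder.]

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def pick_for_followup(X_labeled, y_labeled, X_unlabeled, budget):
        model = LogisticRegression().fit(X_labeled, y_labeled)
        proba = model.predict_proba(X_unlabeled)
        uncertainty = 1.0 - proba.max(axis=1)     # low max-prob = least sure
        return np.argsort(uncertainty)[-budget:]  # indices to measure next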
Okay, so I'll just mention that the software implements all of these things. How do you find out more? That's the last slide. If you want to find out more about how to do machine learning in astronomy, there are some other good books coming out, and we have one with these co-authors where we'll have a longer exposition of these kinds of tips and tricks -- "What sort of methods should I use for my problem?" -- as well as explanations of them. If you want scalable software to do this stuff, where do you get it? It's not scalable, but there's Python code in our book that makes machine learning easy. There is scalable serial software -- one-machine software -- that's free, open source, from my lab. And then of course if you want the fastest and most powerful, that's for pay, but we'll make you a deal -- you and any scientist, basically -- to support science at cost. So just talk to me directly if you're interested in that. And if you're interested in both, the upcoming expert is my student Bill, who works in astronomy doing scalable astrostatistics. And if you're interested in machine learning as a career, talk to me about that: we need smart people at Skytree. All right. Thanks. [ Audience applause ] >> : Okay. Questions? Yeah. >> Alexander Gray: Yeah? >> : Can I have the microphone and a soapbox? I have to disagree with your characterization of kernel density estimation. >> Alexander Gray: Okay. >> : Can you make it louder? >> : Oh, I'm attacking the characterization of kernel density estimation as, you said, the most accurate. It actually throws away information. Any smoothing technique, of course, discards information at the finer scales. In a sense the trade-off is that you want pretty pictures of your density rather than accuracy, in some sense. I didn't have time to talk about it, but the Bayesian Blocks algorithm I described in one dimension can work in higher dimensions. And I refer to a paper with [inaudible], the senior author, where we apply that to the Sloan Digital Sky Survey and basically get a density estimation that essentially represents all the information that's present, without any bias at scale, and doesn't do any smoothing. >> Alexander Gray: Okay. Well, I'm certainly open to there being many methods that are good. But I'm not sure how you can do any kind of density estimate without smoothing, because if you're not... >> : [Inaudible] tessellation does it in order N in a simple way. [Inaudible]... >> Alexander Gray: But that's a discretization, which is a form of smoothing. In fact it's a chop. >> : Well, that -- yes. In this context the data is discrete anyway, so you're just representing the density information in the data. >> Alexander Gray: Okay. So if it's discrete data, then you can use a discrete method and not smooth. But if it's continuous data, you have to do some smoothing. I think that's --. But, anyway, not that there aren't other methods; I just listed, you know, standard off-the-shelf textbook methods. That doesn't mean there aren't methods from the research literature that should be in the textbooks. >> : Okay. [Inaudible] question. >> Alexander Gray: Oh. Maybe David's question. >> : Yeah. >> : So one of those things you alluded to in your talk but didn't make explicit is the difference between generative models and non-generative models. Some of these -- for instance, just specializing to classification -- some of these techniques effectively build a generative model for the data and then use that generative model. And I think from the point of view of astronomy there's a lot of advantage to generative models over non-generative models, because with generative models you're better able to deal with missing data, which I consider to be an absolutely essential [inaudible] for astronomy. You can deal with the fact that you have [inaudible]. But more important to this community, I think, in this room, is this issue of utilities. Whenever we're classifying, we're always classifying for objectives. We have objectives. And those objectives -- we don't necessarily want the most [inaudible] results. We want the results that produce the most value for us going forward -- [inaudible] select things for follow-up or whatever. And I think somehow that's missing in this, this thing about classification. It's not clear at all where you can insert your utilities. >> Alexander Gray: Yeah, so you hit a number of things, which I'll try to remember and address. So, you know, generative versus discriminative -- yes. Typically you get a couple of things extra out of a generative method that you don't get out of a discriminative method, usually at a cost: at the cost of not being the very most accurate. The discriminative ones do less work, in a sense; they are not trying to get you class probability estimates. And those are really important to you, and you want them to be accurate -- that's why I like things like kernel discriminant analysis, as I said.
But, you know, you do pay a little cost for that, and that's why a kernelized support vector machine, which is discriminative, is slightly more accurate in general: it only worries about the decision boundary. Another advantage, you could say, of probabilistic or generative methods is the missing value thing. But that's only one way of doing missing values -- imputation -- which I cautioned about. It's not the only way. It has some pluses and minuses, so I wouldn't necessarily tie them together in a one-to-one fashion: "Oh, you have missing values? You have to do a probabilistic method." Not necessarily. But, yes, you often do want a confidence on -- you know, I called this thing type A instead of type B; what's the confidence of the model? That's pretty common. You can get those out of a discriminative method, but it's usually a hackier way of getting probabilities out. So it just depends on the relative importance of all of those things in your application. And then utility: there is a formal way to put utilities on top of machine learning. It's very simple. In your decision function -- you can eventually write any machine learning method, if you know the basics well enough, as a Bayes classification task in terms of Bayes' formula. It doesn't mean you have to be a Bayesian; you just write it in terms of Bayes' rule. And then you can throw utilities into that formula, and thus you can say, "Well, it's more important to me, for example, to avoid errors where I call things that are really type A type B, versus calling truly type B things type A." That's one form of utility that's not equal weights on the errors. That's easy to adjust for by putting utilities around the overall loop. And there are other ways that you can put utilities into things. So I don't know if that addresses everything, but --. >> : For instance, in the kernelized SVM is there a straightforward way to include utility? >> Alexander Gray: Straightforward, no. But that's why I said that if you know the basics well enough, you can essentially re-derive all of the methods in a way that puts utilities into the parts that you care about. >> : That would be a valuable contribution for science. >> Alexander Gray: Well, let me know if there's a particularly common instance of that in astronomy. That's something we could do.
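[A minimal sketch of the utility construction described above: take class probabilities from any probabilistic classifier and pick the class that minimizes expected cost, instead of the most probable class. The cost matrix here is a made-up example where calling a true type A a type B is five times worse than the reverse.]

    import numpy as np

    cost = np.array([[0.0, 5.0],     # cost[true class, predicted class]
                     [1.0, 0.0]])

    def decide(proba):
        # proba: (n_samples, n_classes) class probabilities P(class | x)
        expected_cost = proba @ cost   # expected cost of each possible call
        return expected_cost.argmin(axis=1)

[With equal costs this reduces to picking the most probable class; unequal costs shift the decision boundary toward the cheaper kind of mistake.]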
>> : Okay. Last question, Eric. >> : I'm not sure if I'm asking the same question as David in a more naïve way. But David very quickly mentioned [inaudible] measurement errors. So in a table, a [inaudible] table, every cell has its own personal measured measurement error. My question is: do any methods -- [inaudible] -- but do any classification methods or other methods allow these to be taken into account? >> Alexander Gray: Off the shelf, no. And this is of course a huge issue in... >> : Is that the same question you were talking about...? >> Alexander Gray: Well, no. But I do have a... >> : [Inaudible]. >> Alexander Gray: I do have a similar answer, which is... >> : Okay. >> Alexander Gray: ...you know... >> : This is a grand-challenge-level need. Because astronomers spend a vast amount of effort on measurements and measurement errors, and then they go to the statistician and there's nothing but [inaudible] pi squared. [Inaudible] capabilities. And so there's a huge disconnect between the practice of observational astronomers and the availability of modern methodology to incorporate it [inaudible]. So this is not a trivial issue when you ask about measurement errors. And I'm not sure how closely this is related to your point. >> : I consider it completely related, because for me missing data is just an extended case of measurement error. >> : Yes. >> : Your missing data is just [inaudible]... >> : Missing data, which is sometimes a truncation and sometimes just missing. And we have censored data and we have measurement errors, and in astronomy the truncation, the censoring and the measurement errors all come from the same source. And we have no mathematician who's willing to tackle this thing. I've tried to get you guys to work on it. >> Alexander Gray: No, we're working on it. Let me just give you... >> : We'll -- sorry. [Inaudible] very quickly, then we'll start our second talk. >> Alexander Gray: So the rough rundown of the situation in the literature is that this was worked out for linear regression, and that's basically it in the statistical literature. This problem is called, there, "Errors in Variables." So it's even got... >> : There's two books on it. >> Alexander Gray: ...funky names, and some of them make it hard to track down the... >> : But it's useless. >> Alexander Gray: It's kind of useless. And then there are two cases of it: homogeneous errors and heterogeneous errors. Of course we're interested in different errors on different points; that has even less work on it. But we want to do this now for fancier -- for all machine learning methods. Unfortunately, you have to re-derive each one from scratch, basically. It takes a new method, a custom method. And so we're trying to do that for a certain class of methods. David's done that for a certain class of methods. So they exist; there are a few things coming out. But it basically takes custom development of new methods. It hasn't really been done in the stats literature or the machine learning literature.
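[A minimal errors-in-variables sketch for the one case described above as worked out -- simple linear regression with a known, homogeneous measurement-error variance s2 on x. The naive least-squares slope is biased toward zero (attenuated), and dividing out the attenuation factor corrects it.]

    import numpy as np

    def eiv_slope(x_obs, y, s2):
        # naive slope is attenuated by var(x_true) / var(x_obs)
        var_obs = np.var(x_obs, ddof=1)
        naive = np.cov(x_obs, y)[0, 1] / var_obs
        return naive * var_obs / (var_obs - s2)   # undo the attenuation

[Heterogeneous, per-point errors -- the astronomically relevant case -- have, as noted, far less off-the-shelf support.]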
>> : Okay. I'm afraid we should call this discussion stuff [inaudible]. I'm sorry. We really have got to move on. [ Audience applause ] >> : Alex, none of your links from Fast Lab work. >> Alexander Gray: Oh, yeah, I keep hearing that. >> : Okay. Thank you, Alex. So right into the panel discussion with George Djorgovski, Yan Xu and Mike Kurtz, a talk about [inaudible] models and [inaudible] publishing. [ Background noise continues ] >> George Djorgovski: Well, I may as well start. So for years we've all been kvetching about the need to reinvent scientific scholarly publishing, enabled by information technology. And there are two aspects to this. The first one is that the variety of scientific product has increased dramatically. Whereas it used to be more or less just research papers in journals, now we have all manner of other stuff that's very much intellectual product: data archives, algorithms, workflows, just simple ideas, people debating interesting things in blogs, etcetera, etcetera. And I think first of all we have to develop a system for acknowledging and collecting all that worthwhile product in a meaningful fashion, and also figure out a new model for quality control. Because we have a peer review system which is effectively broken -- because there are too many papers and not enough time, and essentially it's used more for people to undermine their competitors than to advise colleagues to do something better, or nothing at all. So we need to look into that, through forms of maybe crowd-sourcing or whatever. The second aspect of all this is that we're simply not using technology to deliver whatever content we want: most of the stuff that we do mimics printed paper, which is crazy. We need to think of native electronic publishing that has nothing to do with ink on paper and can link to different varieties of digital products -- media, movies, simulations and whatever -- in novel fashion. It doesn't have to be structured as traditional papers. And then we have to convince the community to embrace all this. >> Yan Xu: Right. So speaking of that, George and I -- I think I was reading this book that we put together several years ago regarding computational education for scientists. So in 2007 that was the first conference I [inaudible] at Microsoft Research, and there were like three astronomers in the room, among other professionals who were keen on promoting computational education for scientists. And we also talked about a new way of publishing ideas. We've been talking about it since then, for the past several years, and a lot of things have changed. First of all, two years ago when I was doing the same thing -- you know, putting together a book for the conference -- the first criticism I got at the opening of the workshop was, "Oh, we're going green. We shouldn't print them anymore." So I'm totally for doing an online electronic version, which is good not only for publishing the papers, results and data but mostly for promoting ideas. And now the question is: if the community is not into it, then we may have, you know, a few [inaudible] show up, and then it's not sustainable without a group of people dedicated to maintaining the quality, like George was saying, and to keeping new content coming from different disciplines. So I'm here to listen to you for ideas, and I can go ahead and perhaps implement some mechanism to get this started, for Microsoft Research to support the community. But I need some of you, at least -- diehard people who are going to devote themselves to it and keep this going. So, again, I can have a sign-up sheet for whoever wants to be a victim in the first group. >> Michael Kurtz: Okay. First I think I'd like to say that I'm very unhappy that Lee Dirks isn't here to be on this panel. >> George Djorgovski: Yes. >> Michael Kurtz: I think that's all I want to say. I miss Lee Dirks. Astronomy for the past 25 years has been building publication mechanisms and libraries completely independent of the journals and the old structure. I'm going to list several large entities that are all 25 years old or so, and you'll see how that works: ADS, CDS/Simbad, NED, the arXiv, CADC, the main archives that are home to Einstein and Chandra, or MAST, or the ISO Archive -- it's interesting that all three of those were founded when Riccardo Giacconi ran those organizations -- the HEASARC, IPAC, the European Space Agency has several, the Super OR Archive, etcetera. That's where we're spending our money. The amount of money going to those dwarfs the money that's going to the journals -- our journals plus all the libraries combined. We've already decided to move away; we're already doing it. Maybe it's time to formalize that better, but we're kind of leaders in that field among many of the scientists. As for something new to publish, I just thought I'd say one thing that seems to me obvious, and that's very short forms: things that don't really belong in journals the way journals are structured but need to get out quickly, like a graph or a table.
Very often a result is one graph, one data table, and all the paper around it is just, you know, getting that out. The people who understand it will understand it just by looking at the picture. And why wait six months to write? Why not put up the picture? We don't have a mechanism for that, and that's the mechanism I would most like to see. So I think that's enough to introduce. >> George Djorgovski: Yeah, over the coffee we've been [inaudible] about how maybe we should start a completely novel journal for astroinformatics along the lines I talked about before: fully digital, different media, experimenting with all kinds of things including peer review. And I assure you the exact same discussion happens in e-science conferences in general. So there are two problems here: some people need to devote serious time to actually start and run such a thing -- in our copious free time, right? -- and some institution needs to sign up to it. And I mean a respectable institution, not like those mushroom journals that are popping up all over the Internet and are essentially vanity publishing. That stuff's irrelevant. And, okay, you can say, "Well, there is the new journal, Astronomy and Computing." And I appreciate the good intentions, but I'm sorry, I think that was exactly the wrong thing to do, because (a) it's an old-style paper journal and (b) it's going with a commercial publisher. And I think that commercial science publishing is already dead -- they just don't know it yet -- just like newspapers, just like the old music industry. >> : Can I make a comment on that? One of the problems is that some of us need to continue to build our careers. In order to do that we have to publish papers and get funding, and then we... >> Yan Xu: Did you say career? Build up a career? Yeah. >> : And so we're graded on the way we publish, to a large degree. And so, you know, we can start up our own little journal, and that's wonderful. But no one's going to publish anything because... >> Yan Xu: Exactly. >> : ...you want to get that job, you want to get that promotion, you want to encourage your students to publish in top-level journals and so on. What may be -- it may not be possible, but a better avenue is to talk one of the existing, well-respected journals into having a branch where they try some experimental thing, so you can piggyback off the reputation of that journal but still try these new things. So, you know, should we be getting people from those journals along to these kinds of meetings, to get them excited and get them thinking about how this might happen? [ Multiple audience comments ensue and continue in the background ] >> George Djorgovski: That's a really good idea. >> Yan Xu: Yeah, yeah. >> George Djorgovski: That's a really good idea. >> Yan Xu: Uh-huh. [ Audience conversations continue ] >> Yan Xu: So that's --. That's the answer to my question I just posted on Facebook. If you wouldn't mind, go comment. Yeah, I just posted a question and you provided the answer -- at least one way of --. >> George Djorgovski: What was your question? >> Yan Xu: My question was: how can we come up with a mechanism, you know, providing incentive and encouraging that? You can't just ask people [inaudible] and then create a career problem for them. They will have an identity problem when they graduate, if a student is doing that. Where do they go, right? As an editor? As a physicist? As a computational scientist?
>> : It's like these online classes that Harvard and Princeton are putting up, you know -- different ways to get an education that piggyback off the reputation of an existing infrastructure. And maybe, I think, there'd be an avenue that way. >> Michael Kurtz: Yeah, consider Physical Review X. >> : Exactly. Yeah. >> : Okay. I hear that's what I [inaudible] for that journal. I agree. I have long agreed with most of the doubts that George has, and I've expressed exactly the same things as a number of you. The thing is, the journal has a very key property which is missing [inaudible] that George mentioned, and that is the property of actual existence. It actually... >> George Djorgovski: What? >> : ...exists. >> George Djorgovski: Oh. >> Yan Xu: Actual. >> : Actually exists. It [inaudible]. In every other respect it's a very boring journal. It looks very much like all journals, so it takes a lot of the [inaudible]. And that was very deliberate. We weren't trying to be experimental under any heading other than getting this community a journal. And as for the question of trying to piggyback on other journals, we thought of that as well. Before we started this, we were in touch with the editors of [inaudible] and, well, the big three astronomy journals and other journals. We sent them some [inaudible] abstracts and said, "What would you do with these? Would you drop them on the floor, or would you at least send them [inaudible]?" And the best one said, "Drop them on the floor." Even A&A, which has a section on... >> George Djorgovski: Experimental [inaudible]. >> : ...[inaudible], said, "Well, that wasn't very [inaudible]." And basically they said, "We're not interested unless you can talk about the astronomical results coming out of this technology. Just technology is not interesting." So the journals [inaudible] were journals that were too [inaudible] to think anything. >> : And yet those journals would probably publish survey design papers. [ Multiple audience comments ensue and continue in the background ] >> George Djorgovski: Or instrumentation papers. What's the difference between a hardware instrument and a software instrument? >> : Well, and I agree. I think it would make perfect sense for these journals to publish these papers. But they said no. They didn't say maybe. >> George Djorgovski: I think this is an aspect of the great cultural shift problem that we've been talking about off and on through the meeting, in this case applied to journal boards or editors. But maybe the better approach is to go not through astronomy but through the e-science community, which is much bigger than astroinformatics, has the exact same issues, and probably does have a critical mass -- which I'm not sure astroinformatics does -- to actually do something. Start a respectable journal within a respectable society that will deal with these issues, especially ones of universal importance like, "This is how we deal with such-and-such data analysis problem or database problem," and so on. Maybe that is a more viable way to approach it. Then maybe some day there will be [inaudible] e-science supplement or astroinformatics supplement or something like that. >> Yan Xu: So I'll just comment a little bit and then switch to you, because we talked about this last night. My comments were from an active perspective, saying, "What value do you provide to the rest of the community by bringing it to e-science? Why would [inaudible] informatics read your journal?" So that was quite [inaudible].
And this morning I thought about it. I guess we can position it the other way: if we wanted to join the e-science community as a branch, astroinformatics, maybe that's a way to push us to think more broadly. And then when we publish, we really want to focus on the general aspects of computational... >> George Djorgovski: Well, you know... >> Yan Xu: ...challenges... >> George Djorgovski: People who build instruments... >> Yan Xu: ...that provide value to the rest of the community. >> George Djorgovski: Yeah, people who build instruments might read optics journals. But a general astronomer will not read an optics journal or... >> Yan Xu: It's a different metric... >> George Djorgovski: ...something like that. >> Yan Xu: ...to measure this. I'm sorry. >> : I was just somewhat following up on this comment of PRX or ApJ astroinformatics, recognizing the serious issue of careers. >> Yan Xu: Right. >> : I think what should get some really serious thought is: is there a way to change something like ApJ Letters into an ApJ Experimental, or some more experimental aspect of ApJ? I've certainly written to the ApJ Letters editor and said, "I don't think this paper should be published in ApJ Letters, because you might as well just put it in the main journal and then [inaudible] as soon as it's accepted." That's way faster than ApJ Letters. And so the Letters -- I'm picking on ApJ, but I think at least a couple of the other big four have a letters section. Figure out a way to do away with the letters section, because that's outdated, and move it to this more experimental... >> : Well, once upon a time -- they may still have this statement -- ApJ Letters had in their, you know, instructions to authors that they were accepting papers that had no lasting value. I was like, "Wait a second." >> Yan Xu: Right. >> : Since we are talking about science fiction. >> : Speak up. >> : We are talking science fiction, about dreams -- about dreams which need to be implemented from scratch. Now, a very stupid question: we have something which is already in place, where all of us send our papers, which is Astro-Ph. It is not refereed. Wouldn't it be easier to transform Astro-Ph into Astro-Ph Version 2? So basically you submit your draft, then Astro-Ph takes care of the refereeing process. Then, when the paper has been refereed, you get the flag -- which is what we want -- that "this paper is refereed." And then you also have an automatic way of measuring the impact of the paper on the community -- of the paper, not of the journal -- by looking at the number of downloads and eventually the amount of feedback. So this is an idea, I must be honest, which we [inaudible] already many years ago. I mean, how to get feedback from the community which is the equivalent of the Facebook "Like." I mean, [inaudible] I haven't entered a library in many years, and I love ADS because [inaudible] everything is there. But usually what I do is go to ADS, find the new papers in which I am interested, and try to download them. They are protected by the copyright of the journal. So I go to Astro-Ph and download them from there. So basically all our reading of astronomy journals is through Astro-Ph. Now, what is the reason for the proliferation of these journals? We all want to reach the astronomers. We want to reach computer science; there is a computer science section also inside the [inaudible] archive, so it should not be difficult to do these things. Because to be on the scientific [inaudible] of a journal, that's prestige. I don't know.
I don't understand the proliferation of journals. I have been a member of several scientific boards; well, [inaudible] didn't gain [inaudible]. We don't need to have the support of a big community like [inaudible] or, you know, the American Astronomical -- I was thinking about this last week [inaudible]: we already have our board which creates consensus. It is the whole community, which cannot live anymore without Astro-Ph. >> Michael Kurtz: The way high energy physics normally works now is you submit to Astro-Ph, you let it stay up for about two weeks, people complain and write you back, you change the paper, and then you submit it to the Physical Review, which allows you to submit it by sending an Astro-Ph number -- and they download it from, you know, the archive themselves. So high energy physics already has a rather different model in that respect. The editor of Physical Review D told me once that he reads papers -- when they're already up in the journal, he still reads the archive version instead of the Phys Rev version, because it's one click rather than two on SPIRES. That's the editor of the journal. So everybody else does too. There are plenty of statistics that show that. >> : Question over there. >> Yan Xu: [Inaudible]. >> George Djorgovski: Give him a mic. >> : For professional journals [inaudible] --. For journals, I suggest maybe we should do more surveying of users' requirements; if we meet the users' requirements, we can be much more successful. For example, for the Chinese community, for scientific journals we have some basic requirements. A journal has to be indexed by the major index systems, for example ISI and EI. Yeah, open access journals are very popular in some countries, but for Chinese users they are not popular. The main reason is that most open access journals are not indexed by the ISI or EI systems. For us, or maybe for graduate students in China, even if you publish ten papers or even more in the newly created computing journal, it is not usable at all for you to graduate or to get a promotion. Another example is the proceedings of ADASS and the proceedings of SPIE. They are quite different, because ADASS proceedings are very useful for our software developers and the [inaudible] developers. But we are not eager to submit papers to ADASS, because it has no use for us for promotion. But SPIE proceedings are indexed by EI, so we are happy to publish our papers -- to submit papers -- to the SPIE meetings. That's also why, for the IAU symposia, because they are indexed by ISI, many people, maybe from China and maybe from other countries, hope to submit one or even more papers to an IAU symposium. So that's my opinion. >> : So I mean that is essentially Darin's comment, I think: that you need some formalism, like being indexed by ISI -- something like refereeing, or maybe replacing refereeing. It doesn't have to be refereeing, but you've got to have some process which instills a minimum standard... >> George Djorgovski: Those are two different problems. There is the respectability problem and there is... >> Yan Xu: The recognition... >> George Djorgovski: ...the [inaudible] peer review problem. And I think what [Inaudible] was alluding to is that some form of crowdsourcing effectively should be a peer... >> : [Inaudible] it was too fast. I was saying: just have Astro-Ph run the refereeing mechanism. >> George Djorgovski: Oh. >> : Submit your paper and people can begin to read it. Meanwhile it is refereed.
Then afterwards, if it is accepted, there will be a light, a green light, on the paper saying it has been refereed [inaudible]. >> George Djorgovski: Perfect. >> : Yeah. >> George Djorgovski: Astro-Ph will never change, because Paul Ginsparg never wants to change anything. >> : That's right. >> George Djorgovski: And they don't have the resources. And, moreover, they're just rapidly disseminating old-style papers, right? The word "paper" tells you the problem. We're so used to ink and paper. >> : Yeah, I am the referee for instrumentation and methods for all of Astro-Ph. I read the titles and look at the authors, and normally that's enough. Somebody does look at every paper before it goes up -- it's an undergraduate assistant. And if it doesn't meet the form of a paper -- has references and all that -- it's flagged, and I have to look at it. If it's a post or something, it doesn't go up. They're not willing to put things like that up, as George just pointed out. >> : If I can just make one comment on the refereeing? What some other disciplines do, particularly [inaudible], is [inaudible] the formal refereeing -- because they couldn't get people to referee papers -- and they effectively have crowdsourcing: they post the paper, they invite comments from other people, and then some editor at some stage reviews the comments and gives it a "Go," or "No go," or "The paper needs to be revised." So there are other models already out there. Eric? >> : I've been an ApJ editor for seven years, one of nineteen scientific editors, and to a considerable degree I serve this community, in the sense that astrostatistics and astroinformatics papers often are assigned to me. So I actually play a role as the traditionalist in the room, and now I'm going to give you my opinions -- they are not official opinions of the Astrophysical Journal; they're mine. First, you should know that the [inaudible] and the AJ actually do accept astroinformatics papers. There's an informal sort of feeling that we don't like bare code manuals; we don't see how the code, you know, patterns [inaudible]. But if you describe the methods, and then have an appendix on the details of the code, and apply it to some example and show that it's useful, then we will accept it. Our criterion is new and significant research in astronomy and astrophysics, and there's no statement that it requires a telescope or physics. It can be pure instrumentation and pure informatics. >> : So if... >> : However, despite the fact that this is true, for your community -- this community here -- it doesn't seem to work, in the sense that very few submissions are made on codes or informatics. So few that you don't even know what I just said, because it almost never occurs. So sociologically it has failed. But by policy [inaudible]. And that's my first comment. My second comment... >> : Sorry. Can I just ask you -- so, clarification: would you accept somebody who writes a paper about a new algorithm... >> : Yes. >> : ...without actually applying it to any data? >> : No. >> Michael Kurtz: See, that's the problem. >> : Yeah, but it's not a big problem, because we don't have very high standards about the application part. So what if you apply it to a junk problem? You know, it doesn't have to be an innovative result. It just has to show that it applies to some star or galaxy or some image or some time series or whatever you want. So we have low standards on the applications, but we require [inaudible] application. This is informal.
This is not a formal requirement, and maybe you could convince us to change our style on this. It's informal. >> Yan Xu: Right. That's what I was going to say. Do you regularly revisit the metrics or the rules that you use to...? >> : We should. [Inaudible response continues]... >> Yan Xu: For example, you may learn from this community. Yeah. >> : ...but it's informal. So it'd be useful for us to formally state this in [inaudible] journal online. Would you then use the journal more? >> : Yes. >> : Because almost nobody is submitting anything. >> Yan Xu: Now you catch up. We match. >> George Djorgovski: See, that's the problem. This discussion is going in the wrong direction. We're not in need of a new journal. We're in need of a different... >> Yan Xu: New way. >> George Djorgovski: ...way of publishing that can contain... >> : [Inaudible]. That's my second point. I'm talking about the traditional journal, traditional paper articles that are now online, okay, [inaudible] PDF. My second point is about ApJ Letters. And, again, this is a personal interpretation of discussions that I've heard. In my opinion, and in the opinion of perhaps some others, ApJ Letters has lost its original purpose. The original purpose was to be fast. And the reason it lost it is because ApJ, which used to take ten months, is now ten weeks. ApJ is as fast as ApJ Letters used to be. Okay? And that's all technology; it has nothing to do with anything else. So all the delays in ApJ are not due to the editorial process, okay, they're due to the authors. It... [ Inaudible background conversation starts and stops ] >> : ...[inaudible]. Admittedly ApJ Letters is making seven weeks instead of ten weeks. There's some difference, but it's very small. So there's actually discussion that ApJ Letters, having lost its purpose, is purposeless. And I had actually hoped that there would be a change of purpose in the last change of editor, which just occurred; but due to lack of courage and, frankly, lack of consensus and lack of good ideas, no change was made. You should know that the [inaudible] pub board and the editorial boards don't like the situation of ApJ Letters, and we sort of want a new purpose, to move to something new. So if there's anyone in the room who has actual new ideas within the context of ApJ Letters -- we're not going to become a blog, okay; it's just not going to be totally different. But in the context of a traditional journal, if you see ways of innovating, I think you should send your ideas to Ethan T. Vishniac, the Editor-in-Chief, because I think he is actually open to ideas. I don't know that, but I think he is. [Inaudible]... >> : Before [inaudible] comments, can I ask the panel... >> Michael Kurtz: All right. Yeah. First, I showed a whole bunch of plots yesterday; with the exception of the one with the fish pond fire pot, which was in SPIE, all the rest were in ApJ, AJ, PASP and ApJ Letters. It's quite easy to publish these things if they're part of some sort of scientific, astronomical thing. Most of them were just software. The eigenvector paper is just pure software. There are a couple of different directions that we can talk about. One is making the current publications computable, which they're not. And the other is publishing things that are outside the realm of a ten-page paper. George is addressing the second. I was thinking of the first, but they're both ways of pushing nontraditional things. The journals have not been very good at either of those things -- at leading.
It's one of the reasons why there are 20 new libraries and publications over the last 25 years, none of which are the journals and none of which are the traditional libraries. To be blunt, I don't think that modifying ApJ Letters is really what George thinks is important, and I don't either.... >> : No, but other people do. >> Michael Kurtz: I do think that making the papers computable is important and that.... >> : What do you mean by [inaudible]? >> Michael Kurtz: I mean, making it so that computers can understand what the papers are. And that requires work in publishing. It makes it so that papers can't be free. It sort of goes in the other direction from open access. Dense semantic tagging is one of the catchphrases for that. Simbad of course does that already. The librarians at... >> : [Inaudible]... >> Michael Kurtz: ...SV do that already. >> : You have to talk to the technical people about this. I don't know enough. >> Michael Kurtz: But, yeah, we'd do that. It's one of the things that ADS is looking at with librarians. But that goes in a different direction than making publication faster and toward what people are actually doing, and I think that's what George is talking [inaudible].... >> George Djorgovski: One of the things, but maybe... >> : Well, can we hear from Yan next? >> Yan Xu: I want to say that other than the [inaudible] and the scientific outcomes for a journal, there are other things that you have to worry about: the political impact and social impact, and the administration and logistics of running a journal. So we should not invent another journal; that's not what we're here for. We're talking about new ways of publishing. If there already exists a vehicle that we can use, we should take that vehicle down a different path that matches what we want for today's publishing. So that should be the direction of this discussion, rather than, you know, creating another vehicle. >> : George, do you want to add? >> George Djorgovski: Well, that's exactly right. We are so brainwashed into thinking in old format-style papers in a journal that this discussion veered in that direction: "Let's improve ApJ Letters." No, that's not the problem. The problem is exactly what Yan said: to be able to publish scholarly output of different kinds, so that you can publish a data set or an archive without having to write a bogus paper around it, right? Or I ought to be able to publish a one-paragraph idea saying, "This is a good idea. Somebody should do this. I don't have time," and somebody might do it. And maybe they'll cite my little digital thingy somewhere. All right? I want to be able to publish a numerical simulation or an algorithm or a workflow: "Given the Sloan Survey and this survey and that survey, this is how we discover [inaudible] quasars." So there is no result. All right? You know, it's not even code. Right? So that's what I think we need. The only reason why I said new journal is because the old journals are, I think, hopelessly wedded to the paper paradigm. >> : Okay. We got a zillion [inaudible]. So we're going to -- We'll start in the front and move backwards. So Alex, Matthew, [Inaudible]. >> Alexander Gray: I'll just throw out a couple things. So in a field like mine, which is about methods, we have an issue that everyone feels uncomfortable about, which is... >> : Oh, sorry. [Inaudible].
>> Alexander Gray: ...you can read a paper about a fancy method or algorithm, but then when you try to implement it, all the details aren't there, or you don't know if it's -- or the experiments that they showed aren't really reproducible. You don't have the data set. You don't have the code. Which gets to the issue of publishing code and data sets and ideas all at the same time. In fact, it should be a requirement that you can't publish an algorithm unless the code is there so it can be verified. So we don't have reproducibility in methods, basically, in the methods part of the world. And the same would be true here if we opened up astronomical publishing to methods more. So that's one thing. The other thing is [inaudible] idea about Astro-Ph. So what if there wasn't -- Instead of having a green-light, yes-or-no acceptance model like we have in publishing today, we simply have a continuous, you know, like you said, a Digg-style thing where you put something up there, it happens to be in my area, I read it? The one problem we have in computer science is getting enough of the experts' time as reviewers. You get some reviewers, but almost always you don't have the right people, because they're busy or whatever. But you can get their attention on the topics that they're really interested in. So, you know, publish something. I go, "Oh, I like that topic. I want to comment on that," and I'll rate it in the four areas or whatever: novelty, depth of experimental results and so on. And then it has my name on it. It's not anonymous. And so if I'm well known, then that has more weight. If I'm not as well known, it has maybe less weight. And so that creates a kind of formal point system for, you know, [inaudible] it just becomes a... >> : [Inaudible]... >> : Sorry. Can we -- Let's move up the -- We'll get to you in a second. >> : So I would like to suggest that some of this is actually already being solved by the broader scientific community. There's a little journal called Science that some of you may have heard of, and there are -- I can think of four papers that I've downloaded from Science in the last year which are specifically about methodologies, new methodologies for doing informatics-type analysis. And you have the paper, which is the summary of the method, and then there's a thing called the SOM, the "supporting online material," which has more in-depth descriptions of the algorithms, greater details, attached data sets, attached results; sometimes there's attached code as well. And it strikes me that a large chunk of what's being suggested has already been solved by a major journal out there, and that's the sort of thing you want to be looking at. >> : I think that -- I'm sorry. >> : [Inaudible] reaction, but that's still different from what George says, where you're going to publish fragments not... >> : No, I agree, but part of it has already been done, for when you're trying to say, "I don't have the -- I can't reproduce your results," or, "I want to know more about this," or where the code is attached to it. >> : That's optional, right? So very few people do. >> : The four papers that I've looked at in depth have got it, and it was very useful to have the information. But you make it a stipulation that if you're publishing, that's what you're going to be doing. >> : Yeah, I agree with this part, that we should start doing that more: add code to it, add a data set to it and make it reproducible, so other people don't have to come back to us and say, "How did you do this?
I am not getting the same [inaudible] with the same code and the same data set." They should be able to do that. And so that I [inaudible]. To what George said about being able to publish a paragraph and so on: what's wrong with doing something like that on the existing things like Astro-Ph, for instance? Who might need something? And some people do that. >> : [Inaudible] does it. >> : No, but there's no connection to formal points, basically. You can post whatever you want, but, you know, I think if we just establish a way that it just shows you some points... >> : No, but then are we asking about policing it? And then who would do it? >> : You've got several different issues in there. >> George Djorgovski: Yeah. >> Yan Xu: Right. >> : But, okay, Pepe, Norman, Nick, Eric. >> : Absolutely, I agree with what [inaudible] said. The problem is that -- Eric, I mean, you are just going in the wrong direction, in my opinion. I know I'll start a fight. But I think the ApJ [inaudible] are exactly the wrong way to go. It's the wrong way to go because basically they cost a lot of money. For reasons which are no longer understandable, it seems most of the subscriptions are to the electronic journal and, therefore, there is no longer the huge cost of paper that there was in the past. They are just a system of [inaudible] power, when nowadays we see a world which has moved toward consensus platforms like Facebook [inaudible]. So it is just a matter of doing things intelligently, because to assign a referee -- I've done it a few times in my life, assigning a referee to a paper. You basically look at the citations. You find the paper which is cited first or most often inside the paper. [Inaudible], "Oh, yeah, their group." And you send the paper to that guy. I mean, it's basically a system of [inaudible] power which is self-[inaudible] itself when we acknowledge [inaudible] different solution. I don't see why there must be Monthly Notices, the Astrophysical Journal or Astronomy and Computing [inaudible] when basically you have systems of keywords which electronically allow you to pick out the articles in which you are interested. If you wanted to have a more automatic system of reviewing, the choice of referee done with a simple algorithm, you know, by selected members of the community, it would be [inaudible]; today you can get the traditional refereeing, or you can get the refereeing done altogether by the community, like the one which [inaudible] mentioned, and so on. I think it's just obsolete. An approach like this -- Astro-Ph obviously, since the guy is paranoid about these things, it will not be Astro-Ph; it will be something different, something ResearchGate-like, which is something I like very much. Then you can publish programs. Not like what happens very often, where you say the program is downloadable, and then for two years you cannot download it for some reason. The program must be downloadable, and then, you know, you can download the data. I think it's that simple. I don't see where the problem is, besides the political problem of the mafia of the various journals. >> : [Inaudible]. >> : I'm very much enjoying this discussion because over the last year and over this afternoon it's made me more and more convinced that getting involved with this journal was the right thing to do. So I have [inaudible] points here. The question about why can't people publish small additions, like a table, like a paragraph? They can already.
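A minimal sketch, in Python, of the non-anonymous, reputation-weighted point system Gray floated a moment ago. The four rating areas and the reviewer weights here are hypothetical illustrations, not anything the panel specified:

    # Sketch of a reputation-weighted review score (hypothetical scheme).
    # Each signed review rates a posting in four areas; each reviewer
    # carries a weight reflecting how well known they are.

    AREAS = ("novelty", "depth", "experiments", "clarity")  # hypothetical areas

    def aggregate(reviews):
        """reviews: list of (reviewer_weight, {area: score on a 1-5 scale})."""
        total_weight = sum(w for w, _ in reviews)
        if total_weight == 0:
            return {}
        return {
            area: sum(w * scores[area] for w, scores in reviews) / total_weight
            for area in AREAS
        }

    # A well-known reviewer (weight 3.0) and a less-known one (weight 1.0):
    reviews = [
        (3.0, {"novelty": 4, "depth": 5, "experiments": 3, "clarity": 4}),
        (1.0, {"novelty": 2, "depth": 3, "experiments": 2, "clarity": 5}),
    ]
    print(aggregate(reviews))  # weighted mean per area

The point of the weighting is exactly what Gray describes: a signed review from a well-known expert moves the score more than one from an unknown reviewer.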
They can do that now, and that model would work if people could get professional credit for it, if people could list these in annual reviews, in CVs. That would work. Now there's absolutely nothing stopping that except politics. It's true that journals are dead. They were dead ten years ago. That was obvious. But they have manifested nonstop twitching. And I see no reason to expect they'll suddenly stop twitching in the next decade, because it's unintelligible that they've lasted this long, so it's unintelligible to claim that they'll suddenly stop. And all your discussions about arXiv overlay journals, about all your new models for publishing -- I've heard all of these for the last ten years in bars, cafes, lunch breaks at conferences, and nothing has happened. >> : Our generation must die. >> : Our generation must die, fine. [ Audience comments and laughter ensue and continue in the background ] >> : And this is a social problem. It's a social problem. [ Multiple audience comments continue in the background ] >> : [Inaudible]. It's Eric next, but first we'll try -- does the panel want to respond to anything? >> George Djorgovski: Well, now I think the discussion is moving in the right direction. I think maybe we can clarify our thinking by parsing the great problem into several smaller pieces. One of the reasons we have journals at all is the archival nature of it, that somebody has signed up to perpetuate this stuff forever. And usually it's some professional society or a really successful commercial house like Nature, right? And so that's a whole other thing. Somebody needs to really make that commitment and pay for the upkeep. Now as far as we, the contributors, are concerned, there is the issue of the broad variety of types of contributions, which we really haven't addressed properly. There is the issue of quality control or peer review, which we may be able to crowdsource: each paper can have a little Wiki associated with it, and so all of the reviews, minus the obscene ones to be removed, would be there forever. And then there is a separate issue of giving people credit for this new variety of publications, and so there has to be some agreed-upon standard way. You know, it could be [inaudible], something that people can put in their publication list, that they can be cited as, you know, "As such-and-such suggested in this electronic thingy," you know, or "We used the data or the program or the workflow from," again giving the electronic reference. And those would count. So then that really means that if enough professional societies decide this is the way to go, then the Institute for Scientific Information will just have to learn how to count those. It can count downloads. It can make, you know, a linear combination of a number of things: number of downloads, positive reviews, negative reviews, whatever. But that's a different story. >> : Sorry. We've got a whole queue of people. Do either of you guys want to add anything to that? >> Yan Xu: I just want to share how I feel. Yeah, I agree that we have these kinds of discussions over and over again. I was showing the book that we did in 2007. I was thinking of when I was ten years old, I was talking to a friend of mine, and she had started collecting stamps when she was five. So I remember she had that stamp book, a beautiful stamp book with dragon stamps and stamps from the UK and from other countries. I was thinking, "Oh, I'm already five years later than she is, so probably not a good idea for me to start."
If I had started, you know, back then, I'd already have, I don't know, who knows how many beautiful collections of stamps. So I don't care what kind of thing we're talking about, as long as we can kick off something. You know, [inaudible] that we can collect ideas and then get it going and then revise it; that'd be a great outcome of this conference. >> : Mike? >> Michael Kurtz: All right, to the point of money. It costs about five dollars for something to go up on arXiv, and it costs about two thousand for something to come into one of the reputable journals. That money isn't wasted. So you have to decide what parts of it you want. I won't go any further than that, except that the money isn't wasted. The whole refereeing and formatting and long-term saving of it is part of what that money goes for. The journals traditionally have not been the ones to archive the journals. It's been libraries. Libraries have basically collapsed. So the long-term archival keeping of the journals is only newly and temporarily in the hands of the journals, or handed off to some place like Portico. It's not clear that there's a long-term solution for electronic journals, but there is no other long-term solution. Paper is not the way anymore, clearly. I guess that's enough. >> : Okay. [Inaudible]... >> George Djorgovski: Just a footnote to this: I mean, I've been talking a lot to librarians, and they are a very astute community paying a lot of attention to these issues. So there are a lot of clever people who are really thinking about how to do that. >> Michael Kurtz: That's a good discussion [inaudible]. >> : Eric, Nick, Joe, Alex. >> : I just want to -- This is very light. It's just an anecdote, and I thought, Pepe, you would enjoy it. Hearing about the mafia of the journals from you, from [inaudible], is very interesting to me. So now I want to tell you who the mafia are in some cases. I'll try to disguise it a little bit. There's a senior editor who's been trying to make an innovation that, by the way, I think Science and Nature already have: to publish the numbers underneath graphs, you know, associated with graphs. Okay? And this is an obvious innovation. It's quite easy for us to do technically. And it's been unsuccessful because it's been stymied by the higher-level bosses, called the AAS Publication Board, who are elected and tend to be very ignorant of the fantastically interesting issues that the people in this room are actually quite knowledgeable about in many ways. So I just want you to know that it's not always the public face that you think who's the enemy. It's sometimes someone else. [ Various inaudible audience comments ] >> : Pogo said, "We have met the enemy and he is us." Yes. >> : Just a couple of quick things. So it's not perfect, but you could certainly imagine on your job application replacing "List of Publications" with "List of DOIs" -- some of which are publications, but it encompasses all the other stuff like code and whatever -- as a way to compromise between being able to show everything you did and actually having something you can get credit for. The other thing is more of a question: there's an initiative by Peter Coles in England called "The Open Journal of Astrophysics," and the motivation of that was basically to let astronomers make their own journal. I don't know much about it, but does anybody have any comments on that? >> : It won't go anywhere. >> : [Inaudible] having it be free and not having it be open. It should be called "The Free Journal of Astrophysics." >> : Doesn't go anywhere.
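An aside on the "List of DOIs" suggestion above: a minimal sketch, assuming the public doi.org content-negotiation service behaves as documented, of turning a list of DOIs (papers, data sets, code alike) into readable citations. The DOI below is a placeholder, not a real reference:

    # Sketch: render a "List of DOIs" as formatted citations via doi.org
    # content negotiation; works for anything with a DOI, whatever its type.
    import urllib.request

    def format_doi(doi, style="apa"):
        req = urllib.request.Request(
            "https://doi.org/" + doi,
            headers={"Accept": "text/x-bibliography; style=%s" % style},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8").strip()

    # Hypothetical CV: one DOI per line, publications and other objects mixed.
    for doi in ["10.NNNN/example-doi"]:  # placeholder; substitute a real DOI
        print(format_doi(doi))

Because DOIs resolve the same way regardless of what they point at, such a list could carry code and data alongside papers without any change to the format.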
>> : Are you saying to [inaudible] or --? >> : No one is using it [inaudible]. >> : Okay. >> : Well, it only started a couple of months ago. >> : Yeah, but no, no, no. [Inaudible]. >> Michael Kurtz: Yeah, how is it different from arXiv? >> : Well, it's a journal. >> Michael Kurtz: Ah, right. >> Yan Xu: Right. >> Michael Kurtz: All right, journals produce articles which people look up in ADS, just like arXiv. That's how people really use it. So --. >> George Djorgovski: I cite arXiv papers all the time. >> Michael Kurtz: Yeah. >> George Djorgovski: And, you know, why not? They're there. >> Yan Xu: ADS goes and searches out all that. >> Michael Kurtz: Yeah, ADS makes the match anyway. So --. >> : [Inaudible]... >> : [Inaudible]. >> : When you're applying for a grant, at least in Australia, you actually have to list citations and impact factors, and unfortunately Astro-Ph is no good for that. arXiv's no good for that. >> : Joe, you're up next. >> Michael Kurtz: Could ADS help you by creating an impact factor for arXiv? It's easily... >> : Yeah. >> Michael Kurtz: ...calculated. [ Various audience comments ensue and continue through ] >> Michael Kurtz: I'll do that. I can probably do that, though. >> George Djorgovski: This may be the best outcome of this conference. [ Various audience comments ensue ] >> : [Inaudible] enormous impact factor if you did that. >> Yan Xu: Yeah. >> : [Inaudible]. >> Yan Xu: And then we will have [inaudible]... >> : We'd have all the money. >> Yan Xu: ...so Lee Dirks economic search? >> : Yeah. >> Yan Xu: [Inaudible] as well. >> : Oh, okay. >> : I may be repeating what others have said, but I sort of think 90% of this problem has probably been solved, in that we already have large tables in journals that are not printed. They're -- You know, you get the first ten lines or whatever. And if you really are interested in a table of data, you go someplace else and you grab it. It's not obvious to me why we can't have appendices that are not printed. There's a one-line description that says if you'd really like to see the code, go here. Or if you'd really like to see this movie or set of movies or extracts of simulations -- Okay, it's not going to be in the paper part; it's got to be somehow stored elsewhere -- but I think that the technology and the capabilities are sort of almost there with some of our major journals. And to provide the sort of curmudgeonly viewpoint here as well: George did hit on this archival notion. There is an important aspect that we shouldn't forget, that at times you do want to be able to go back to read the paper from 1970 or 1950; and at some point, people are going to want to look to see what we were doing in 2012. So there is an important archival aspect to this as well that is not free and is going to be very challenging. >> George Djorgovski: While that mic is moving, let me point out that what you just described says that all of this new content is a minor subsidiary to the traditional paper. What I'm pointing out is that there are new types of content that are good in their own right. If somebody wants to publish code, they should be able to publish just that code without the paper that goes with it. You know? That kind of thing. >> : Okay. Two quick comments here, then [inaudible]. >> : I think what you're saying is not complete, because he said exactly that one paper on arXiv costs five dollars, one paper in a journal two thousand [inaudible]. Look, these journals do not live only off subscriptions.
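Picking up Kurtz's earlier aside that an impact factor for arXiv would be "easily calculated": a minimal sketch of the standard two-year impact factor. All counts below are made up for illustration; in practice the citation data would come from something like ADS:

    # Sketch: the standard two-year impact factor for year Y is
    #   citations received in Y to items posted in Y-1 and Y-2,
    #   divided by the number of items posted in Y-1 and Y-2.

    def impact_factor(citations_in_year, items_posted, year):
        cites = citations_in_year[year]  # citations in `year` to prior two years
        items = items_posted[year - 1] + items_posted[year - 2]
        return cites / items

    # Made-up numbers for illustration only:
    citations_in_year = {2012: 43000}          # 2012 cites to 2010-2011 postings
    items_posted = {2010: 7000, 2011: 7500}    # postings per year
    print(round(impact_factor(citations_in_year, items_posted, 2012), 2))  # 2.97

The only nontrivial part is attributing each citation to the cited item's posting year, which is exactly the matching ADS already does between preprints and published papers.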
These journals live on contributions from many organizations. Astronomy & Astrophysics in Europe is largely paid for by the governments. They don't live off subscriptions. And I think the same is true for Monthly Notices [inaudible]. So if you just moved [inaudible] a small fraction of what is currently paid by the governments to maintain the paper journals, it would cover all the expenses for the running operation and for the archives. >> Michael Kurtz: But that two thousand... >> : [Inaudible]... >> Michael Kurtz: ...has nothing to do with paper. The paper is almost free. All the rest is the cost. >> : A very quick, almost point of information: A&C will, for example, support adding supplementary material to an article. And I think that's not the only journal that would do that. And... >> : But first, after you set the science [inaudible]... >> : Yeah, exactly. And so the only thing that's stopping the vision -- one of the only things that's stopping it -- is the willingness of editorial boards to do something different with what they believe is a good article, an adequate article, an adequate contribution. And some editorial boards will be more conservative. Some editorial boards, e.g. this one, are still at the phase of asking, "Okay, what should this article look like?" That was part of the motivation of this. It wasn't just a journal; it's to ask the question, what does an article in this area look like? It doesn't have to be sections one to four. It can be anything. >> Michael Kurtz: I think, just briefly, if you have your software in the astronomy software library -- the poster is over there -- it is indexed in ADS. >> : Right. >> : Yeah. Yeah. >> : On George's point of why not just publish a code, you know, the counterargument to that is, of course, the code is no good if you're not telling me why you're publishing it. >> : Yeah. >> : I mean, seriously, think about it: could you go through a code, strip out all the comments, and publish that? And would that have any value to anybody? >> George Djorgovski: That's a bogus example. >> : I don't think George [inaudible].... >> George Djorgovski: That's not what I'm saying. >> : Exactly. So you need some kind of description about it, even if it's only a paragraph: "This code is going to do X, Y, Z." >> George Djorgovski: Sure. That's not... >> : That's not the point. The point is, do you want to write a full scientific paper... >> George Djorgovski: Yeah, exactly. >> : ...to describe what the manual... >> Yan Xu: It would be reproducible. >> : ...what the manual describes. >> Alexander Gray: I just want to throw out one data point from another field, which is a success story of how we upended the paradigm. So in machine learning, roughly ten years ago, the dominant journal was called the Machine Learning Journal. But they didn't let you own the -- They owned the copyright and, therefore, you couldn't put the paper on your webpage. They were actually sending people nasty messages saying, "You can't have that on your webpage." So the machine learning community -- You know, this gets to that difficult issue: you're trying to do something new, moving to a different paradigm, and there's the reputation issue. That was the only place to publish if you wanted to have a reputable paper. So what happened, the way we resolved that, was that the leader of our field, basically the most famous guy, started a new journal. And so you need -- Basically, once the famous people are all there -- And now this new journal is the main journal in our field, JMLR.
So you need the well-known, famous people to all be behind it, and then overnight it can actually have the reputation. So it's not hopeless. >> : Okay. We're coming towards the end, so -- Anybody else [inaudible]? Then I'll hand it back to the panel for some [inaudible]. Okay, one, two, three, four and then the panel. >> : So regarding publishing of code, there already are some [inaudible] archives: R has CRAN, the Comprehensive R Archive Network, as well as CPAN, which is the [inaudible] of [inaudible]. So these are typically individual programs, well packaged, with documentation, with manual pages and everything. So would there be any sense for the astronomical ones to have their corresponding [inaudible] apply to that? Also, like the poster that is there, they have a good set of rules: what you can put in, how you can start, about the code [inaudible]. That is also pending. When we are talking about a new journal, what is it exactly that we are saying? >> Yan Xu: A set of rules, yeah. >> : How are these different from that? So can we somehow leverage these into what we are calling a [inaudible] journal and so on? >> : Okay. Pass the mic. >> : This is a journal comment and everything, but just coming from the point of view of someone at an earlier career stage, I would like to kind of remind everyone that almost everyone who's been discussing this so far is someone in a permanent position. >> : Oh no. >> : Hmm? Okay. >> : [Inaudible]. >> : Okay. I think that many people in this room are discussing this from the point of view of permanent positions.... >> : [Inaudible]... >> : Okay. I think that meant... >> : ...[inaudible]. >> : Okay. So because it hasn't been explicitly said: although I'm philosophically very much in favor of everything we've discussed, I'm somewhat nervous about all this as well, because I worry that when going for a permanent job -- I know that there are many people who are not as broad-minded as the people in this discussion who, even if it works better than a traditional model, will slap down a ruler and say, "You only have this many ApJ papers. We're not going to consider this." And for younger people, I think that's an enormous consideration; even if it's a better method, there is pressure to put it in an ApJ article or [inaudible]. >> : Let them [inaudible]. If you had an established journal which was recognized -- and not just [inaudible] thinks it's got to be recognized. >> Yan Xu: But on the other hand, if you join it now, when it turns into something established, this journal could be one of your credits. Like my virtual stamp -- I mean, my dream of a virtual stamp book. Right? If I had started ten years ago -- I mean, when I was ten years old -- I would have this [inaudible]. >> : But if we don't publish now in the old, good journals, we're not going to get tenure. That's the point. If we try to publish in that journal -- I would like to publish in... >> Yan Xu: No, maybe not now. Yeah. >> : Most astronomers would look at that and say, "What is this?" >> : And that's okay. >> : "What are you talking about?" >> : But that's not a reason for not starting it, though. >> George Djorgovski: It's a separate problem, as I said. There is the problem of delivery. There is the problem of quality control. There is the problem of giving credit. All of those are big problems. >> : Yeah, I think it's very complex, and it's probably worth looking at the sociology of all this, because in a certain way it's a reshuffling of power.
I mean, you have new methods, but the power of decision making changes much more slowly than the methods do. So basically it's absolutely right to hear students who will be skeptical about these new methods unless the powers that be are sensitive to these changes. So the point is to try to look at these new systems and somehow create the right incentives for people, so that there is recognition and it moves science in a better way. I mean, I think in a certain way that's already happening with information. I mean, we Google and some algorithm is making the decision about what we look into. So it is beginning to happen; it's affecting the power of decision making and the way that information flows. I think it's a very complex sociological problem that we should give very deep thought. I don't think we have the answers, but we do have the right problem that should be addressed very seriously. >> : [Inaudible]. >> : [Inaudible] students are never skeptical about the new [inaudible]; they are skeptical about [inaudible]. And that's a completely different thing, because -- So [inaudible] because there is no other way. I mean, we cannot think that we can continue [inaudible] shelves of libraries which, you know, cannot even allocate any more space for these things. This thing is going to happen. Now the problem is to understand how. I just wanted to comment on the [inaudible] of publishing the programs. [Inaudible] meeting organized by Eduardo. There was a fantastic talk by a guy from a [inaudible] processing community who solved this problem many years ago. We just need to look at what others are doing. They have been publishing programs for many years, with the [inaudible], with the comparison, with the documentation. There are standards for publishing programs which just need to be adopted and not, as we always want to do, to reinvent the wheel and this type of thing. Many communities have already done it. Can I make a suggestion? I think that this discussion is very interesting and very useful. If we can keep all the [inaudible] to the discussion -- we have a Facebook page, why don't we use it to put... >> : Because Facebook [inaudible]. >> : Sorry? [ Audience laughing ] >> : [Inaudible]. Really, we have this Facebook page where every one of the people who are going to be doing the discussion [inaudible]. Okay, not the contributions, so that we can try to reconstruct. Because in the end, if I'm not wrong, everything comes down to building a proper business model, because I'm sure that if we point out that using electronic rather than paper publishing can save a strangled, choked university billions of dollars -- like, my university spends 1.9 million euros for the [inaudible] in physics; it's all in subscriptions -- a model based on electronic publishing would save precious research money [inaudible]. >> : [Inaudible]... >> : So fix... >> : It won't save any money. [ Various audience comments ensue and continue through ] >> Yan Xu: The thing is... >> : It certainly won't save any money. >> Yan Xu: Yeah, I want to say that... >> : I think so. >> Yan Xu: ...Facebook or e-mail or any mechanism -- I want to throw this question to the audience: what would be the minimum we can do to avoid having the same discussion next year? >> George Djorgovski: It seems to me that we need to write a requirements document for the new model of scientific publishing which will outline all of the problems that we have mentioned here. And then that can guide discussion about possible solutions.
Some things may have been solved already; others might require a completely new approach. Others may be unsolvable, or solvable only by old people dying off. Right? But I think a useful product of this discussion -- it doesn't have to be done in one day -- would be that people who really care about this get together in some virtual forum and write this requirements document, if you will, for, you know, a future model of scientific publishing. And it may not come to anything, but at least we would have the problem defined in a way that's approachable. >> : Yeah. Yeah, [inaudible]... >> Michael Kurtz: There's a thing called Force 11, that Lee Dirks was actually part of, that looks at that for all the sciences. >> : Can I cut off your discussion right now? We're going to have one last comment from the floor and then you three can sum up or whatever you want to. >> : I just wanted to say this has been a problem across a lot of the disciplines, and to make the comment that there are people who have been working on issues of credit and metrics and impact factors for journals, sort of alternate methods for accounting for that. And one of those groups is altmetrics.org, and there's another group that just launched called Total Impact. And they come out of the gene sequencing world. But if you're concerned about that, you may want to look into those two groups. >> : Okay. Thanks. So it's up to the panel to sum up all proposals [inaudible]. >> Michael Kurtz: Me? >> Yan Xu: Yeah. >> Michael Kurtz: All right, first, the metrics are being worked on a lot. How to do nano-publishing and get credit for it is a typical topic of conversation at a general meeting on the future of scientific publishing. I was at three of them last year; they happen all the time. Force 11 is a buzzword. Type it in and find the web page. They'll have another conference some time next year. I guess the metrics of use and citation are the most useful ones. I have a new citation metric on Astro-Ph this morning, so I have to say the citations really are still very useful. We're developing new publication methods all the time. That's what HEASARC is. That's what the Space Telescope Archive is. It's what arXiv is. It's basically what CDS is. It's what ADS is. It's what the Astronomy Software Library is. The Astronomy Software Library is linked to by ADS just the same way that the space telescope measurements are linked to through ADS: through papers, through paragraph descriptions. People think in words, so describing things in words is probably the least common denominator to get everything all working together. Otherwise, you don't really need to invent that much stuff. You have to use a lot of the stuff that's there, up to nanopublishing, which is not a solved problem anywhere. There's no real reason why astronomy would solve it. >> Yan Xu: Right. I just wanted to repeat the question that I just threw out: I wanted to find out what would be the minimum we can do? And I volunteer to, you know, be your assistant to make that happen, because the success of my job is not defined by the number of publications with my name as the first author in a physics journal. That would be a failure at Microsoft. I'm supposed to assist you to do better science. So you let me know and take advantage of what we can do for you from here. And give me suggestions. >> George Djorgovski: It seems to me that we're fairly ignorant of the serious thought that has gone into this in other fields.
And maybe what we need is a one-day workshop where we can get mutually informed with people from other areas and then construct a requirements document and start thinking, "What's already in hand?" because some things may require only minor tweaks, as Michael just described. And I suspect it will boil down to the sociological change of getting used to giving people credit for things for which they're not getting credit now. >> Michael Kurtz: That is the major issue. It's tenure decisions. It's old farts who don't view having software that's used in ten different scientific papers as being as important as writing one of those scientific papers. It's the same problem with instrumentalists, who also build things that other people use rather than things they use themselves. That's a real, serious problem for pretty much everybody in this room. >> Yan Xu: So we should find a venue where we can reconvene ourselves; perhaps E-Science in Chicago would be a --? >> George Djorgovski: It's too soon. >> Yan Xu: Too soon? >> George Djorgovski: We have to, you know, organize this and think about it. >> Yan Xu: I don't mean to be too pushy, but we have to make it happen. >> : Can I suggest actually... >> George Djorgovski: Be pushy. >> : ...we do continue this conversation online somehow, perhaps just a Wiki rather than Facebook, but whatever suits you best. >> Yan Xu: Absolutely. >> George Djorgovski: Yeah. >> Michael Kurtz: Things you put on Facebook are broadcast to other places. >> : Yeah, I think... >> Michael Kurtz: [Inaudible]... >> : ...a Wiki format is [inaudible]. >> : [Inaudible]... >> : Let's discuss it on the Wiki. Okay, let's thank the panel. It's been a great discussion. Thank you... >> Yan Xu: Thank you, Ray. >> : ...panel. >> George Djorgovski: Thank you, Ray. [ Audience applause ]