>> Yan Xu: Let's get started again. In the second half of today's morning session, we have four contributed talks, 20 minutes each, including questions. And the first speaker is David Hogg. He's going to talk about going beyond map-reduce and going beyond maximum-likelihood. >> David Hogg: Thanks. Okay. So this is a slightly odd talk because it's not results, it's not even a proposal, it's a statement that where we're currently going is, I think, in the wrong direction, and it's a challenge to the people in the audience. We've got good people in the audience, so really I am pitching a project to the world and hope somebody picks it up. So there's this map-reduce framework, and I will say what that is, although I actually think most people in this room know it better than me, so it's a little embarrassing if I say anything about it, but I'm going to say something about it. There's this map-reduce framework, and it does tasks in log N time when you have huge data, and so it's really important to data centers and to data analysis and to all big data companies on the Web. And we can do things like maximum-likelihood analyses in frameworks that look a lot like map-reduce, though not exactly. Oh. And, by the way, when I say map-reduce, I don't mean anybody's product; I mean these operations of map and reduce, which I'll say in a minute. I'm not talking about MapReduce as a commercial product. The bad news is that we cannot do the things we need to do in the next generation of astronomical surveys in map-reduce. This will not work. We need to do something different. And we don't know how to do these -- the things we need to do at scale. We don't know how to do things in less than polynomial or exponential time in the amount of data we have. And people -- there are people on these various projects here, and they might disagree with me, but I think they have a limited view of what their data analysis will be. I think if you really think about what you want your data analysis to look like, you can't work -- you won't be able to do log N operations. So this is a call to arms. But I might point out that if you solve any of these problems, you might get fabulously rich. So it's not just an astronomy problem, of course, because lots of people -- lots of people are stuck with scale. Okay. So I have various collaborators. Bovy was a student of mine. Brewer is a statistician at Auckland who's very good with sampling and Bayesian things. Fergus is a computer vision faculty member at NYU. Foreman-Mackey is my student right now. Goodman is an applied mathematician at NYU who works on applied math relating to data analysis. And Lang, who was my student, is now a postdoc at CMU and works on astronomical imaging. Okay. So this quotation at the top here: We won't even consider any algorithms that can't be written in the map-reduce framework. I've heard that said with small modifications by people at Google who say they will not run things inside the house that are not map-reduce and also at Microsoft -- I mean at IBM. The Watson team and those people say we won't do anything that isn't map-reduce. And the reason, of course, is they need to be -- they think of Google searching the whole Web, Watson also searching the whole Web in a different way. They need the -- the amount of data is so large you have to work in log N. You can't work slower than log N. So what do I mean by map-reduce? What I mean is that you split your data analysis into these two steps. 
The first is a map step where you run the same operation on every piece of data. And then the next is a reduce step where you then compare pairwise all the outputs that you got from the map step. So this is the way all log N algorithms look, because you have a tree, you do some operation on the leaves of the tree, and then you feed the information back up through the tree. And if you have N things, N leaves in your tree, then you have log base 2 of N levels of branches. And so you can do operations in this form in log N time because each of your leaves is sitting itself on a computer. You have a datacenter, you have your data all over the datacenter. Every datum you have, every piece of data you have, is sitting on a CPU. So you can do all the map operations locally, and then reduce just goes up the tree. So that's the map-reduce. I'm explaining it badly, but that's because it's not really my business. In fact, as I'm about to say, none of the problems I do -- I work on actually fit in map-reduce. But the way to think about this is -- like the standard thing is Web search. You search every document locally for the word kittens. And then if it has kittens, you return the Document ID and the PageRank. And then in the reduce step you're comparing all the PageRanks you're seeing. And so you can return the top-PageRanked page that mentions kittens in log N time and you've searched the whole Internet. Okay. Good. It's brilliant, but it's also a huge opportunity. That statement at the top means there's a huge opportunity for somebody who wants to get rich. I am not that person. But all the reasons that we can't work in map-reduce are also reasons that lots of interesting companies can't work in map-reduce. Okay. Good. So one -- here's an example of something you can do more or less in map-reduce. It's not exactly map-reduce but it's the same kind of structure. Say I have a likelihood function -- do I have a little pointer? Ah, yes, I do. Does that work? Yeah. So say I have a likelihood function that looks like this. I want the probability of my full dataset given some parameters. This is a standard thing for astronomy. And also standard is imagine the data are independent. So these are many -- say these are observations of quasars, tons and tons of quasars you've observed here. Each quasar has a probability given the parameters of your model. But you really want to set the parameters of your model using all the data so it's this massive product over all your objects. And if you think forward to -- I'll talk about the scale, but this could easily be tens of billions of astronomical objects here. So if you want to find the maximum with respect to theta, you want to do maximum-likelihood, it's really straightforward. In the map step for each data point you compute the derivative with respect to the parameters, and then in the reduce step you just sum up those derivatives. And then you can go uphill. And, okay, you actually have to -- if you want to go to maximum-likelihood, you actually have to run some algorithm, some optimization algorithm like conjugate gradient or BFGS or something. But each iteration in that only takes log N time because in each iteration you can compute the derivative in this map-reduce way. So this is a very nice problem. You can do it all in map-reduce. So if astronomers are happy with maximum-likelihood and that's all we need forever, we're fine. We're absolutely fine. Of course I'm going to say that's not true. 
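To make that map/reduce structure concrete, here is a minimal sketch in Python (no particular framework; the one-parameter Gaussian toy model, the fixed step size, and all names are illustrative assumptions for this sketch, not anything from the talk):

```python
import numpy as np

def map_step(datum, theta):
    """Map: per-datum gradient of the log-likelihood.
    Toy model (an assumption for this sketch): each datum is a noisy
    measurement x of a single parameter theta[0] with known sigma."""
    x, sigma = datum
    return np.array([(x - theta[0]) / sigma**2])

def reduce_step(partial_grads):
    """Reduce: sum the per-datum gradients; addition is associative,
    so this combine can run up a tree in roughly log N depth."""
    return np.sum(partial_grads, axis=0)

def maximum_likelihood(data, theta0, step=1e-5, n_iter=200):
    """Each iteration is one map (embarrassingly parallel, runs where
    the data live) plus one reduce, then a small central update."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        grads = [map_step(d, theta) for d in data]   # map
        grad = reduce_step(grads)                    # reduce
        theta += step * grad                         # plain gradient ascent here;
                                                     # a real run would hand `grad`
                                                     # to conjugate gradient or BFGS
    return theta

# toy usage: 10,000 noisy measurements of one quantity
rng = np.random.default_rng(0)
data = [(x, 1.0) for x in rng.normal(3.0, 1.0, size=10_000)]
print(maximum_likelihood(data, theta0=[0.0]))   # converges near 3.0
```

The per-datum gradients are the local map work; the sum is the reduce that can be combined up a tree, which is why each optimizer iteration scales like log N on a datacenter.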
So what is -- just to get the scale, the scale of astronomy is not nearly as large as the scale of other industries. So I don't actually think of astronomy as being all that large scale. I spin a large fraction of all of the astronomical data that's ever been taken just in New York, so it's not -- we're not talking about Microsoft scale here. But it's big enough that we care about scaling. So if you think forward to LSST -- so the two projects I'm going to kind of talk about as my examples are LSST and Gaia. And you might not know Gaia. It's less known here in North America than in Europe. But in -- both of these are at a substantial scale that matters. So in LSST there's something like 10 to the 15 pixels -- I'm right about that, right, Jelco [phonetic]? -- and 10 to the 10 galaxies-ish depending on how deep it goes and what cuts you make. But it's this scale. And what we want to do -- well, there's many projects you want to do with LSST. It's an absolutely amazing source of science. And I won't summarize. I'm only going to talk about weak lensing because it's a good example to discuss. In weak lensing what you want to do is you want to measure the shear map, the distortions of all those galaxies because of gravitational structure formation; so because the mass map is inhomogeneous, there's this inhomogeneous lensing. So you want to measure this inhomogeneous lensing and then from that lensing you want to get the cosmological parameters. That is one of the most ambitious projects in all of astronomy. And we have to succeed or LSST was not such a good idea. So we are going to succeed, but it's going to hurt, as I'm going to tell you in a second. And then Gaia, Gaia is a smaller scale, but the inference problem is even harder in Gaia, because in Gaia what you're doing is measuring precisely the stars in the Milky Way, measuring their positions and their velocities, and then you're trying to infer the nonlinear dynamics of the Milky Way from the positions and velocities of the stars. So it's a much more nonlinear problem. So it's actually -- even though the scale is smaller, the problem is harder. You want to understand the dynamics of the Milky Way, so the mass model and the orbit structure of the Milky Way, but unfortunately you have this distribution function of stars. Stars are not distributed on the orbits democratically, some orbits are more populated than others, so there's a big distribution function problem in here. And the thing that's going to kill us and the reason that map-reduce isn't going to work and the reason why we cannot -- we've got a really big problem coming up is that this shear map and this distribution function and probably also this dynamics are going to be nonparametric. It has to be nonparametric. The reason it has to be nonparametric is we don't actually know what -- we don't know how to parameterize these functions properly. And what do I mean by nonparametric? So I mean the strict sense of nonparametric which is that the model gets bigger as the data get bigger. So as you get more and more data, you increase the number of parameters. So your model scale is rising with your data scale and you can -- everybody in the room already sees why that's going to be a huge problem. And the reason that all these maps and distribution functions and all these nuisance parameters are going to be nonparametric is that as you observe more and more galaxies, you want to increase the angular resolution of your map. Duh. 
Like that's why we're taking more galaxies, so that we can get more resolution. So everything important in astronomy is really nonparametric in some sense, everything internally in astronomy. I'm actually going to say at the high level we want to get parameters. But in the middle -- I'm going to show you some graphical models in a second. Okay. Good. And importantly -- this is an important issue, but I'm not going to talk about it so much -- nonparametric models are never inferred at high signal-to-noise. Why not? Because once you have high signal-to-noise, the model grows, you get -- add more parameters, and then you're back down to medium signal-to-noise again. You never have high signal-to-noise in a nonparametric model. So, by the way, so I'm going to show you some graphical models just to set the stage. I don't know if you've seen graphical models. Not everybody in here probably uses graphical models. So if I have a process where I have cosmological parameters which produce a shear map which produces observations, you can think of this as a -- showing the causal relationship that the parameters create the shear map which create the observations, but you can also see this as a description of a probability statement; that the probability of the whole set of all the variables I care about is the product of the probability of the original cosmological parameters, the probability of this shear map given the parameters, and then the probabilities of all these observations given the shear map. See, it's a way of breaking down the causal structure or the probability structure of the problem. So I'm not going to -- if you don't know probabilistic graphical models, learn them, because they're incredibly valuable for thinking about data analysis. But now I'm just going to use them as causal structure to look at the problems. So this is the Gaia problem. In Gaia, Gaia is measuring a billion positions of stars, and these are six-dimensional positions. They're position and velocity. So Gaia measures a billion six-dimensional positions. Where did those positions come from? Well, the star positions it measures are related to the true star positions, and then the true star positions are created by the dynamical model of the Milky Way and a distribution function. So these are produced by two things: one thing we care deeply about, which is what is the structure of the Milky Way; and one we don't care nearly as much about, which is about the happenstance of what orbits happen to be populated. Now, some astronomers would actually reverse those and say the other one's more interesting. But from my perspective, these are the interesting parameters; these are the uninteresting parameters. But this is a huge nonparametric model, and this might be a parametric model or nonparametric, but this is a huge nonparametric model which is affecting my observations and I'm going to have to marginalize this out if I want to learn these parameters. That's what's going to hurt. And of course there's also noise coming in and there's a noise model. So there's a noise model generating uncertainties which are also affecting the observations. So you see how this is like a causal model for the Gaia data. And that's the kind of problem we work on. Here's another one. Here's a slightly worse one. This is the weak lensing model. So in weak lensing you have cosmological parameters. These are the cosmological parameters of the density of the universe, the initial conditions, sigma-8, all those things. They produce a shear map. 
This is a huge nonparametric object; that is, our shear map is a function of redshift and angle. On the other side over here before we come into the box, over here there is some distribution of galaxy properties. Galaxies have complicated shapes on the sky, and that generates the true shape of the galaxy. So the true shapes of the galaxies are generated by some horrible process that I certainly don't care about, although I should care about it, and then -- but what you observe is not the true shapes, you observe shapes that have been affected by this distortion map. So once again this has the problem -- this has this character that there's this big nasty nonparametric model here affecting the observations, and there's a big nonparametric model here affecting the observations, but all I care about is this. I just want to know what are the cosmological parameters. See, that's the -- this is the structure of essentially all problems in astronomy. Actually, I could argue that this is the structure of all problems in science. I could even go further. But I'll stop at science. I think all quantitative data analysis problems have structures that look like this, and in particular the thing you care about is separated from the thing you observe by one hell of a lot of nuisance parameters. >>: What's XN? >> David Hogg: That's the position of the galaxy, because in the shear map you have to know the position of the galaxy to predict its shear. Even this is overly simplified. The positions of the galaxies are also created by the cosmological model. And in principle the cosmological model makes this galaxy formation model which produces the shapes of the galaxies. See? So I've really left out -- this is already a massive oversimplification of the true situation. But this is more sophisticated than the analysis people currently do of weak lensing. So this would already be a step in the hard direction. Okay. So just in case somebody thinks that we can solve all of our problems with Gaussians, you can't. I don't think I'm even going to say much about this, but the fact is you never understand your noise properties as precisely as you think you do, and your data are produced by a mixture of processes. Whenever you have a data point -- people were just talking about it at the end of the last talk. When you have a data point, you don't know whether that's a correct data point or an incorrect data point, so your data are produced by a mixture of correct and incorrect things. And it can be a lot worse than that. But at the very least it's that bad. And once you're that bad, every data point produces multimodal things because there's this classification inside your data point that allows you to jump to two different locations, bad data, good data, and so you generically get multimodal things. You also get very broad support in parameter space in the sense that a data point, if it's true, is telling you something very useful, but if it isn't true, it has broad support. And so it's very hard to approximate the likelihoods you're getting from individual data points with very compact functions. Okay. Good. So now I'm to the final point I want to make. This is the main point I want to make. Bayesian inference cannot be done in this map-reduce framework. And you might think it is because it looks hella like map-reduce. And I've seen a lot of glib statements. It really looks like map-reduce. 
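For reference, the factorized form on the slide he is pointing at is the standard one, and the marginalization he turns to next integrates the nonparametric nuisance fields out of the joint posterior (the notation below is generic, not copied from the slide):

```latex
% Posterior over parameters \theta from N independent data d_n:
p(\theta \mid D) \;\propto\; p(\theta)\,\prod_{n=1}^{N} p(d_n \mid \theta)

% Marginalizing a nonparametric nuisance field \gamma (e.g. a shear map
% or a distribution function) out of the joint posterior:
p(\theta \mid D) \;=\; \int p(\theta, \gamma \mid D)\, d\gamma
```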
But because you have all the data, whose likelihoods are multiplied together, and you have a prior, so this is like a Bayesian, standard Bayesian inference, you -- you can compute these local likelihood functions. You can compute these all locally and then only multiply them together in the reduce step. So it looks really good. You can locally produce your likelihood functions and then on the reduce step you bring the likelihood functions together and you multiply them together. It's perfect. In fact, you don't even have to be a Bayesian. Good. You don't have to be a Bayesian. This is a really -- you can just think of the likelihood analysis. You can do this -- you can think, oh, good, I can do this all with map-reduce. But you can't. You cannot pass forward those functions. Theta is a very high-dimensional object. The functions are multimodal. The support is broad, and these nonparametrics mean that this might even have a variable number of parameters that's growing with the data. So no individual node knows enough about the whole dataset to know how much detail it has to pass forward in its likelihood function. You have to know a huge amount about the whole dataset to say with what fidelity you need to pass forward this function. So there's -- basically this fails. And I say but that's not all because the other problem is marginalization. We also have to do a massive marginalization, and there's no way to marginalize out -- if all I care about is this and all I'm observing is this, I have to marginalize out this and I have to marginalize out this. These are big nonparametric functions. And, furthermore, their marginalization depends on all the data. This -- after you've observed this, the posterior on this depends on all the data and you have to then integrate over that, so there's no way to do that locally. Of course it's necessary. If you don't do that, your constraints on the cosmological parameters will be wrong. One of the things about weak lensing, one of the reasons weak lensing is such a good example, is you never measure this at high signal-to-noise, but you really care about your cosmological parameters. So if you don't do this marginalization, you will fail. There are enormous nonparametric Bayesian inferences that have been done. How did people do them? I just said you can't do them. It takes exponential time or something. I didn't tell you what time it takes. It takes of order N cubed if you just naively write it down. Not log N, but more like N cubed. So how have people successfully done this in the past? The way they've done them in the past is by carefully choosing priors that make the inferences tractable. We are not allowed to do this. Why not? Because we are scientists and our prior information has to contain our prior knowledge, and our prior knowledge won't be in conjugate form. We actually have nontrivial prior information. So we can't do this. So these Bayesian nonparametrics, if you look at Bayesian nonparametrics, they're almost useless to us, unfortunately. So my view is that the word Bayesian is becoming a bad word. My approach? I don't have an approach. That's my approach. But I think this is the issue. I think the next generation of surveys will fail if we don't solve some of these problems. I have to say the applied mathematicians and the computer vision people know a lot of really great things about these kinds of problems and we should be working with them more. All my collaborators -- if you notice, more than half of my collaborators in this were not astronomers. 
That's been extremely valuable to us. And there are people in this room of course who have also drunk that Kool-Aid. In my day job, just to say I'm not a total bullshitter, I do do big models. Right now we're building a 10 to the 9 parameter model of the SDSS pixels as a kind of baby step towards LSST. We are getting some Bayesian nonparametrics working but with real priors, priors that represent our knowledge. We have one of the best adaptive MCMC samplers around. If you need some MCMC sampling, we have -- both Brewer and Foreman-Mackey are my experts. We have done full marginalization of a nonparametric distribution function, but of course we -- in this problem we only had eight stars, not 80 million stars. We've done huge models of data, data-driven models and made large predictions, and we're big believers in working with missing data and heterogeneous data. And I'm done. [applause] >> Yan Xu: Okay. Questions? >>: Two comments. Firstly, there's this new stuff coming out of Google now called Dremel. I mean, we all know the MapReduce version that came from Google. Google have done away with that, now they're using this new framework called Dremel which is supposed to be beyond MapReduce, scale [inaudible] scale problems. Can you comment on -- have you looked at that or does that solve something -- >> David Hogg: No. So I haven't looked at it, but I'm almost certain that they still need log N. >>: Yeah. >> David Hogg: And so nobody is doing a marginalization, basically, in log N time. Because I think it's -- so I haven't looked at it is my basic answer, but I very much doubt it will solve these problems. This will be a real breakthrough because if people can solve the problem of passing forward the posterior probability through some log N thing, that would really be game changing, like new companies will appear the next day. But I have to say there is -- there is one respect in which people have done this in log N time, and you could do, and that is that -- I forget what the name of it is -- there's a technical name for it, but it's where you build only very, very low-parameterized models of the posterior and pass those forward only. And if you can work in that approximation, you're good. I'm just saying in astronomy we can't work in that approximation. But, anyway, yes. >>: My other comment was going to be that people have translated Bayesian networks into map-reduce. >> David Hogg: That's correct. >>: Where -- >> David Hogg: Yeah. >>: And that does work. >> David Hogg: Yes. And that is exactly in this form. That's like this Deep Learning. So Deep Learning is one of these nonparametric Bayesian things. So in Deep Learning you have many layers in like a neural network. And you design that so that all the edges in your graph are analytically marginalizable. They're very beautiful models, actually, to be fair. And they're extremely high performing. Like many of the best computer vision algorithms use these Deep Learning networks now. And Deep Learning -- like there's now on-chip Deep Learning. It's amazing. But it really doesn't work for us. It works really well in the world, but it doesn't work for us because our priors are so specific. Like our priors on that shear map are very specific. So we're just not allowed to use these tractable priors. >> Yan Xu: Okay. In the back. >>: I really enjoyed your talk and I agreed with most of what you said, but there was one statement that made me almost fall off my chair. >> David Hogg: Uh-oh. 
>>: You said that if we don't solve this problem, as you phrased it -- >> David Hogg: Yeah. >>: -- that the future surveys will fail. >> David Hogg: Yeah. >>: Which I think is a gross overstatement. I do agree with you that without sophisticated nonparametric models we might not be able to extract the entire information [inaudible] surveys like LSST, but even using contemporary methods, and you'll definitely have better methods a decade or two from now, we can still extract a significant fraction of the information in these surveys, and I don't know quantitatively how much better you would do with nonparametric models, but I do know for a fact that even with standard methods we will get enough information content that we cannot claim these surveys will fail. >> David Hogg: Good. So let me -- let me be a little bit more specific. LSST without solving this problem will meet its goals. But in my view, that is not sufficient because we're spending a lot of money, and we should be, and -- and the -- the current plan with LSS- -- I mean, if you do not solve this problem, you're in the -- you're in the state of classical statistical estimators and all the argument about bias and variance and so on. So most of the weak lensing literature right now is people arguing about minor differences in how they estimate shears to look for unbiasedness. And it -- as the survey grows, the sensitivity to those biases and variances will grow. So it's not obvious to me -- so I agree that LSST will meet its goals without solving this problem. It will meet its stated goals without solving this problem. But it is possible that there are systematic problems with all the point estimates people have for this problem such that LSST will drift into a systematics-dominated position on the weak lensing. >>: [inaudible] that the complexity grows linearly with the size of the survey, that this is perhaps true in nonparametric models. But, for example, if you just use the extreme [inaudible] as you increase your sample size, the number of components that result does not increase linearly with the data [inaudible]. >> David Hogg: That's true. One thing that might help a lot is it is possible that the complexity of these models might only grow as like the log of the data size. And in that case that might save us. That might save us. It's still the case, though, that this marginalization is very hard. The marginalization is still hard, even if it gets -- because the marginalization itself is an expensive operation. But it is true that these may only grow as the log of the size of the data. That's not a known scaling yet. That's right. >> Yan Xu: I would like to take more questions, but time is running out, so please ask him during the lunchtime. So I'd like to move on to the next talk. So before that let's thank the speaker again. [applause] >> Yan Xu: Our next talk is by -- what's your first name? >> Nebojsa Jojic: Nebojsa. >> Yan Xu: -- Nebojsa Jojic. His talk is titled "Epitomes and Counting grids: Discovering patterns by (re)arranging feature counts." >> Nebojsa Jojic: Okay. So I'll talk about a very simple model. And since you guys are astronomers and looking at images all the time, maybe this is not as new to you as it was in the computer vision community many years ago. So I'll talk about a very old model that I introduced many years ago and then an extension that's more recent. These models have been used also in other applications. We've used them for [inaudible] design, we've used them for viral load prediction and so on. 
Now, the basic idea is very simple. You will very soon see why I chose to talk about this particular area, which is not actually what I'm spending most of my time on today -- I mean these days -- but it may be most relevant to your problems. So the idea is very simple, and I talked about it briefly the last time we had a similar workshop several years ago. Let's say you have an image, my mother's poodle, for example, and let's say that I want to somehow summarize the texture of this image, the repeating patterns in this image. And the way I would do that is to take at random lots of different patches of different sizes from this image and then try to assemble them back, but I want to assemble them back into something that's much smaller than the original image. If I'm able to do that, then obviously I will have to somehow squish all that texture, so instead of having all these flowers in the background I'll have only a couple. Which is actually what happens here. I have a couple of different types of flowers, a little bit of area here for the gravel and some area for the fur and so on, which means that I've kind of learned the structure of the image, I know what the self-similar parts in the image are. Right? I can now do segmentation for -- for example, of the image by segmenting this out here in the epitome and transferring that back by inverting the process, finding out what this area came from in these images and then -- in these patches and then where these patches came from in the original image and then segment out the gravel, for instance. So, again, the idea is that if I take all these patches, if I just look at the patches here, I can't really reconstruct the entire image back without knowing where they came from, but I can see enough common structure that I can learn a squished texture. Now, how would I learn that? Well, the approach is very simple. I would start with some guess. And this is what we call the epitome, this little squished texture. So I'll start with some guess of that epitome which is basically just the average color in the image with some noise to break the symmetry. And then I'll take all those patches, see if I can map them there, and they'll initially map equally well or equally badly to all parts of the epitome, but some parts are going to map a little better. And then I will count all the times that I've hit every pixel in the epitome with these patches and then average those pixels based on how often they've hit there, taking all the probabilities into account. And then when I do that I'll in the next iteration get something like this where you can see a little breakup of a little whitish part to the left and right and black in the middle. The reason why you have whitish to the left and right is because the epitome is actually defined on a torus, so the left and right connect and the top and bottom connect. So then if I keep iterating this, after a few iterations, in here I'm showing 16 iterations, you get this texture that I've shown in the first slide. So that's basically the algorithm. Now, the question, of course, is what do I mean by mapping the patch into the epitome. I could introduce some kind of geometric shears or intensity manipulations, or rotations and shifts and so on, or I could just map the patch as it was. And of course if I do all these transformations, then I have to keep track of them when I'm reestimating the epitomes. And then the other question is am I always going to get the same epitome? 
Obviously not because I have this random initialization, so I'm going to get different epitomes. But they'll have a qualitatively very similar appearance. So to demonstrate that, here's the epitome of Microsoft Research as it was, I don't know, maybe -- this, I don't know, seven, eight years ago. Different building, as you can see. So the epitome here has a lot of these elements, they're window elements because there's a lot of them in the image here, and it has a few faces as well. And probably most of these faces map to either this one or this one or maybe this area here or the area there. But if I rerun this, I'll get a slightly different epitome. So here I'm kind of zooming in to this one. But qualitatively it's very similar. Again, I have some face-like thing here. I have a face-like thing here as well. And in this area I've put red and green on either side of the nose of the face that appears there, and then I map back from the epitome through the patches into the original image to see which areas contributed to this area. And I get a face detector from just one image of Microsoft Research. It doesn't work like the best face detector, but, still, it's just one algorithm that does it, a simple one, a simple iterative thing. If I do an epitome of many different faces, then I get this weird facial texture. It's bizarre. But if I now highlight, for example, these three areas here because I see a little smile here, I see a closed eye here and I see some dark hair there, so if I pick these three things and I look back for all the images that map there, whose patches map there, I get these guys who have dark hair and tend to squint when they smile. If I instead choose this other type of an eye, I get these guys who actually have their eyes open when they smile. So this is where the vision comes in, right, the computer vision. And there's a lot of other applications that I don't have time to talk about. Epitomes were mostly used by computer graphics and computer vision people, but they've also been used in SIM design. But here is the thing that I've always found most exciting about epitomes but I've never been able to apply them this way. Here's an image with a lot of structure. Do you see the structure? You can't quite see the structure. And if this was some scientific image, you would just assume that there isn't nothing -- there isn't anything there, you wouldn't model it at all. But when you run the epitome on it, it actually finds that there are a lot of 45-degree lines this way and that way, and that's because that image really did have that, but it was obstructed with lots of noise. So the idea here is that if you run this algorithm, since the epitome is taking all these patches from everywhere, trying to match them and then average them, as you repeat this the obscured patterns arise. Now, this is an artificial example because you never find images that are this corrupted. Here's an example where -- that's the closest. It's an electron microscopy example. But in this case the image is actually controlled. They have very fine control over the microscope so it doesn't move too much because the noise is humongous, so then a video is taken instead of just one frame. So here are these videos. And then you average them. But even a slight shaking of the microscope creates a little bit of blur. So if you use the epitome to actually compute the shape there and reconstruct it, you get much sharper boundaries, but also you get structures that were obscured before. 
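A minimal sketch of the iterative procedure he describes, for a single-channel image, translation-only mappings on a toroidal epitome, and soft responsibilities (the fixed match temperature `beta`, the constant-variance Gaussian scoring, and the naive loops are my simplifying assumptions, not details from the talk; a real implementation would also model per-pixel variances and use FFT-based correlations):

```python
import numpy as np

def learn_epitome(patches, E_h, E_w, n_iter=16, beta=10.0, rng=None):
    """patches: list of (p, p) grayscale arrays; returns an (E_h, E_w) epitome.
    The epitome is toroidal: patch placements wrap around both edges."""
    rng = rng or np.random.default_rng(0)
    p = patches[0].shape[0]
    # start from the mean intensity plus a little noise to break symmetry
    E = np.full((E_h, E_w), np.mean(patches)) + 0.01 * rng.standard_normal((E_h, E_w))
    for _ in range(n_iter):
        num = np.zeros_like(E)
        den = np.zeros_like(E)
        for patch in patches:
            # E-step: score every placement (i, j) on the torus
            scores = np.empty((E_h, E_w))
            for i in range(E_h):
                for j in range(E_w):
                    win = E[np.arange(i, i + p) % E_h][:, np.arange(j, j + p) % E_w]
                    scores[i, j] = -beta * np.sum((patch - win) ** 2)
            r = np.exp(scores - scores.max())
            r /= r.sum()                      # soft responsibility of each placement
            # M-step accumulation: spread the patch into the epitome, weighted by r
            for i in range(E_h):
                for j in range(E_w):
                    rows = np.arange(i, i + p) % E_h
                    cols = np.arange(j, j + p) % E_w
                    num[np.ix_(rows, cols)] += r[i, j] * patch
                    den[np.ix_(rows, cols)] += r[i, j]
        E = num / np.maximum(den, 1e-12)      # average by how often each pixel was hit
    return E
```

This is just the "map patches, count the hits on every epitome pixel, re-average, repeat" loop from the talk, written out; the random initialization is why reruns give different but qualitatively similar epitomes.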
But as I was saying, the more interesting example would be if you can actually have something like this where the patterns appear all over the place, you have some huge amount of data coming, you can't see them, you don't know where they are, they're obscured by noise, but then by averaging -- repeated alignment and averaging you reconstruct the pattern. Another example where these things could work well is if you have missing observations altogether. So in this case I'm talking about the 3D data, a video sequence, and the idea is that data was missing. For example, if the data is missing every other pixel, that's equivalent to trying to actually -- trying to reconstruct it, that's equivalent to the InGaP sampling for super-resolution, increasing the resolution of the image. If the data is missing in a rectangle like that -- I mean a solid like this, then you're trying to fill in the data there. There's something missing completely. If it's missing in different frames, then you're trying to interpolate the frames. And I'm going to show an example which doesn't really exist in practice, but, again, to emphasize where these algorithms would have the most impressive use, here the data is missing at random. There are a bunch of different places where the data is missing and it's replaced by random noise. And then if you use this algorithm to reconstruct what's there, you get this. So I'm going to pause it here. And when I run it again you will probably see this car moving around, but you'll definitely not see the pedestrian. So I'll play it a couple of times. So when it comes to science, I'm thinking that a lot of times the first thing we want to do is to look at the data, and if we don't see anything at first, we're not even going to run the algorithms on it. And some of these algorithms are better at finding patterns than we are. In computer vision that's not the case typically because computer vision algorithms can't beat our vision because we've been optimized for it. But in science there may be an application. Now, now I'm going to switch to this other model which we call counting grids. And it's a relative of epitomes. In the epitomes we were using matrices, ordered datasets. They could be ordered sets -- each data point is actually a set of measurements, but these measurements are coming on a grid. So either it's a two-dimensional matrix or three-dimensional matrix. So there is some ordering there. In the counting grids, on the other hand, we have disordered feature sets. So, for example, if you imagine all those patches I was showing you before and you don't give me the patch itself, you give me a histogram of colors in the patch, and that's all I have. Now the question is can I still reconstruct the image. In both of these cases, of course, the number and type of channels is arbitrary. We don't have to have just RGB measurements. It can be features of any kind, measurements in different spectral -- in different spectral domains and so on. Now, where do we have these histogram-like data points? So here's one example. In cellular immunity, what we have is that the cell expresses all kinds of proteins. We heard about it yesterday from David. So there are all kinds of proteins being expressed and they're working in the cell, they're regular, normal proteins, but sometimes you get viral proteins as well. And for the cell to actually figure out that something is wrong it's -- there are some mechanisms and it can try to fix that. 
But these viral proteins look a lot like normal cellular proteins. It's very difficult for a cell to self-diagnose. So then there is this more global approach to this where the cells, actually, as they're expressing these proteins, they express a lot of copies of the same protein, but a certain small percentage of these proteins are going to be chopped up and little pieces are going to be presented on the cell, on the surface of the cell, and then these little pieces are now visible to the outside of the cell. And then the immune system actually monitors through the use of T-cells, specialized surveillance cells; they monitor all the cells around, and if a certain pattern that seems abnormal keeps repeating, then the immune system might go and start killing cells that have a certain signature on the surface. So what does that signature look like? Well, it's basically, as I was showing here, a bag of epitopes, a bag of these little peptides as presented by MHC molecules. So in the end, because we don't know of any ordering of these things on the surface, they just come out on the surface in certain spots, there is no particular ordering, really what we have is that we have two of this type and one of this type and, I don't know, two of this type and so on. So it's basically a count of different features that are present on the surface rather than the organization of them in some two-dimensional, three-dimensional, whatever, structure. So in the end what we have is a histogram. We have features, in this case 20-something features. And we have feature counts for each of those. And this type of data is present in various different applications. For example, documents on the Web are represented by a bag of words rather than the document most of the time, which is counts of how many words of different types you have. In computer vision actually they've started doing this because there's a lot of problems with aligning things, so they tried to instead find very informative features and that [inaudible], and then a similar thing obviously happened in the case of immune surveillance as you saw before. So what's the model for this, the counting grid? Well, we start with a peaky distribution because, just like in the epitome, we're going to have a grid, and in each position we're going to have, instead of one color or a sharp distribution of colors, a sharp distribution of features. So we start with a peaky distribution, it goes into a particular spot in the counting grid, and the counting grid is going to have a lot of these different distributions, all peaky in comparison to what the data is going to have. And then the idea is you take a window, you put all these counts together, and that's what you're going to see in your data. So that's the [inaudible] of the data. And what that means is that if, for example, it's a 3-by-3 patch that you're taking from the counting grid, then your histogram in this particular example looks like what's shown. But then if I move the patch a little bit, then I have some features disappear, those that are on the left over there, and others are included. So you have this smooth transition of feature counts. So then this is the summary of what the generative model looks like. And then of course the learning algorithm is going to be similar to what I talked about before except that now not only do you have to map these bags onto the grid but you have to take the individual features and map them inside the windows where these bags are mapped. 
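A minimal sketch of that generative step, assuming a toroidal grid of per-cell feature distributions and a square window (all shapes, names, and the Dirichlet initialization are illustrative assumptions; learning would then alternate mapping each bag to a window and re-averaging the grid, roughly as in the epitome sketch above):

```python
import numpy as np

def window_histogram(grid, i, j, w):
    """grid: (H, W, F) array; grid[y, x] is a distribution over F features.
    Returns the normalized feature distribution of the w-by-w window at (i, j),
    with toroidal wraparound as in the epitome."""
    H, W, _ = grid.shape
    rows = np.arange(i, i + w) % H
    cols = np.arange(j, j + w) % W
    h = grid[np.ix_(rows, cols)].sum(axis=(0, 1))
    return h / h.sum()

def sample_bag(grid, i, j, w, n_tokens, rng):
    """Draw a bag (a feature-count vector) of n_tokens from that window."""
    return rng.multinomial(n_tokens, window_histogram(grid, i, j, w))

# toy usage: a 30x30 grid over 20 features with peaky per-cell distributions
rng = np.random.default_rng(0)
grid = rng.dirichlet(0.05 * np.ones(20), size=(30, 30))
bag = sample_bag(grid, i=4, j=17, w=5, n_tokens=50, rng=rng)
print(bag)   # sliding the window (i, j) shifts these counts smoothly
```

Sliding the window changes which cells contribute, which is exactly the smooth transition of feature counts he describes.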
So you can iterate that and get something. So here's an example, a synthetic example, where we are pretending that a computer vision algorithm somehow found features that are very informative. So it can label things in the image as being mountain, sky, grass, building, roof, and so on and so on. And then if you take one of these windows, you just get a distribution of features. So in this case, for example, there is a little bit of a roof but a lot of the mountain and the grass, for example, and a little bit of a sky. So now the question is, can I just use these histograms to infer things, or, if I can't from the histograms alone, how about if I take each one of these patches, break it into four and get four histograms for these different parts -- would that be enough for me to reason out what the original layout of all these things was? And I'm using this not because it's useful as it is, but because it can give us the idea of how much is possible. Is it possible from histograms to infer the spatial arrangement of features? It turns out it is. In this case of a 2-by-2 tessellation we get this as a reconstruction, whereas the original looked like that. It's pretty close. And here's another example where I'm working with colors. I took this image, discretized it into 64 features, so I have, I don't know, one feature would be blue, another feature would be another hue of blue and so on. And so I have 64 of them. And then in this case I'm trying to learn the counting grid and, of course, I'm showing in each cell of the counting grid what's the most likely feature. And in this case I'm trying to just learn it from histograms alone. And as you can see after a few iterations I start getting something that captures some of the spatial structure, but it's kind of -- it's difficult to get the spatial arrangement right. I do get some neighborhood -- I do infer some neighborhood information, certain features tend to border some other features, but I don't get a very good structure. But if I do a 2-by-2 tessellation so that each patch is represented by four histograms for the four quadrants, then I actually do reconstruct the original image. And then if I do a 16-by-16 tessellation, since the patches themselves were 16 by 16, I'm actually reducing this to the epitome algorithm and I reconstruct the whole thing. I will skip the application here and I'll just tell you how you can use it now for analyzing the viral load in HIV patients. So as I said here, we're going to model these bags of epitopes using this model. And so once all these -- once we have learned the model using the bags of epitopes, which are actually inferred based on the immune type of the patient and their sequences, so we know what's presented on the surface, now we can take every patient and put it somewhere in the counting grid. Now, let's assume this whole screen is the counting grid. Now, this particular patient maps into this window and I just put blue here because they had a low viral load and this other patient had a high viral load, this one low, another one with a low viral load and so on. If I keep on placing them like this, hopefully the high and low viral load areas are going to emerge and I can do that again by averaging viral load into these windows. And when I actually do that, this is what I get. And this is with various different parameters of the size of the counting grid and the patch that I'm using for embedding, because now I have to choose how big my patch is. 
The actual histogram of features doesn't come with any idea of how big the patch is, but I'm pretending here it's a 22-by-22 patch size and here it's 11 by 11. So I'm arranging all the features and then figuring out the counting grid. And this is an example of a 3D version of the same thing because I don't have to embed it in a 22-by-22 patch, I can embed it into a 10-by-10-by-10 patch, so I can use an arbitrary embedding. It turns out actually that the 3D embedding was the best. And so here's after a few iterations you can see how the features kind of -- how the viral load starts separating. And it turned out it works actually better than other techniques we have tried to use for viral load prediction. This is an example that I'm not going to talk about of using this to model text. This is a model of old NIPS papers. And here's the summary. I already told you what the difference between epitomes and counting grids is, but the opportunity I think is of discovering patterns in seemingly complete noise, under the assumption that there is a pattern that repeats a lot but is misaligned, so you can't use Fourier analysis to find the repeating patterns. And this is that example that I gave you before. So that's it from me. Thanks. [applause] >> Yan Xu: Okay [inaudible]. >>: Do you think [inaudible] should be used for doing super-resolution imaging, like, for example, if you have a video of a car going by [inaudible] license plate? >> Nebojsa Jojic: Yeah, we've done that actually. This will -- we've done it on grass waving in the wind. And it works way better than traditional techniques. But the artifacts are very weird. But the sharpness is way higher. I can show you that offline. Yeah. Um-hmm. >>: [inaudible] applied this to astronomical images? >> Nebojsa Jojic: Um-hmm. >>: [inaudible] galaxy images? >> Nebojsa Jojic: I don't know. I have no idea. I don't know much about galaxy images. So that's why I actually showed it like this. So I'm hoping that you guys can tell me, you can come up with some applications. >>: What is your text application? >> Nebojsa Jojic: The text application is -- the one we've -- we've tried a couple different ones. One was this NIPS dataset where we have all the published NIPS papers with all the words in them. And then if you now try to infer what the arrangement of these words would be within a little 2D patch so that it belongs to one of these -- a large counting grid, so there is a smooth transition and so on, you get essentially very interesting areas occurring. For example, computer vision is sort of surrounded by graphical models, neural networks, and neuroscience because -- not surprisingly. And then reinforcement learning is bordered by neural networks, which also border HMMs because of the time models, and the support vector machines. And actually when you move from one -- when you look at the documents, you can clearly see how this smooth transition is happening. It's very interesting. It would also be interesting to look at this if you're trying to choose your area of research in NIPS. You're going to see where the boundary is not populated too much. So that's one example. The other example is the news, especially the -- the -- the news, because there is this transition, there are multiple topics that are evolving over time. So you really see how words are dropped and new words are put in. But they don't necessarily follow a one-dimensional structure but a multidimensional structure. 
So they capture, for example, CNN news better than the LDA models which are traditionally used for modeling bags of words. Yes. >> Yan Xu: Quick, quick. >>: Yeah, quick question. Do you have a generative model? In astronomy, if we applied it in astronomy, we often know a lot about our noise processes. Are there ways to modify the algorithm so that it -- >> Nebojsa Jojic: Oh, yeah. >>: -- will know about all the noise properties? >> Nebojsa Jojic: Yeah. Obviously. I mean, you could think of this as just a part of that graphical model you've shown, in places, especially the nonparametric parts for instance. But, yeah, you can definitely add all kinds of bells and whistles that depend on the prior knowledge. And, on the other hand, if you know almost nothing or you want to start fresh and you don't want to make presumptions, then you can learn nonparametrically, because this does end up being a nonparametric model; you would need larger epitomes if you have much more data. >> Yan Xu: Okay. Thank you very much. Let's move on to the next speaker, Giuseppe Longo, practical data mining services for astronomy. >> Giuseppe Longo: Well, I think that most people in the room know what I'm going to talk about since this has been the dominant theme of my talks for the last five years with many, many evolutions. And this is -- at some point, you know, you start changing the title, but the substance doesn't change much. We have -- I'm going to talk about data mining and about a project which has been going on for many years. It was a sort of pioneer in the context of the virtual observatory work that we do. At the beginning it was called [inaudible], then it became DAME, but -- which just stands for data mining and exploration. But some -- due to some delays in the implementation of the standards which we needed -- I mean, at the beginning we were all pioneers in the virtual observatory -- DAME has never become a real virtual observatory-compliant device, but I think that now the pieces are where they have to be, and the transformation will be done [inaudible]. DAME is a joint project between my university, the University of Naples, the California Institute of Technology and the Italian Institute of Astrophysics. I mean, the main collaborators over the years have been -- given the specificity of the Italian situation, most of those collaborators have emigrated and now they are in this room under different hats. But it has been a thing which has been going on for many, many years, let's say. What is DAME? Actually, our system manager is a fanatic about acronyms. So DAME -- what we used to call DAME, now it's DAMEWARE. DAME is a general platform where we are putting all the Web applications which we are -- which we have been implementing over the years. And as you can see there are quite a few things. I mean, now we have applications of text mining related to the globular clusters, so it's a Web application for updating all knowledge available on globular clusters. I'm not going to waste much time on those things. Now we have also opened a new section for the Euclid Mission Data Quality. Our group was in charge of this thing. But what matters is DAMEWARE. DAMEWARE is a [inaudible] Web application resource which is aimed mainly at data mining [inaudible]. Everyone from anywhere in the world can access the infrastructure through a browser because in fact it is a Web application [inaudible] infrastructure for data mining where the computing infrastructure is completely transparent to the user. 
I mean, the Web application decides where it is better to run the application as a function of the size of the data and the complexity of the problem. And we have a series of models which can perform classification, regression, clustering, feature extraction, which are the [inaudible] of data mining applications. These models are all in this list, which is quite complete, I'll say. We have many methods for regression and classification. We have some methods for unsupervised clustering. Others are coming in. And the next to be implemented, which are in the process of being implemented, are the Bayesian networks, the random decision forest, and another version of a multilayer perceptron. The -- I'm not -- actually, since I've spoken about this so many times, there are things on which I'm not going to waste much of your time. I just want to point out that beside it is -- it has been often said that we must find -- we must prove the utility of the work which has been done over the years with real scientific results. I never understood this question well. Why, to prove the utility of an infrastructure, do you need [inaudible]? This is something which is completely [inaudible]. Most of our life, most of our career, most of our work is about the simple contributions, I mean, about producing papers of honest, decent work where you just make a measurement better than others or you improve upon something. I don't know why for these types of astroinformatics infrastructures, let's say, we also want something more. So we made a point of trying -- I mean, we opened basically a call for collaboration with the community for good cases where to use the -- this platform, DAME, to solve problems with data mining techniques. So basically I don't know what my role is in this model -- as a data miner [inaudible] on a project or as an astronomer. I don't know anymore what I am, but -- and I think this is a problem which I never solved. But in the last few years, also due to the fact that there was a cut in funding overall in Italy, I mean, about 60 percent, we stopped improving the infrastructure. We couldn't do any more -- any development. We began to apply the existing infrastructure through a series of programs with results which, in my opinion, are quite satisfying since basically with minimal effort in all cases we have achieved results that are much better than what is available in the literature at the moment. And I think also with this model -- I mean, we have been working on massive datasets on the whole [inaudible] and in some cases on very small datasets, like, for instance, in this case where [inaudible] Puzia, a colleague from Chile, wanted to see whether it was easier to separate the globular clusters in external galaxies using data mining techniques. And we just applied the platform -- actually Thomas did, from Chile -- to the -- to the problem of NGC 1399. He took 11 parameters, so it's a small dataset, a small number of features, something like less than 3,000 objects [inaudible], it's okay for data miners. He used a multilayer perceptron with [inaudible] approximation for the classification [inaudible]. It's very effective in generalizing from a small base of knowledge. So it's ideal for this type of problem. And it performed -- something which is very important to remember when you take into account this type of analysis is the fact that one of the advantages is that you can perform feature extraction, so you can plug in as many feature parameters as you want. 
You then let the method itself evaluate which features are significant and which are not. So there was a first run of feature extraction where the network learned how to recognize globular clusters. So we have to stop for a second. What is the definition of a globular cluster? Usually globular clusters in external galaxies are defined using -- if there are high-resolution images available, like, for instance, HST, you can use a structural parameter plus color. The idea was to use a base of knowledge built upon HST knowledge to try to identify globular clusters using single-band HST, therefore without colors, plus colors from the ground, so mixing very low quality data from the point of view of [inaudible] resolution with the colors obtained from the ground. I mean, I'm not going to spend much time because it's quite complex actually. It takes me sitting at a table checking [inaudible] plots together. But basically here on the upper graph you see the selection performed with the traditional method, and here on the right you see the selection performed using the neural network, where you can see there is a much, much better [inaudible] between the objects identified as globular clusters and the selection box which is marked by this dashed line. Other example. We participated, with the simple tool I mentioned before -- I mean, it's used via a browser -- in the PHAT contest for photometric redshifts. This has been a large comparison [inaudible] three -- two years ago where basically all people who had implemented photometric redshift models applied their method to the same dataset. The results and the statistics were then evaluated independently, not by the team who was applying the method, but by a sort of evaluation committee, and, well, this is the result of the application of the same method. This is the most incredible thing. As always, you see, I made a choice. Among all possible cases I took all the results which were obtained with the same method, okay, with MLP with the quasi-Newton approximation. Well, and without doing anything. Just running the usual feature selection. In this case we had a quite large number of features and we had the problem of missing data. In fact, here "good" basically refers to the amount of missing data. There is no missing data in this band. And this band is quite good, which has [inaudible] missing data, down to "poor." So we just -- since we know that neural networks would be very bad with missing data, we just threw away these last two groups of features. Of the medium ones we used only one. We ran a few experiments of feature selection. We ended up with this combination. Without these features, without these features. This is the most successful. And then we ran on our training set to obtain our final result, and it's what you see here, against the other empirical methods of the PHAT contest. The QNA is our method. This is the table from the paper. I mean, QNA in the 18-band case, which for us actually was this one, without this band, because this really conveys no information whatsoever, and this is something which none of the other methods noted: these bands were just useless. If used, they increased computing time and decreased accuracy. So our results were in one case, the 18-band one, comparable with what you could obtain with another [inaudible]; in the other one they were by far the best. So even in problems which are -- I'm done. Really. It's really weak. Another example is this one. 
So even in problems which are -- I'm almost done, really, it's quick. Another example is this one. This is a paper which is coming out now, which was inspired by a paper by yours truly and [inaudible] applying [inaudible] to the various quasars. We applied our same method to a larger dataset, extracted from the overlap of GALEX, UKIDSS [inaudible], WISE, and SDSS for quasars, in order to see whether it was possible to obtain a good estimate of photometric redshifts for quasars -- and, again, this is something which you cannot do easily with other techniques. After a phase of feature extraction we found that E5 and E16 were by far the most effective. So basically WISE is completely useless as far as photometric redshift derivation is concerned, and also some ways of measuring the magnitudes are completely useless as well. What we found is this -- and I'm sure you would like to discuss it, because we do better than you. We go down to 0.101, with the percentage of outliers at whichever level you want: one sigma, two sigma, three sigma, four sigma. The decrease is very [inaudible]. And we are not using the same [inaudible] objects; we are using only 8,000 objects. So even without knowing the model, the physics behind it, just by using data mining techniques in a proper way, let's say, you can obtain very useful scientific results. Well, this may not be the killer app, but these methods are for sure extremely effective. So very quickly I just want to focus for a few seconds on what the problems are. Well, if some funding comes back to the Italian shores, this is the direction in which we intend to move, together with our colleagues at Caltech. As we have seen, each application requires hundreds and hundreds of runs, because you have to do feature evaluation, you have to choose the best algorithm, and so on. And most of these algorithms, especially on large datasets, are computationally intensive. So if, to do this, you have to move huge amounts of data, the thing is not going to work. So at the moment we are trying to define the standards to be able to move the Web application with the data. This is an idea which we already suggested a couple of years ago at the [inaudible] meeting, but due to the lack of funding we have not been able to proceed with it. The idea is exactly that: we want to have a common set of data mining tools shared by the data repositories, and basically to have interoperability not only of the data but also of the Web applications. So this is one thing. There are also other directions in which we are trying to move in collaboration with others. But really, my chairman is pointing out that time has run out. I just want to mention that another useful approach is to optimize the computing time of individual models. We are exploring the use of GPUs for specific models; we have done an implementation with genetic algorithms. But I think that the real problem is to find a way to standardize these Web applications in order to be able to move them with the data. Perfect timing. I'm done. [applause] >> Yan Xu: One or two questions? >>: The papers that you showed were mainly done by people who have been working directly on the development of DAME. >> Giuseppe Longo: Because this is part of the story, yes. This is -- >>: Are you starting to see more -- >> Giuseppe Longo: Well, Puzia, for instance, the colleague from Chile -- Puzia is the one behind the globular clusters, the one who started [inaudible].
The work on the photometric redshifts has now been moved inside [inaudible] because it will become one of the methods used -- you know, it's a slow recognition. I must also be honest: neural networks for photometric redshifts have been used for many, many years, so it's easier to accept something which is already common. So the fact that Euclid has accepted data mining methods in, let's say, that context makes sense. The problem is that there is the usual long-lasting question: most of the community sees these models as black boxes, and therefore there is a sort of skepticism about them. And the only way to beat it is just to show with papers that they work. So we are open to collaboration with everybody. So that's, in a few words, the history of five years -- or three of them, anyway. Thank you. [applause] >> Yan Xu: The next talk is the last talk this morning, by George. >> George Djorgovski: And I'll be accompanied by [inaudible] Ciro when the time comes. >> Yan Xu: Okay. >> George Djorgovski: So Jim brought up some really important points about the need for better visualization of data, more natural interfaces, and so on. I think technology is pushing us in that direction, and there are some very interesting things that we can be doing, so I'll show you a little bit of the stuff we've been doing. As probably some of you know, we've been experimenting with virtual worlds as a platform for scientific research collaboration as well as data visualization. And that's basically riding on technology that's driven by Moore's law and paid for by the commercial world. 3D video is coming thanks to the movies and other entertainment. Games are already virtual reality; all the digital natives have grown up with them and think nothing is more natural. And people are paying attention, including people like the chief technology officer of Intel, and we have a good collaboration with Intel Labs in Oregon on this. So this is something that is going to be as transformative as the Web itself was, and it behooves us to find out what we can do with it. So today I'll focus explicitly on some data visualization angles of it. There is already a lot of immersive, collaborative, interactive data visualization going on in virtual-world platforms of different sorts, and not just in astronomy but in other fields as well: mathematical objects, molecules, numerical simulations of globular clusters, numerical simulations of data mining in the Intel world that I couldn't bring up. And we already know how to encode up to maybe 10 dimensions in these pseudo-3D displays. One of the new things that our student Franz Sauer has done is that we can now click on individual data points and pull up Web pages, archives, or other links for follow-up information. So if you see an outlier, you can find out what it is and carry on from there. So this has been done using the so-called OpenSimulator technology, which has basically been the standard in this business so far; it's an open source version of the old Second Life, but that's pretty much dying. The new thing is the Unity 3D library, and we have a lot of development with that. Our student Alex Cioc has been doing this work, and Ciro will demonstrate some of it, but first let me take you through just a quick demo of what we have. So this is a real data display. The data are actually an 8-dimensional dataset on quasar spectral energy distributions from Gordon Richards et al.
What's shown here are the three XYZ axes, which are flux ratios -- I don't remember which ones -- and the colors and sizes of the data points encode two other physical parameters, which I again don't remember. And, you know, you can fly around this, and you can see that objects of different kinds partly separate. Something like this is really essential if you're going to do proper clustering analysis, because things don't look Gaussian, you have strange shapes and topology, and it can certainly guide proper data mining research. Now, of course you can have your colleagues join you in this, add the data, poke at it, plot new data, switch the axes, and things like that. The old one that I just have to show you -- this is the one I used for our car picture. This is again data from Sloan on stars, galaxies, and quasars, where the XYZ axes represent three colors, sizes represent [inaudible], shapes represent the kinds of objects, and this sample has actually been heavily biased towards high-redshift quasars; that's what these guys are. And you can see that you can really find out where the outliers are in a much more effective and intuitive fashion, and you can immerse yourself in the data with your colleagues and interact with them and so on. So, Ciro, do you want to demonstrate the DAME player now? >> Ciro Donalek: Yeah. >> George Djorgovski: This is done on a standalone platform, but also, more importantly, through the Web browser. So that's what Ciro is showing. >> Ciro Donalek: [inaudible] but we wanted to do something -- also something different: to be able to have a collaborative environment through the browser or even on your tablet. Yes, we can also make an app for the iPhone or iPad. So we tested a few game engines like ShiVa, but we ended up using Unity 3D because we really like the platform and we can program in JavaScript, C++, C# [inaudible]. So now this is the same dataset, and we're going to visualize [inaudible]. >> George Djorgovski: We should point out that in the old platform we can do thousands of data points, and with some effort maybe 150,000. >> Ciro Donalek: Yeah, [inaudible] 100,000 data points. >> George Djorgovski: Easily. >> Ciro Donalek: Yeah. >> George Djorgovski: Very fast. And we're working towards a million. >> Ciro Donalek: Yeah. If this takes -- okay. And -- okay. Don't pay attention to these [inaudible], because this is actually the fifth time we are showing it, and this is a development [inaudible]. I wasn't even aware [inaudible] to show this. So you can do a bunch of stuff. Okay. [inaudible] the data. Okay. [inaudible] you can click [inaudible] Web page. In our case each name is associated [inaudible]. And then we can go back. And we can choose to remove, like, clusters, or the cylinders or tubes -- cylinders and things like that. >> George Djorgovski: [inaudible] solo mode, but this is written to be a shared visualization with different partners at different locations. >> Ciro Donalek: Yeah. Actually I can give you the [inaudible] if you have a Mac or Windows, and you can connect to mine just through the IP address. But now you can connect, and I can run in broadcast mode right here, and so whatever I do, you will see. And we can also track. So I can remove my points and you will -- now, there is some limitation in this. That's about points are [inaudible]. >>: [inaudible]. >> Ciro Donalek: Huh? >>: [inaudible]. >> Ciro Donalek: No. Whoever is the -- whoever is [inaudible].
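For readers without the Unity viewer, the encoding idea can be imitated in a few lines. This is only a matplotlib stand-in, not the Caltech tool, and the CSV name and column names are hypothetical: three columns go on the XYZ axes, while further columns drive color, size, and marker shape, so a single 3D scatter carries six or seven dimensions.

```python
# Not the Caltech/Unity viewer -- a minimal matplotlib stand-in showing the
# same encoding idea: three columns on XYZ, further columns mapped to color,
# size, and marker shape. The CSV name and column names are hypothetical.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("sdss_sample.csv")      # columns: c1, c2, c3, mag, z, cluster
fig = plt.figure()
ax = fig.add_subplot(projection="3d")

markers = {0: "o", 1: "^", 2: "s"}       # one marker shape per object class
for cls, sub in df.groupby("cluster"):
    ax.scatter(sub["c1"], sub["c2"], sub["c3"],
               c=sub["z"],                                   # 4th dim: color
               s=10 + 40 * (sub["mag"].max() - sub["mag"]),  # 5th dim: size
               marker=markers.get(cls, "o"),                 # 6th dim: class
               cmap="viridis", alpha=0.7)

ax.set_xlabel("c1"); ax.set_ylabel("c2"); ax.set_zlabel("c3")
plt.show()
```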
And so I can release the button, and you can press the button and send them -- you can send the [inaudible]. >>: [inaudible] on the data [inaudible] I'm not allowed to make [inaudible]? >> Ciro Donalek: Yeah, yeah. I mean, if I am broadcasting and you want to broadcast, you just tell me: I push the button again to release my broadcast, and you can press your button, and then we're going to see what [inaudible] -- >> George Djorgovski: So collaborators can agree on who is going to be changing the data; it has to be done synchronously for all. And you can choose, however, to have your vantage point be the same as the person driving the visualization, or you can detach yourself and look from a different angle. >> Ciro Donalek: [inaudible] more things that give you [inaudible] you can see some information here, and it can be whatever you have stored. Right now I'm just plotting the coordinates, but it could be [inaudible]. So we can make some good stuff, like good for making video games. You can shoot [inaudible]. >> George Djorgovski: Oh, come on. >> Ciro Donalek: So you can shoot that. >> George Djorgovski: So there is a -- >> Ciro Donalek: [inaudible] the free version of the Unity platform. >> George Djorgovski: So the idea is to have a Web-based 3D data browser that you don't have to do anything for; it would just be available to you. Then there would be a somewhat more sophisticated version later. >>: So are the color selections hardwired by the input data? >> George Djorgovski: No, no, no. >>: Or can that be -- >> George Djorgovski: There is a standard format specifying which column gets represented how -- X, Y, Z, and so on. We can switch that internally in the graphics, essentially. >> Ciro Donalek: Yeah. And the clusters are encoded with color. So you have a column that you call cluster, it takes a number 1, 2, 3, and those are encoded with shapes and colors. >>: For example, could you select a subset of these points and color them yellow? >> George Djorgovski: You would have to interact with the data file; there is no such option in the browser yet. >> Ciro Donalek: No. But that's possible. >> George Djorgovski: So there's a fine line between how much you want the data browser to do, and become more clunky, versus keeping it streamlined, where you just have a separate data program and scale it however you wish. But the coordinates that are available to you are XYZ, size, shape, color -- which can be one or two -- brightness, transparency, and then beyond that we're talking about other shapes for the points, as I said, [inaudible] on the data points, or, if they're not spherically symmetric, they can spin or be oriented in different ways. They can also pulsate at different rates. At that point you have a raging headache. >>: So -- sorry. >> George Djorgovski: Yeah. >>: Maybe you just said this. How do you get the data in there? Is this being set up by you, or is this something -- >> Ciro Donalek: No, I just loaded the data before. >> George Djorgovski: We now use plain ASCII or CSV files, and we will have VOTable input as well. >>: The plan would be that this is a cloud service, so you just point something on that Web page at a local data file and it will read it in, and then -- >> George Djorgovski: Yeah. You can read your own local data or data anywhere on the Internet. >>: Do you have some [inaudible] and then generate data or pull data out of databases using some other cloud service? >> George Djorgovski: Yes. Give us some time, okay? This is not even an alpha version yet. >> Ciro Donalek: [inaudible] a lot of data.
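The exact input convention was not spelled out in the talk, so the loader below is only an illustrative guess at what a "plain ASCII or CSV" table for such a viewer might look like: required x/y/z columns, optional size/color/cluster/url columns, with the cluster label as a small integer that the viewer maps to shapes and colors.

```python
# Illustrative only: one plausible column convention for a CSV fed to a
# 3D data browser of the kind demonstrated, plus a small validating loader.
import pandas as pd

REQUIRED = ["x", "y", "z"]                      # spatial axes
OPTIONAL = ["size", "color", "cluster", "url"]  # extra visual channels / links

def load_viz_table(path):
    df = pd.read_csv(path)
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    if "cluster" in df.columns:
        # class labels are expected to be small integers (1, 2, 3, ...)
        df["cluster"] = df["cluster"].astype(int)
    return df[[c for c in REQUIRED + OPTIONAL if c in df.columns]]

# Example: table = load_viz_table("quasar_sed_sample.csv")
```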
I'm just copying and pasting [inaudible] and it can be [inaudible]. >>: But that's the ultimate -- one of the ultimate goals, right? >> George Djorgovski: Oh, yeah. Sure. >> Ciro Donalek: Yeah, yeah. >>: [inaudible]. >> George Djorgovski: Absolutely. So, to conclude here, and then we can take some more questions: in addition to this, there are a whole number of technologies that are coming up -- 3D displays, of course, but also haptic interfaces, and Kinect is the most prominent of them. I'm of the opinion that Kinect may be the single most important thing that Microsoft has ever done, because it's opening up a whole new paradigm of how to interact with cyberspace -- not just playing games, but interacting with other people, interacting with data, and so on. And there will be a choice of whether people want to be represented as their own video version, a 3D video version, or as avatars or whatever. We're still learning how to use these technologies properly. But this is essentially science fiction coming to life, right -- like scenes of displays of data being manipulated by hand. We can already start to do this with Kinect; in fact, we're working on Kinect thanks to input from Jonathan and others. And Jim [inaudible] pointed out this great little new piece of equipment called Oculus. We have invested in that, and we expect to get the SDK for it in December. So this would be a lightweight, easy way to get a complete, immersive 3D sense of VR. Eventually maybe we'll all have Google Glass or something equivalent, so that we can overlay [inaudible] or fully immersive VR on our vision. So that's where we're going. So the key points are here. I mean, visualization is probably the single most difficult problem we have now in massive data science, because you don't really understand things intuitively unless you can visualize them somehow. And we have problems not just with the quantity but with the complexity of the data, its dimensionality, and so on. Another thing: if there is one thing that I learned from playing with VR, it is that we are totally meant to interact in 3D. We're optimized to interact in 3D. And so the whole two-dimensional display thing is an obsolete artifact. And so this could conceivably lead to greater insights as we explore the data. Even so, there are still limitations to what the brain can process. I said up to a dozen dimensions; that really is the absolute maximum. Effectively, maybe six, seven, or eight dimensions are what you can comfortably deal with in a display. So we now already have working tools that are freely available to anyone. You can do this on virtual-world platforms, and we're developing this one that's Web-browser based, which I think will also be zero cost -- until we get into the goggles and stuff, in which case you have to pay money for those. But the goal is to be able to do fully immersive 3D visualization anywhere -- on your laptop, desktop, at home, at the airport -- without any special equipment. So that's that. Thank you. [applause]