>> Yan Xu: Let's get started again. In the second half of today's morning session, we have four
contributed talks, 20 minutes each, including questions. And the first speaker is David Hogg.
He's going to talk about going beyond map-reduce and going beyond maximum-likelihood.
>> David Hogg: Thanks. Okay. So this is a slightly odd talk because it's not results, it's not
even a proposal, it's a statement that where we're currently going is I think in the wrong
direction, and it's a challenge to the people in the audience. We've got good people in the
audience, so really I am pitching a project to the world and hope somebody picks it up.
So there's this map-reduce framework, and I will say what that is, although I actually think most
people in this room know it better than me, so it's a little embarrassing if I say anything about it,
but I'm going to say something about it.
There's this map-reduce framework, and it does tasks in log N time when you have huge data,
and so it's really important to data centers and to data analysis and to all big data companies on
the Web. And we can do things like maximum-likelihood analyses in frameworks that look a lot
like map-reduce, though not exactly.
Oh. And, by the way, when I say map-reduce, I don't mean anybody's product; I mean these
operations of map and reduce, which I'll describe in a minute. I'm not talking about MapReduce as a commercial product.
The bad news is that we cannot do the things we need to do in the next generation of astronomical surveys in map-reduce. This will not work. We need to do something different.
And we don't know how to do the things we need to do at scale. We don't know how to do them in anything less than time polynomial or exponential in the amount of data we have.
And there are people from these various projects here, and they might disagree with me, but I think they have a limited view of what their data analysis will be. I think if you really think about what you want your data analysis to look like, you won't be able to work in log N operations.
So this is a call to arms. But I might point out that if you solve any of these problems, you might get fabulously rich. So it's not just an astronomy problem, of course, because lots of people -- lots of people are stuck with scale.
Okay. So I have various collaborators. Bovy was a student of mine. Brewer is a statistician at
Auckland who's very good with sampling and Bayesian things. Fergus is a computer vision
faculty member at NYU. Foreman-Mackey is my student right now. Goodman is an applied
mathematician at NYU who works on applied math relating to data analysis. And Lang, who was my student, is now a postdoc at CMU and works on astronomical imaging.
Okay. So this quotation at the top here: We won't even consider any algorithms that can't be
written in the map-reduce framework. I've heard that said with small modifications by people at
Google who say they will not run things inside the house that are not map-reduce and also at
Microsoft -- I mean at IBM. The Watson team and those people say we won't do anything that
isn't map-reduce.
And the reason, of course, is they need to be -- they think of Google searching the whole Web,
Watson also searching the whole Web in a different way. They need the -- the amount of data is
so large you have to work log N. You can't work slower than log N.
So what do I mean by map-reduce? What I mean is that you split your data analysis into these
two steps. The first is a map step where you run the same operation on every piece of data. And
then the next is a reduce step where you then compare pairwise all the outputs that you got from
the map step.
So this is the way all log N algorithms look, because you have a tree: you do some operation on the leaves of the tree, and then you feed the information back up through the tree. And if you have N things, N leaves in your tree, then you have log base 2 of N levels of branching.
And so you can do operations in this form in log N time because each of your leaves is sitting
itself on a computer. You have a datacenter, you have your data all over the datacenter. Every
datum you have, every piece of data you have, is sitting on a CPU.
So you can do all the map operations locally, and then reduce just goes up the tree. So that's the
map-reduce. I'm explaining it badly, but that's because it's not really my business. In fact, as I'm
about to say, none of the problems I do -- I work on actually fit in map-reduce.
But the way to think about this is -- like the standard thing is Web search. You search every document locally for the word kittens. And then if it has kittens, you return the document ID and the PageRank. And then in the reduce step you're comparing all the PageRanks you're seeing. And so in log N time you can return the top-PageRanked page that mentions kittens, and you've searched the whole Internet.
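Just to make the pattern concrete, here is a minimal sketch of the map/reduce structure he is describing, in Python. The documents, IDs, and PageRank values are made up, and this is not any particular framework's API, just the two-step shape of the computation.
```python
# Minimal sketch of the map/reduce pattern described above (toy example,
# not any particular framework).  Documents, IDs, and ranks are made up.
from functools import reduce

documents = [
    {"id": "doc1", "text": "kittens and puppies", "pagerank": 0.7},
    {"id": "doc2", "text": "quasar surveys", "pagerank": 0.9},
    {"id": "doc3", "text": "more kittens", "pagerank": 0.8},
]

def map_step(doc):
    # Run the same local operation on every document:
    # emit (id, pagerank) if it mentions "kittens", else nothing.
    if "kittens" in doc["text"]:
        return (doc["id"], doc["pagerank"])
    return None

def reduce_step(best, candidate):
    # Pairwise comparison of map outputs: keep the higher PageRank.
    if candidate is None:
        return best
    if best is None or candidate[1] > best[1]:
        return candidate
    return best

top = reduce(reduce_step, map(map_step, documents), None)
print(top)  # ('doc3', 0.8)
```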
Okay. Good. It's brilliant, but it's also a huge opportunity. That statement at the top means there's a huge opportunity for somebody who wants to get rich. I am not that person. But all the reasons that we can't work in map-reduce are also reasons that lots of interesting companies can't work in map-reduce.
Okay. Good. So here's an example of something you can do more or less in map-reduce. It's not exactly map-reduce, but it's the same kind of structure. Say I have a likelihood function -- do I have a little pointer? Ah, yes, I do. Does that work? Yeah.
So say I have a likelihood function that looks like this. I want the probability of my full dataset
given some parameters. This is a standard thing for astronomy. And also standard is imagine
the data are independent. So these are many -- say these are observations of quasars, tons and
tons of quasars you've observed here. Each quasar has a probability given the parameters of your
model. But you really want to set the parameters of your model using all the data so it's this
massive product over all your objects.
And if you think forward to -- I'll talk about the scale, but this could easily be tens of billions of
astronomical objects here.
So if you want to find the maximum with respect to theta -- you want to do maximum-likelihood -- it's really straightforward. In the map step, for each data point you compute the derivative of its log-likelihood with respect to the parameters, and in the reduce step you just sum up those derivatives. And then you can go uphill. And, okay, if you want to get to the maximum likelihood, you actually have to run some optimization algorithm like conjugate gradient or BFGS or something. But each iteration of that only takes log N time, because in each iteration you can compute the derivative in this map-reduce way.
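A rough sketch of that structure, assuming a toy Gaussian model in place of his quasar model: the per-datum gradient of the log-likelihood is the map step, the sum is the reduce step, and an off-the-shelf optimizer (here scipy's BFGS) drives the uphill steps. The model, parameter names, and data are mine, not the talk's.
```python
# The log-likelihood of independent data is a sum over objects, so its
# gradient is a sum of per-object gradients (the "map"), combined by
# addition (the "reduce").  Gaussian toy model only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)   # toy "observations"

def per_datum_neg_loglike_and_grad(x, theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    r = (x - mu) / sigma
    nll = 0.5 * r**2 + log_sigma                      # up to a constant
    grad = np.array([-r / sigma, 1.0 - r**2])         # d(nll)/d(mu, log_sigma)
    return nll, grad

def objective(theta):
    # "map": evaluate every datum locally; "reduce": sum the results.
    terms = [per_datum_neg_loglike_and_grad(x, theta) for x in data]
    nll = sum(t[0] for t in terms)
    grad = sum(t[1] for t in terms)
    return nll, grad

result = minimize(objective, x0=np.array([0.0, 0.0]), jac=True, method="BFGS")
print(result.x)   # roughly [2.0, log(1.5)]
```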
So this is a very nice problem. You can do it all in map-reduce. So if astronomers are happy
with maximum-likelihood and that's all we need forever, we're fine. We're absolutely fine.
Of course I'm going to say that's not true. So what is -- just to get the scale, the scale of
astronomy is not nearly as large as the scale of other industries. So I don't actually think of
astronomy as being all that large scale. I spin a large fraction of all of the astronomical data
that's ever been taken just in New York, so it's not -- we're not talking about Microsoft scale
here. But it's big enough that we care about scaling.
So if you think forward to LSST -- so the two projects I'm going to kind of talk about as my
examples are LSST and Gaia. And you might not know Gaia. It's less known here in North
America than in Europe. But in -- both of these are at a substantial scale that matters.
So in LSST there's something like 10 to the 15 pixels -- I'm right about that, right, Jelco
[phonetic]? -- and 10 to the 10 galaxies-ish depending on how deep it goes and what cuts you
make. But it's this scale.
And what we want to do -- well, there's many projects you want to do with LSST. It's an
absolutely amazing source of science. And I won't summarize. I'm only going to talk about
weak lensing because it's a good example to discuss.
In weak lensing what you want to do is measure the shear map, the distortions of all those galaxies due to gravitational structure formation: because the mass map is inhomogeneous, there's this inhomogeneous lensing. So you want to measure this inhomogeneous lensing, and then from that lensing you want to get the cosmological parameters.
That is one of the most ambitious projects in all of astronomy. And we have to succeed or LSST
was not such a good idea. So we are going to succeed, but it's going to hurt, as I'm going to tell
you in a second.
And then Gaia, Gaia is a smaller scale, but the inference problem is even harder in Gaia, because
in Gaia what you're doing is measuring precisely the stars in the Milky Way, measuring their
positions and their velocities, and then you're trying to infer the nonlinear dynamics of the Milky
Way from the positions and velocities of the stars.
So it's a much more nonlinear problem. So it's actually -- even though the scale is smaller, the
problem is harder. You want to understand the dynamics of the Milky Way, so the mass model
and the orbit structure of the Milky Way, but unfortunately you have this distribution function of
stars. Stars are not distributed on the orbits democratically, some orbits are more populated than
others, so there's a big distribution function problem in here.
And the thing that's going to kill us and the reason that map-reduce isn't going to work and the
reason why we cannot -- we've got a really big problem coming up is this shear map and this
distribution function and probably also this dynamic is going to be nonparametric. It has to be
nonparametric.
The reason it has to be nonparametric is we don't actually know what -- we don't know how to
parameterize these functions properly.
And what do I mean by nonparametric? I mean the strict sense of nonparametric, which is that the model gets bigger as the data get bigger. So as you get more and more data, you increase the number of parameters. So your model scale is rising with your data scale, and -- everybody in the room already sees why that's going to be a huge problem.
And the reason that all these maps and distribution functions and all these nuisance parameters
are going to be nonparametric is that as you observe more and more galaxies, you want to
increase the angular resolution of your map. Duh. Like that's why we're taking more galaxies,
so that we can get more resolution.
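A tiny illustration of "nonparametric" in this strict sense, with a stand-in model of my own (a histogram density whose bin count grows roughly as N to the one-third); nothing here is from the talk beyond the grows-with-the-data idea.
```python
# The number of model parameters grows with the amount of data.  Here the
# "model" is just a histogram density estimate whose bin count scales like
# N**(1/3) (a common rule of thumb; the choice of rule is not from the talk).
import numpy as np

def histogram_density_model(samples):
    n_bins = max(1, int(round(len(samples) ** (1.0 / 3.0))))
    counts, edges = np.histogram(samples, bins=n_bins, density=True)
    return counts, edges   # the "parameters" of the model: one height per bin

rng = np.random.default_rng(1)
for n in (100, 10_000, 1_000_000):
    counts, _ = histogram_density_model(rng.normal(size=n))
    print(n, "data points ->", len(counts), "parameters")
```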
So everything important in astronomy is really nonparametric in some sense, everything
internally in astronomy. I'm actually going to say at the high level we want to get parameters.
But in the middle -- I'm going to show you some graphical models in a second.
Okay. Good. And importantly -- this is an important issue, but I'm not going to talk about it so much -- nonparametric models are never inferred at high signal-to-noise. Why not? Because once you have high signal-to-noise, the model grows, you add more parameters, and then you're back down to medium signal-to-noise again. You never have high signal-to-noise in a nonparametric model.
So, by the way, so I'm going to show you some graphical models just to set the stage. I don't
know if you've seen graphical models. Not everybody in here probably uses graphical models.
So if I have a process where I have cosmological parameters which produce a shear map which produces observations, you can think of this as showing the causal relationship -- that the parameters create the shear map, which creates the observations -- but you can also see this as a description of a probability statement: that the probability of the whole set of all the variables I care about is the probability of the cosmological parameters, times the probability of this shear map given the parameters, times the probabilities of all these observations given the shear map.
See, it's a way of breaking down the causal structure or the probability structure of the problem.
So I'm not going to -- if you don't know probabilistic graphical models, learn them, because
they're incredibly valuable for thinking about data analysis. But now I'm just going to use them
as causal structure to look at the problems.
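In symbols of my own choosing (theta for the cosmological parameters, s for the shear map, d_n for the observations), the factorization he is reading off the graph is:
```latex
p(\theta, s, d_1, \ldots, d_N)
  = p(\theta)\, p(s \mid \theta) \prod_{n=1}^{N} p(d_n \mid s)
```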
So this is the Gaia problem. In Gaia, Gaia is measuring a billion positions of stars, and these are
six-dimensional positions. They're position and velocity. So Gaia measures a billion
six-dimensional positions. Where did those positions come from? Well, the star positions it
measures are related to the true star positions, and then the true star positions are created by the
dynamical model of the Milky Way and a distribution function.
So these are produced by two things: one thing we care deeply about, which is what is the
structure of the Milky Way; and one we don't care nearly as much about, which is about the
happenstance of what orbits happen to be populated.
Now, some astronomers would actually reverse those and say the other one's more interesting.
But from my perspective, these are the interesting parameters; these are the uninteresting
parameters. But this is a huge nonparametric model, and this might be a parametric model or
nonparametric, but this is a huge nonparametric model which is affecting my observations and
I'm going to have to marginalize this out if I want to learn these parameters. That's what's going
to hurt.
And of course there's also noise coming in and there's a noise model. So there's a noise model
generating uncertainties which are also affecting the observations. So you see how this is like a
causal model for the Gaia data. And that's the kind of problem we work on.
Here's another one. Here's a slightly worse one. This is the weak lensing model. So in weak
lensing you have cosmological parameters. These are the cosmological parameters of the density
of the universe, the initial conditions, sigma-8, all those things. They produce a shear map. This
is a huge nonparametric object; that is, our shear map is a function of redshift and angle.
On the other side over here before we come into the box, over here there is some distribution of
galaxy properties. Galaxies have complicated shapes on the sky, and that generates the true
shape of the galaxy. So the true shapes of the galaxies are generated by some horrible process
that I certainly don't care about, although I should care about it, and then -- but what you observe
is not the true shapes, you observe shapes that have been affected by this distortion map.
So once again this has the problem -- this has this character that there's this big nasty
nonparametric model here affecting the observations, and there's a big nonparametric model here
affecting the observations, but all I care about is this. I just want to know what are the
cosmological parameters.
See, that's the -- this is the structure of essentially all problems in astronomy. Actually, I could
argue that this is the structure of all problems in science. I could even go further. But I'll stop at
science. I think all quantitative data analysis problems have structures that look like this, and in
particular the thing you care about is separated from the thing you observe by one hell of a lot of
nuisance parameters.
>>: What's XN?
>> David Hogg: That's the position of the galaxy, because in the shear map you have to know
the position of the galaxy to predict its shear.
Even this is overly simplified. The positions of the galaxies are also created by the cosmological
model. And in principle the cosmological model makes this galaxy formation model which
produces the shapes of the galaxies. See? So I've really left out -- this is already a massive
oversimplification of the true situation. But this is more sophisticated than the analysis people
currently do of weak lensing. So this would already be a step in the hard direction.
Okay. So just in case somebody thinks that we can solve all of our problems with Gaussians,
you can't. I don't think I'm even going to say much about this, but the fact is you never
understand your noise properties as precisely as you think you do, and your data are produced by
a mixture of processes.
Whenever you have a data point -- people were just talking about it at the end of the last talk.
When you have a data point, you don't know whether that's a correct data point or an incorrect
data point, so your data are produced by a mixture of correct and incorrect things. And it can be
a lot worse than that. But at the very least it's that bad. And once you're that bad, every data
point produces multimodal things because there's this classification inside your data point that
allows you to jump to two different locations, bad data, good data, and so you generically get
multimodal things.
You also get very broad support in parameter space in the sense that a data point, if it's true, is
telling you something very useful, but if it isn't true, it has broad support. And so it's very hard
to approximate the likelihoods you're getting from individual data points with very compact
functions.
Okay. Good. So now I'm at the final point I want to make. This is the main point I want to make. Bayesian inference cannot be done in this map-reduce framework. And you might think it can be, because it looks hella like map-reduce. And I've seen a lot of glib statements -- it looks really map-reduce. Because you have all the data, whose likelihoods are producted together, and you have a prior -- so this is standard Bayesian inference -- you can compute these local likelihood functions. You can compute these all locally and then only product them together in the reduce step.
So it looks really good. You can locally produce your likelihood functions and then on the
reduce step you bring the likelihood functions together and you multiply them together. It's
perfect. In fact, you don't even have to be a Bayesian. Good. You don't have to be a Bayesian.
This is a really -- you can just think of the likelihood analysis. You can do this -- you can think,
oh, good, I can do this all with map-reduce. But you can't. You cannot pass forward those
functions.
Theta is a very high-dimensional object. The functions are multimodal. The support is broad, and these nonparametrics mean that this might even have a variable number of parameters that's growing with the data.
So no individual node knows enough about the whole dataset to know how much detail it has to pass forward in its likelihood function. You have to know a huge amount about the whole dataset to say with what fidelity you need to pass forward this function.
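To see the contrast, here is a sketch of the part that does fit the map/reduce shape: evaluating the log-posterior at one fixed theta. The model, prior, and shard layout are placeholders of mine. What the talk says you cannot do is have each node summarize its whole likelihood function over a high-dimensional, multimodal, growing theta and pass that forward.
```python
# For any single, fixed theta the log-posterior is a sum of local
# log-likelihood terms plus a log-prior: a map (local term) plus a reduce
# (sum).  A sampler or a marginalization needs this at many adaptively
# chosen thetas, and no node can compactly summarize its whole likelihood
# *function* -- that is the part that breaks.
import numpy as np

rng = np.random.default_rng(2)
data_shards = [rng.normal(1.0, 1.0, size=1000) for _ in range(8)]  # "nodes"

def local_log_likelihood(shard, theta):
    # map step: each node evaluates its own data at the given theta
    return -0.5 * np.sum((shard - theta) ** 2)

def log_posterior(theta):
    log_prior = -0.5 * theta**2                    # toy Gaussian prior
    # reduce step: sum the per-node terms
    return log_prior + sum(local_log_likelihood(s, theta) for s in data_shards)

print(log_posterior(0.9), log_posterior(1.1))
```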
So there's -- basically, this fails. And I say "but that's not all" because the other problem is marginalization. We also have to do a massive marginalization, and there's no way to marginalize out -- if all I care about is this and all I'm observing is this, I have to marginalize out this and I have to marginalize out this. These are big nonparametric functions. And, furthermore, their marginalization depends on all the data. After you've observed this, the posterior on this depends on all the data, and you then have to integrate over that, so there's no way to do that locally. Of course it's necessary: if you don't do that, your constraints on the cosmological parameters will be wrong.
One of the things about weak lensing, one of the reasons weak lensing is such a good example, is
you never measure this at high signal-to-noise, but you really care about your cosmological
parameters. So if you don't do this marginalization, you will fail.
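Written out in the same notation as before (my symbols, not his), the marginalization he means is the integral over the nonparametric nuisance function:
```latex
p(d_1, \ldots, d_N \mid \theta)
  = \int p(s \mid \theta) \prod_{n=1}^{N} p(d_n \mid s) \, \mathrm{d}s
```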
There are enormous nonparametric Bayesian inferences that have been done. How did people do
them? I just said you can't do them. It takes exponential time or something. I didn't tell you
what time it takes. It takes of order N cubed if you just naively write it down. Not log N, but
more like N cubed.
So how have people successfully done this in the past? The way they've done them in the past is
by carefully choosing priors that make the inferences tractable.
We are not allowed to do this. Why not? Because we are scientists and our prior information
has to contain our prior knowledge, and our prior knowledge won't be in conjugate form. We
actually have nontrivial prior information. So we can't do this.
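For what "priors that make the inferences tractable" typically means, here is the textbook conjugate example (not from the talk): a Gaussian prior on a Gaussian mean with known variance stays Gaussian after seeing data, with closed-form updates -- exactly the convenience that a genuinely informative, non-conjugate astrophysical prior destroys.
```latex
% Conjugate (tractable) example, with known noise variance \sigma^2:
\mu \sim \mathcal{N}(\mu_0, \tau_0^2), \quad
x_n \mid \mu \sim \mathcal{N}(\mu, \sigma^2)
\;\Longrightarrow\;
\mu \mid x_{1:N} \sim \mathcal{N}\!\left(
  \frac{\mu_0/\tau_0^2 + \sum_{n} x_n/\sigma^2}{1/\tau_0^2 + N/\sigma^2},\;
  \left(1/\tau_0^2 + N/\sigma^2\right)^{-1}
\right)
```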
So these Bayesian nonparametrics, if you look at Bayesian nonparametrics, it's almost useless to
us, unfortunately. So my view is that the word Bayesian is becoming a bad word.
My approach? I don't have an approach. That's my approach. But I think this is the issue. I
think the next generation of surveys will fail if we don't solve some of these problems.
I have to say the applied mathematicians and the computer vision people know a lot of really
great things about these kinds of problems and we should be working with them more.
All my collaborative -- if you notice, more than half of my collaborators in this were not
astronomers. That's been extremely valuable to us. And there are people in this room of course
who have also drunk that Kool-Aid.
In my day job, just to say I'm not a total bullshitter, I do do big models. Right now we're
building a 10 to the 9 parameter model of the SDSS pixels as a kind of baby step towards LSST.
We are getting some Bayesian nonparametrics working but with real priors, priors that represent
our knowledge. We have one of the best adaptive MCMC samplers around. If you need some
MCMC sampling, we have -- both Brewer and Foreman-Mackey are my experts.
We have done full marginalization of a nonparametric distribution function, but of course we -- in this problem we only had eight stars, not 80 million stars.
We've done huge models of data, data-driven models, and made large predictions, and we're big believers in working with missing data and heterogeneous data.
And I'm done.
[applause]
>> Yan Xu: Okay. Questions?
>>: Two comments. Firstly, there's this new stuff coming out of Google now called Dremel. I
mean, we all know the MapReduce version that came from Google. Google have done away
with that, now they're using this new framework called Dremel which is supposed to be beyond
MapReduce, scale [inaudible] scale problems. Can you comment on -- have you looked at that or does that solve something --
>> David Hogg: No. So I haven't looked at it, but I'm almost certain that they still need log N.
>>: Yeah.
>> David Hogg: And so nobody is doing a marginalization, basically, in log N time. Because I
think it's -- so I haven't looked at it is my basic answer, but I very much doubt it will solve these
problems. This will be a real breakthrough because if people can solve the problem of passing
forward the posterior probability through some log N thing, that would really be game changing,
like new companies will appear the next day.
But I have to say there is -- there is one respect in which people have done this in log N time, and
you could do, and that is that -- I forget what the name of it -- there's a technical name for it, but
it's where you build only very, very low-parameterized models of the posterior and pass those
forward only. And if you can work in that approximation, you're good. I'm just saying in
astronomy we can't work in that approximation. But, anyway, yes.
>>: My other comment was going to be that people have translated Bayesian networks into
map-reduce.
>> David Hogg: That's correct.
>>: Where --
>> David Hogg: Yeah.
>>: And that does work.
>> David Hogg: Yes. And that is exactly in this form. That's this thing like Deep Learning. So Deep Learning is one of these nonparametric Bayesian things. In Deep Learning you have many layers in, like, a neural network. And you design that so that all the edges in your graph are analytically marginalizable. They're very beautiful models, actually, to be fair. And they're extremely high performing. Like many of the best computer vision algorithms use these Deep Learning networks now.
And Deep Learning -- like there's now on-chip Deep Learning. It's amazing. But it really
doesn't work for us. It works really well in the world, but it doesn't work for us because our
priors are so specific. Like our priors on that shear map are very specific. So we're just not
allowed to use these tractable priors.
>> Yan Xu: Okay. In the back.
>>: I really enjoyed your talk and I agreed with most what you said, but there was one statement
that made me almost fall off my chair.
>> David Hogg: Uh-oh.
>>: You said that if we don't solve this problem, as you phrased it --
>> David Hogg: Yeah.
>>: -- that the future surveys will fail.
>> David Hogg: Yeah.
>>: Which I think is a gross overstatement. I do agree with you that without sophisticated nonparametric models we might not be able to extract the entire information [inaudible] surveys like LSST, but even using contemporary methods -- and you'll definitely have better methods a decade or two from now -- we can still extract a significant fraction of the information in these surveys, and I don't know quantitatively how much better you would do with nonparametric models, but I do know for a fact that even with standard methods we will get enough information content that we cannot claim these surveys will fail.
>> David Hogg: Good. So let me -- let me be a little bit more specific. LSST without solving
this problem will meet its goals. But in my view, that is not sufficient because we're spending a
lot of money, and we should be, and -- and the -- the current plan with LSS- -- I mean, if you do
not solve this problem, you're in the -- you're in the state of classical statistical estimators and all
the argument about bias and variance and so on.
So most of the weak lensing literature right now is people arguing about minor differences in
how they estimate shears to look for unbiasedness.
And as the survey grows, the sensitivity to those biases and variances will grow. So it's not obvious to me -- so I agree that LSST will meet its goals without solving this problem. It will meet its stated goals without solving this problem. But it is possible that there are systematic problems with all the point estimates people have for this problem such that LSST will drift into a systematics-dominated position on the weak lensing.
>>: [inaudible] that the complexity grows linearly with the size of the survey -- that is perhaps true in nonparametric models. But, for example, if you just use the extreme [inaudible], as you increase your sample size, the number of components you get does not increase linearly with the data [inaudible].
>> David Hogg: That's true. One thing that might help a lot is it is possible that the complexity
of these models might only grow as like the log of the data size. And in that case that might save
us. That might save us.
It's still the case, though, that this marginalization is very hard. The marginalization is still hard, even if it gets -- because the marginalization itself is an expensive operation. But it is true that these may only grow as the log of the size of the data. That's not a known scaling yet. That's right.
>> Yan Xu: I would like to take more questions, but time is running out, so please ask him during the lunchtime. So I'd like to move on to the next talk. But before that, let's thank the speaker again.
[applause]
>> Yan Xu: Our next talk is by -- what's your first name?
>> Nebojsa Jojic: Nebojsa.
>> Yan Xu: -- Nebojsa Jojic. His talk is titled "Epitomes and Counting grids: Discovering patterns by (re)arranging feature counts."
>> Nebojsa Jojic: Okay. So I'll talk about a very simple model. And since you guys are astronomers and looking at images all the time, maybe this is not as new to you as it was in the computer vision community many years ago. So I'll talk about a very old model that I introduced many years ago and then an extension that's more recent.
These models have been used also in other applications. We've used them for [inaudible] design, we've used them for viral load prediction and so on.
Now, the basic idea is very simple. You will very soon see why I chose to talk about this particular area, which is not actually what I'm spending most of my time on these days, but it may be most relevant to your problems.
So the idea is very simple, and I talked about it briefly the last time we had a similar workshop several years ago. Let's say you have an image, my mother's poodle, for example, and let's say that I want to somehow summarize the texture of this image, the repeating patterns in this image. And the way I would do that is to take at random lots of different patches of different sizes from this image and then try to assemble them back, but I want to assemble them back into something that's much smaller than the original image.
If I'm able to do that, then obviously I will have to somehow squish all that texture, so instead of having all these flowers in the background I'll have only a couple. Which is actually what happens here. I have a couple of different types of flowers, a little bit of area here for the gravel and some area for the fur and so on, which means that I've kind of learned the structure of the image; I know what the self-similar parts in the image are. Right? I can now do segmentation, for example, of the image by segmenting this out here in the epitome and transferring that back by inverting the process: finding out what this area came from in these patches, and then where these patches came from in the original image, and then segmenting out the gravel, for instance.
So, again, the idea is that if I take all these patches -- if I just look at the patches here, I can't really reconstruct the entire image without knowing where they came from, but I can see enough common structure that I can learn a squished texture.
Now, how would I learn that? Well, the approach is very simple. I would start with some guess. And this is what we call the epitome, this little squished texture.
So I'll start with some guess of that epitome, which is basically just the average color in the image with some noise to break the symmetry. And then I'll take all those patches and see if I can map them there; they'll initially map equally well or equally badly to all parts of the epitome, but some parts are going to map a little better. And then I will count all the times that I've hit every pixel in the epitome with these patches and then average those pixels based on how often they've been hit, taking all the probabilities into account.
And then when I do that I'll in the next iteration get something like this where you can see a little
breakup of a little whitish part to the left and right and black in the middle. The reason why you
have whitish to the left and right is because the epitome is actually defined on a torus, so the left
and right connect and the top and bottom connect.
So then if I keep iterating this, after a few iterations, in here I'm showing 16 iterations, you get
this texture that I've shown in the first slide. So that's basically the algorithm.
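Here is a rough sketch of that iterative procedure in Python, as I understand it from the description: initialize the epitome at the mean image value plus noise, softly assign random patches to every epitome position, and re-estimate each epitome pixel as a weighted average. Grayscale only, no torus wrap-around, and the soft-assignment details (squared-difference scores turned into weights) are my simplifications, not the published algorithm.
```python
import numpy as np

def learn_epitome(image, epitome_size=16, patch_size=6, n_patches=500,
                  n_iters=8, temperature=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # start at the mean image value plus noise to break the symmetry
    E = image.mean() + 0.05 * rng.standard_normal((epitome_size, epitome_size))
    H, W = image.shape
    P = patch_size
    S = epitome_size - P + 1           # valid (non-wrapping) epitome positions per axis
    for _ in range(n_iters):
        num = np.zeros_like(E)
        den = np.zeros_like(E)
        for _ in range(n_patches):
            y = rng.integers(0, H - P + 1)
            x = rng.integers(0, W - P + 1)
            patch = image[y:y + P, x:x + P]
            # score the patch against every epitome position
            scores = np.empty((S, S))
            for i in range(S):
                for j in range(S):
                    scores[i, j] = -np.sum((E[i:i + P, j:j + P] - patch) ** 2)
            w = np.exp((scores - scores.max()) / temperature)
            w /= w.sum()
            # accumulate the patch onto the epitome, weighted by its soft assignment
            for i in range(S):
                for j in range(S):
                    num[i:i + P, j:j + P] += w[i, j] * patch
                    den[i:i + P, j:j + P] += w[i, j]
        E = num / np.maximum(den, 1e-12)   # weighted average of everything that landed here
    return E

# usage sketch: E = learn_epitome(some_grayscale_image_as_float_array)
```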
Now, the question, of course, is what do I mean by mapping the patch into the epitome. I could introduce some kind of geometric shears or intensity manipulations, or rotations and shifts and so on, or I could just map the patch as it is. And of course if I do all these transformations, then I have to keep track of them when I'm re-estimating the epitome.
And then the other question is am I always going to get the same epitome? Obviously not
because I have this random initialization, so I'm going to get different epitomes. But they'll have
a qualitatively very similar appearance.
So to demonstrate that, here's the epitome of Microsoft Research as it was, I don't know, maybe seven or eight years ago. A different building, as you can see. So the
epitome here has a lot of these elements, they're window elements because there's a lot of them
in the image here, and it has a few faces as well. And probably most of these faces map to either
this one or this one or maybe this area here or the area there.
But if I rerun this, I'll get a slightly different epitome. So here I'm kind of zooming in on this one. But qualitatively it's very similar. Again, I have some face-like thing here. I have a face-like thing here as well. And in this area I've put red and green on either side of the nose of the face that appears there, and then I map back from the epitome through the patches into the original image to see which areas contributed to this area. And I get a face detector from just one image of Microsoft Research. It doesn't work like the best face detector, but, still, it's just one simple iterative algorithm that does it.
If I do an epitome of many different faces, then I get this weird facial texture. It's bizarre. But if
I now highlight, for example, these three areas here because I see a little smile here, I see a
closed eye here and I see some dark hair there, so if I pick these three things and I look back for
all the images that map there, whose patches map there, I get these guys who have dark hair and
tend to squint when they smile.
If I instead choose this other type of an eye, I get these guys who actually have their eyes open
when they smile. So this is where the vision comes in, right, the computer vision.
And there's a lot of other applications that I don't have time to talk about. Epitomes were mostly
used by computer graphics and computer vision people, but they've also been used in SIM
design.
But here is the thing that I've always found most exciting about epitomes, though I've never been able to apply them this way. Here's an image with a lot of structure. Do you see the structure? You can't quite see the structure. And if this was some scientific image, you would just assume that there isn't anything there; you wouldn't model it at all.
But when you run the epitome on it, it actually finds that there are a lot of 45-degree lines this way and that way, and that's because that image really did have that, but it was obstructed with lots of noise.
So the idea here is that if you run this algorithm, since the epitome is taking all these patches from everywhere, trying to match them and then average them, as you repeat this the obscured patterns emerge.
Now, this is an artificial example, because you never find images that are this corrupted. Here's an example that's the closest: it's an electron microscopy example. But in this case the image is actually controlled. They have very fine control over the microscope so it doesn't move too much, because the noise is humongous, so a video is taken instead of just one frame. So here are these videos. And then you average them.
But even a slight shaking of the microscope creates a little bit of blur. So if you use the epitome
to actually compute the shape there and reconstruct it, you get much sharper boundaries, but also
you get structures that were obscured before.
But as I was saying, the more interesting example would be if you can actually have something
like this where the patterns appear all over the place, you have some huge amount of data
coming, you can't see them, you don't know where they are, they're obscured by noise, but then
by averaging -- repeated alignment and averaging you reconstruct the pattern.
Another example where these things could work well is if you have missing observations altogether. So in this case I'm talking about 3D data, a video sequence, and the idea is that data was missing. For example, if every other pixel is missing, then trying to reconstruct it is equivalent to [inaudible] sampling for super-resolution, increasing the resolution of the image.
If the data is missing in a rectangle like that -- I mean a solid like this -- then you're trying to fill in the data there; there's something missing completely. If it's missing in different frames, then you're trying to interpolate the frames. And I'm going to show an example which doesn't really exist in practice, but, again, to emphasize where these algorithms would have the most impressive use: here the data is missing at random. There are a bunch of different places where the data is missing and it's replaced by random noise. And then if you use this algorithm to reconstruct what's there, you get this.
So I'm going to pause it here. And when I run it again you will probably see this car moving
around, but you'll definitely not see the pedestrian. So I'll play it a couple of times.
So when it comes to science, I'm thinking that a lot of times the first thing we want to do is to
look at the data, and if we don't see anything at first, we're not even going to run the algorithms
on it. And some of these algorithms are better at finding patterns than we are. In computer
vision that's not the case typically because computer vision algorithms can't beat our vision
because we've been optimized for it. But in science there may be an application.
Now I'm going to switch to this other model, which we call counting grids. And it's a relative of epitomes. In the epitomes we were using matrices, ordered datasets. They could be ordered sets -- each data point is actually a set of measurements, but these measurements are coming on a grid. So either it's a two-dimensional matrix or a three-dimensional matrix. So there is some ordering there.
In the counting grids, on the other hand, we have disordered feature sets. So, for example, imagine all those patches I was showing you before, but you don't give me the patch itself; you give me a histogram of colors in the patch, and that's all I have.
Now the question is can I still reconstruct the image. In both of these cases, of course, the
number and type of channels is arbitrary. We don't have to have just RGB measurements. It can
be features of any kinds, measurements in different spectral -- in different spectral domains and
so on.
Now, where do we have these histogram-like data points? So here's one example. In cellular immunity, what we have is that the cell expresses all kinds of proteins. We heard about it yesterday from David. So there are all kinds of proteins being expressed and they're working in the cell -- they're regular, normal proteins -- but sometimes you get viral proteins as well. And for the cell to actually figure out that something is wrong, there are some mechanisms, and it can try to fix that. But these viral proteins look a lot like normal cellular proteins. It's very difficult for a cell to self-diagnose.
So then there is this more global approach, where the cells, as they express these proteins -- they express a lot of copies of the same protein -- have a certain small percentage of those proteins chopped up, and little pieces are presented on the surface of the cell, and then these little pieces are visible to the outside of the cell.
And then the immune system actually monitors, through the use of T-cells, specialized surveillance cells, all the cells around, and if a certain pattern that seems abnormal keeps repeating, then the immune system might go and start killing cells that have a certain signature on the surface.
So what does that signature look like? Well, it's basically, as I was showing here, a bag of epitopes, a bag of these little peptides as presented by MHC molecules.
So in the end, because we don't know of any ordering of these things on the surface, they just
come out on the surface in certain spots, there is no particular ordering, really what we have is
that we have two of this type and one of this type and, I don't know, two of this type and so on.
So it's basically a count of different features that are present on the surface rather than the
organization of them in some two-dimensional, three-dimensional, whatever, structure.
So in the end what we have is a histogram. We have features, in this case 20-something features.
And we have feature counts for each of those. And this type of data is present in various
different applications. For example, documents on the Web are represented by a bag of words
rather than the document most of the time, which is counts of how many words of different types
you have.
In computer vision actually they've started doing this because there are a lot of problems with aligning things, so they try instead to find very informative features and [inaudible], and then a similar thing obviously happened in the case of immune surveillance, as you saw before.
So what's the model for this, the counting grid? Well, we start with peaky distributions because, just like in the epitome, we're going to have a grid, and in each position, instead of one color or a sharp distribution of colors, we're going to have a sharp distribution of features. So we start with a peaky distribution that goes into a particular spot in the counting grid, and the counting grid is going to have a lot of these different distributions, all peaky in comparison to what the data is going to have.
And then the idea is you take a window, you put all these counts together, and that's what you're
going to see in your data. So that's the [inaudible] of the data.
And what that means, that, for example, it's a 3-by-3 patch that you're taking from the counting
grid, then your histogram in this particular example looks like what's shown. But then if I move
the patch a little bit, then I have some features disappear, those that are on the left over there, and
others are included. So you have this smooth transition of feature counts.
So then this is the summary of what the generative model looks like. And then of course the
learning algorithm is going to be similar to what I talked about before except that now not only
that you have to map these bags onto the grid but you have to take the individual features and
map them inside the windows where these things are mapped. So you can iterate that and get
something.
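A small sketch of just the generative step described above, with sizes and names of my own choosing: each grid cell holds a peaky distribution over features, and a data point's bag of counts is drawn from the averaged distribution inside a window, so sliding the window shifts the counts smoothly. This is only the forward model as described, not the learning algorithm.
```python
# Counting-grid forward model sketch: peaky per-cell feature distributions,
# and a data point's histogram comes from the window-averaged distribution.
import numpy as np

rng = np.random.default_rng(3)
GRID, FEATURES, WINDOW = 12, 20, 3

# peaky per-cell distributions over features (Dirichlet with small alpha)
grid = rng.dirichlet(alpha=np.full(FEATURES, 0.05), size=(GRID, GRID))

def window_histogram(top, left, n_counts=50):
    cells = grid[top:top + WINDOW, left:left + WINDOW].reshape(-1, FEATURES)
    p = cells.sum(axis=0)
    p /= p.sum()
    # draw a bag of feature counts from the averaged window distribution
    return rng.multinomial(n_counts, p)

h1 = window_histogram(4, 4)
h2 = window_histogram(4, 5)   # shifted by one cell: counts change smoothly
print(h1)
print(h2)
```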
So here's an example, a synthetic example, where we are pretending that a computer vision
algorithm somehow found features that are very informative. So it can label things in the image
as being mountain, sky, grass, building, roof, and so on and so on. And then if you take one of
these windows, you just get a distribution of features. So in this case, for example, there is a
little bit of a roof but a lot of the mountain and the grass, for example, and a little bit of a sky.
So now the question is: can I, just from these histograms, infer things? Or, if I can't from the histograms alone, how about if I take each one of these patches, break it into four and get four histograms for the different parts -- would that be enough for me to reason out what the original layout of all these things was?
And I'm using this not because it's useful as it is, but because it can give us an idea of how much is possible. Is it possible from histograms to infer the spatial arrangement of features?
It turns out it is. In this case of a 2-by-2 tessellation we get this as a reconstruction, whereas the
original looked like that. It's pretty close.
And here's another example where I'm working with colors. I took this image and discretized it into 64 features, so I have, I don't know, one feature that would be blue, another feature that would be another hue of blue, and so on. And so I have 64 of them. And then in this case I'm trying to learn the counting grid and, of course, I'm showing in each cell of the counting grid the most likely feature. And in this case I'm trying to learn it from histograms alone. And as you can see, after a few iterations I start getting something that captures some of the spatial structure, but it's difficult to get the spatial arrangement right. So I do get some neighborhood information -- certain features tend to border some other features -- but I don't get a very good structure.
But if I do a 2-by-2 tessellation so that each patch is represented by four histograms for the four
quadrants, then I actually do reconstruct the original image.
And then if I do a 16-by-16 tessellation, since the patches themselves were 16 by 16, I'm actually
reducing this to the epitome algorithm and I reconstruct the whole thing.
I will skip the application here and I'll just tell you how you can use it now for analyzing the
viral load in HIV patients.
So as I said here, we're going to model these bags of epitopes using this model. And so once all
these -- once we have learned the model using the bags of epitopes which are actually inferred
based on the immune type of the patient and their sequences, so we know what's presented on the
surface, now we can take every patient and put it somewhere in the counting grid.
Now, let's assume this whole screen is the counting grid. Now, this particular patient maps into
this window and I just put blue here because they had a low viral load and this other patient had a
high viral load, this one low, another one with a low viral load and so on. If I keep on placing
them like this, hopefully the high and low viral load areas are going to emerge and I can do that
again by averaging viral load into these windows.
And when I actually do that, this is what I get. And this is with various different parameters for the size of the counting grid and the patch that I'm using for embedding, because now I have to choose how big my patch is.
The actual histogram of features doesn't come with any idea of how big the patch is, but I'm pretending here it's a 22-by-22 patch size and here it's 11 by 11. So I'm arranging all the features and then figuring out the counting grid.
And this is an example of a 3D version of the same thing, because I don't have to embed it in a 22-by-22 patch; I can embed it into a 10-by-10-by-10 patch, so I can use an arbitrary embedding. It turns out actually that the 3D embedding was the best.
And so here, after a few iterations, you can see how the features -- how the viral load starts separating. And it turned out it actually works better than other techniques we have tried to use for viral load prediction.
This is an example, which I'm not going to talk about, of using this to model text. This is a model of old NIPS papers. And here's the summary. I already told you what the difference between epitomes and counting grids is, but the opportunity, I think, is discovering patterns in seemingly complete noise, under the assumption that there is a pattern that repeats a lot but is misaligned, so you can't use Fourier analysis to find the repeating patterns. And this is that example I gave you before.
So that's it from me. Thanks.
[applause]
>> Yan Xu: Okay [inaudible].
>>: Do you think [inaudible] should be used for doing super-resolution imaging, like, for
example, if you have a video of a car going by [inaudible] license plate?
>> Nebojsa Jojic: Yeah, we've done that actually. This will -- we've done it on grass waving in
the wind. And it works way better than traditional techniques. But the artifacts are very weird.
But the sharpness is way higher. I can show you that offline. Yeah. Um-hmm.
>>: [inaudible] applied this to astronomical images?
>> Nebojsa Jojic: Um-hmm.
>>: [inaudible] galaxy images?
>> Nebojsa Jojic: I don't know. I have no idea. I don't know much about galaxy images. So
that's why I actually showed it like this. So I'm hoping that you guys can tell me, you can come
up with some applications.
>>: What is your text application?
>> Nebojsa Jojic: The text application -- we've tried a couple different ones. One was this NIPS dataset, where we have all the papers published at NIPS with all the words in them. And then if you try to infer what the arrangement of these words would be within a little 2D patch, so that it belongs to a large counting grid and there is a smooth transition and so on, you get essentially very interesting areas occurring.
For example, computer vision is sort of surrounded by graphical models, neural networks, and neuroscience -- not surprisingly. And then reinforcement learning borders on neural networks, which also border HMMs because of the time models, and support vector machines.
And actually when you move from one -- when you look at the documents, you can clearly see
how this smooth transition is happening. It's very interesting.
It would also be interesting to look at this if you're trying to choose your area of research in
NIPS. You're going to see where the boundary is not populated too much.
So that's one example. The other example are the news, especially the -- the -- the news, because
there is this transition, there are multiple topics that are evolving over time. So you really see
how words are dropped and new words are put in. But they don't necessarily follow a
one-dimensional structure but multidimensional structure. So they capture, for example, CNN
news better than the LDA models which are traditionally used for modeling bags of words. Yes.
>> Yan Xu: Quick, quick.
>>: Yeah, quick question. Do you have a generative model? In astronomy, if we applied it in astronomy, we often know a lot about our noise processes. Are there ways to modify the algorithm so that it --
>> Nebojsa Jojic: Oh, yeah.
>>: -- will know about all the noise properties?
>> Nebojsa Jojic: Yeah. Obviously. I mean, you could think of this as just a part of that graphical model you've shown in places, especially for the nonparametric parts, for instance. But, yeah, you can definitely add all kinds of bells and whistles that depend on the prior knowledge. And, on the other hand, if you know almost nothing, or you want to start fresh and don't want to make presumptions, then you can learn it nonparametrically, because this does end up being a nonparametric model: you would need larger epitomes if you have much more data.
>> Yan Xu: Okay. Thank you very much. Let's move on to the next speaker, Giuseppe Longo,
practical data mining services for astronomy.
>> Giuseppe Longo: Well, I think that most people in the room know what I'm going to talk about, since this has been the dominant theme of my talks for the last five years, with many, many evolutions. At some point, you know, you start changing the title, but the substance doesn't change much.
I'm going to talk about data mining and about a project which has been going on for many years. It was a sort of pioneer in the context of the virtual observatory. At the beginning it was called [inaudible], then it became DAME, which just stands for Data Mining and Exploration.
But due to some delays in the implementation of the standards which we needed -- I mean, at the beginning we were all pioneers in the virtual observatory -- DAME has never become a real virtual observatory-compliant device, but I think that now that the pieces are where DAME needs them to be, the transformation will be done [inaudible].
DAME is a joint project between my university, the University of Naples, the California Institute of Technology, and the Italian Institute of Astrophysics.
I mean, the main collaborators over the years have been -- given the specificity of the Italian situation, most of those collaborators have emigrated, and now they are in this room under different hats. But it has been something which has been going on for many, many years, let's say.
What is DAME? Actually, our system manager is a fanatic about acronyms. So what we used to call DAME is now DAMEWARE. DAME is a general platform where we are putting all the Web applications which we have been implementing over the years. And as you can see, there are quite a few things. I mean, now we have an application of text mining related to globular clusters, a Web application for updating all the knowledge available on globular clusters. I'm not going to waste much time on those things. Now we have also opened a new section for the Euclid Mission Data Quality; our group was in charge of this thing.
But what matters is DAMEWARE. DAMEWARE is a [inaudible] Web application resource which is aimed mainly at data mining [inaudible]. Everyone from anywhere in the world can access the infrastructure through a browser, because in fact it is a Web application [inaudible] infrastructure for data mining where the computing infrastructure is completely transparent to the user. I mean, the Web application decides where it is better to run the application as a function of the size of the data and the complexity of the problem.
And we have a series of models which can perform classification, regression, clustering, and feature extraction, which are the [inaudible] of data mining applications.
These models are listed here; the list is quite complete, I'd say. We have many methods for regression and classification. We have some methods for unsupervised clustering. Others are coming in.
And the next to be implemented, which are in the process of being implemented, are Bayesian networks, the random decision forest, and another version of the multilayer perceptron.
Actually, since I have spoken about this so many times, there are things on which I'm not going to waste much of your time. I just want to point out that it has often been said that we must prove the utility of the work which has been done over the years with real scientific results. I never understood this question well.
Why, to prove the utility of an infrastructure, do you need [inaudible]? This is something which is completely [inaudible]. Most of our life, most of our career, most of our work is about simple contributions, I mean, about producing papers of honest, decent work where you just make a measurement better than others or you improve upon something. I don't know why for this type of astroinformatics infrastructure, let's say, we also want something more.
So we made a point of trying -- I mean, we basically opened a call for collaboration with the community for good cases where this platform, DAME, could be used to solve problems with data mining techniques. So basically I don't know what my role is in this model -- as a data miner, [inaudible] a project, or as an astronomer. I don't know anymore what I am, and I think this is a problem which I never solved.
But in the last few years, also due to the fact that there was a cut in funding overall in Italy -- I mean, about 60 percent -- we stopped improving the infrastructure. We couldn't do any more development. We began to apply the existing infrastructure through a series of programs, with results which, in my opinion, are quite satisfying, since basically with minimal effort in all cases we have achieved results that are much better than what is available in the literature at the moment.
And I think also with this model -- I mean, we have been working on massive datasets, on the whole [inaudible], and in some cases on very small datasets, like, for instance, in this case where Thomas Puzia, a colleague from Chile, wanted to see whether it was easier to separate the globular clusters in external galaxies using data mining techniques.
And we just applied the platform -- actually Thomas did, from Chile -- to the problem of NGC 1399. He took 11 parameters, so it's a small dataset, a small number of features, something like less than 3,000 objects [inaudible]; it's okay for data miners. He used a multilayer perceptron with [inaudible] approximation for the classification [inaudible]. It's very effective in generalizing from a small base of knowledge, so it's well suited to this type of problem.
And we performed something which is very important to remember when you do this type of analysis: one of the advantages is that you can perform feature extraction, so you can plug in as many feature parameters as you want and then let the method itself evaluate which features are significant and which are not.
So there was a first run of feature extraction where the network learned how to recognize globular clusters. So we have to stop for a second: what is the definition of a globular cluster? Usually globular clusters in external galaxies are defined -- if there are high-resolution images available, like, for instance, from HST -- using a structural parameter plus color. The idea was to use a base of knowledge built upon HST data to try to identify globular clusters using single-band HST, therefore without colors, plus colors from the ground -- so mixing very low-quality data from the point of view of [inaudible] resolution with the colors obtained from the ground.
I mean, I'm not going to spend much time on this because it's quite complex, actually. It would take me sitting at a table going through the [inaudible] plots together. But basically, here on the upper graph you see the selection performed with the traditional method, and here on the right you see the selection performed using the neural network, where you can see there is a much, much better [inaudible] between the objects identified as globular clusters and the selection box which is marked by this dashed line.
Another example. We participated with this simple tool -- I mean, the one which I
mentioned before, used via browser -- in the PHAT contest for photometric redshifts. This was a large
comparison [inaudible] two years ago where basically all the people who had implemented
photometric redshift models applied their methods to the same dataset. The results and
the statistics were then computed independently, not by the team who was applying the method, but by a sort
of evaluation committee, and, well, this is the result of the application of the same method.
This is the most incredible thing. As always, you see, I made a choice: among all possible cases I
took all the results which were obtained with the same method, okay, with the MLP with the
quasi-Newton approximation. Well, and without doing anything special -- just running the usual feature
selection.
In this case we had a quite large number of features and we had the problem of missing data. In
fact, here "good" basically refers to the amount of missing data: there is no missing data
in this band; this band is quite good, with [inaudible] missing data; and so on down to "poor". So,
since we know that neural networks are very bad with missing data, we just threw
away these last two groups of features. Of the medium ones we used only one. We ran a few
experiments of feature selection and we ended up with this combination -- without these features,
without these features -- which was the most successful.
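A minimal sketch of that pruning step, assuming missing measurements are stored as NaN in a CSV catalog; the file name and the 5 percent threshold are hypothetical, chosen only to illustrate dropping the "poor" feature groups before training.

```python
# Hypothetical sketch: drop feature groups with too much missing data,
# since a plain MLP cannot handle NaNs in its inputs.
import pandas as pd

df = pd.read_csv("phat_catalog.csv")     # hypothetical file name
max_missing = 0.05                       # keep only the "good" bands

missing_frac = df.isna().mean()          # fraction of NaNs per column
good_cols = missing_frac[missing_frac <= max_missing].index
df_clean = df[good_cols].dropna()        # drop the few remaining bad rows

print("kept features :", list(good_cols))
print("objects left  :", len(df_clean))
```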
And then we ran on our training set to obtain our final result, and it's what you see here,
compared against the other empirical methods of the PHAT contest. QNA is our method. This is the table
from the paper.
I mean, QNA in the 18-band case -- which for us actually was this one, without this band, because
this band really conveys no information whatsoever, and this is something which none of the other
methods noted: these bands were just useless. If used, they increased the computing time and
decreased the accuracy.
So our results were in one case -- the 18-band case -- comparable with what you could obtain with
another [inaudible]; in the other case they were by far the best.
So even in problems which are -- I'm done. Really. It's really weak.
Another example is this one. This is a paper which is coming out now, which was inspired by a
paper by yours truly and [inaudible], where you apply data mining to simulations of -- to the mixture of
the various quasars. We applied our same method to a larger dataset extracted from the
overlap of GALEX, UKIDSS [inaudible], WISE, and SDSS for quasars, in order to see whether
it was possible to obtain a good estimate of photometric redshifts for quasars.
And, again, it's something which you cannot do easily with other techniques.
After a phase of feature selection we found that E5 and E16 are by far the most effective. So
basically WISE is completely useless as far as the photometric redshift derivation
is concerned, and also some ways of measuring magnitudes are completely useless as well.
What we found is this. And I'm sure you would like to discuss this, because here we are
better than you: we go down to 0.101, with the percentage of outliers at whichever level you want -- one sigma, two
sigma, three sigma, four sigma. The decrease is very [inaudible]. And we are not using the same
[inaudible] objects; we are using only 8,000 objects.
So even without knowing the model, the physics behind it, just by using data mining techniques in a proper,
rational way, let's say, you can obtain very useful scientific results. Well, maybe we
cannot beat the killer app, but for sure these methods are extremely effective.
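To make the kind of numbers being quoted concrete -- a scatter of the normalized residual Δz/(1+z) and outlier fractions at one, two, three, and four sigma -- here is a hedged sketch of an MLP regression for photometric redshifts. The catalog file and columns are hypothetical, and this is not the actual pipeline used for the GALEX/UKIDSS/WISE/SDSS sample.

```python
# Hypothetical sketch: MLP regression of quasar photometric redshifts and the
# usual normalized-residual statistics (sigma and outlier fractions).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("quasar_photometry.csv")      # hypothetical merged catalog
X = df.drop(columns=["z_spec"]).values         # selected magnitudes and colors
z = df["z_spec"].values                        # spectroscopic redshifts

X_tr, X_te, z_tr, z_te = train_test_split(X, z, test_size=0.3, random_state=0)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(30, 30), solver="lbfgs", max_iter=5000))
model.fit(X_tr, z_tr)

dz = (model.predict(X_te) - z_te) / (1.0 + z_te)   # normalized residuals
sigma = dz.std()
print(f"sigma(dz/(1+z)) = {sigma:.3f}")
for k in (1, 2, 3, 4):                             # outlier fractions at k-sigma
    print(f"beyond {k} sigma: {100 * np.mean(np.abs(dz) > k * sigma):.2f}%")
```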
So very quickly I just want to focus for a few seconds on what the problems are. Well, if
some funding comes back to the Italian shores, this is the direction in which we intend to move,
together with our colleagues at Caltech.
As we have seen, one experiment for each application requires hundreds and hundreds of runs,
because you have to do feature evaluation, you have to choose the better algorithm, and so on. And
most of these algorithms, especially on large datasets, are computing intensive.
So if to do this you have to move huge amounts of data, the thing is not going
to work. So at the moment we're trying to define the standards to be able
to move the Web applications together with the data.
This is an idea which we already suggested a couple of years ago at an [inaudible] meeting,
but due to the lack of funding we have not been able to proceed with it.
But the idea is exactly that we want to have a set of data mining tools in common, shared by the
data repositories, and basically to have interoperability -- not only of the data, but also a closer
interoperability of the Web applications. So this is one thing.
There are also other directions in which we are trying to move in collaboration with others. But
really my chairman is pointing out that the time has run out. So I just want to mention that another
useful approach is also to optimize the computing time of the individual models.
We are exploring the use of GPUs for specific models; we have done an implementation with
genetic algorithms. But I think that the real problem is to find a way to standardize these Web
applications in order to be able to move them with the data.
Perfect time. I'm done.
[applause]
>> Yan Xu: One or two questions?
>>: The papers that you showed were mainly done by people who have been working directly
on the development of DAME.
>> Giuseppe Longo: Because this is part of the story, yes. This is --
>>: Are you starting to see more --
>> Giuseppe Longo: Well, Puzia, for instance, the Chilean colleague -- Puzia is the one of the globular
clusters, the one who started [inaudible]. The work on the photometric redshifts has now been
moved inside the [inaudible] because it will become one of the methods used -- you know, it's a
slow recognition. I must also be honest: neural networks for photometric redshifts have
been used for many, many years.
It's easier to accept something which is already common. So the fact that Euclid has accepted data
mining methods in -- let's say, in that context, makes sense.
The problem is that there is the usual long-standing question: most of the community sees
these models as black boxes, and therefore there is a sort of skepticism about them. And the only way to
beat it is just to show with papers that they work.
So we are open to collaboration with everybody. So that's, in a few words, the history of five
years -- three of them, anyway. Thank you.
[applause]
>> Yan Xu: The next talk is the last talk of this morning, by George.
>> George Djorgovski: And I'll be accompanied by [inaudible] Ciro when the time comes.
>> Yan Xu: Okay.
>> George Djorgovski: So Jim really brought up some really important points about the
need for better visualization of data, more natural interfaces, and so on. I think technology is
pushing us in that direction. And there are some very interesting things that we can be doing, so I'll
show you a little bit of the stuff that we've been doing.
As probably some of you know, we've been experimenting with virtual worlds as a platform for
scientific research collaboration as well as data visualization. And that basically rides on
technology that's driven by Moore's law and paid for by the commercial world.
3D video is coming thanks to the movies and other entertainment. Games are already virtual
reality; all the digital natives have grown up with it and think nothing is more natural. And
people are paying attention, including people like the chief technology officer of Intel,
and we have a good collaboration with Intel Labs in Oregon on doing this.
So this is something that is going to be as informative as the Web itself was, and it behooves us
to find out what we can do with it. So today I'll focus explicitly on some data visualization
angles of it.
There is already a lot of immersive, collaborative, interactive data visualization going on in
virtual world platforms of different sorts, and not just in astronomy but in other fields as well:
mathematical objects, molecules, numerical simulations of globular clusters, visualizations of
data mining in the Intel world that I couldn't bring up.
And we already know how to encode up to maybe 10 dimensions in these pseudo-3D displays.
One of the new things that our student Franz Sauer has done is that we can now click on
individual data points and pull up Web pages, archives, or other links for follow-up
information. So if you see an outlier, you can find out what it is and carry on from there.
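As a generic illustration of the idea -- not the OpenSimulator or Unity implementation described here -- a small matplotlib sketch that encodes two extra dimensions as color and size on top of XYZ and opens a Web page when a point is clicked; the catalog columns and the archive URL pattern are made up.

```python
# Hypothetical sketch: a 3D scatter with two extra dimensions encoded as color
# and size, where picking a point opens a follow-up Web page for that object.
import webbrowser
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # noqa: F401  (registers the 3d projection)

df = pd.read_csv("objects.csv")                  # hypothetical catalog
fig = plt.figure()
ax = fig.add_subplot(projection="3d")

sc = ax.scatter(df["x"], df["y"], df["z"],
                c=df["param4"],                  # 4th dimension -> color
                s=20 + 80 * df["param5"],        # 5th dimension -> marker size
                picker=True)

def on_pick(event):
    if event.artist is sc:
        idx = event.ind[0]                       # first of the picked points
        name = df.iloc[idx]["name"]
        # hypothetical archive URL pattern
        webbrowser.open(f"https://archive.example.org/object/{name}")

fig.canvas.mpl_connect("pick_event", on_pick)
plt.show()
```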
So this has been done using the so-called OpenSimulator technology, which has so far been
the standard in this business -- it's an open source version of the old Second Life, but that's pretty
much dying. The new thing is the Unity 3D library, and we have done a lot of development with that. Our
student Alex Cioc has been doing this work, and Ciro will demonstrate some of it, but first let
me take you through just a quick demo of what we have.
So this is a real data display. The data are actually an 8-dimensional dataset on quasar spectral
energy distributions from Gordon Richards et al. What's shown here on the three XYZ axes are flux
ratios -- I don't remember which ones -- and the colors and sizes of the data points encode two other
physical parameters, which again I don't remember.
And, you know, you can fly around this, you can see that objects of different kinds partly
separate. Something like this is really essential if you're going to do proper clustering analysis
because things don't look Gaussian, you have strange shapes and topology, and it can certainly
guide proper data mining research.
Now, of course you can have your colleagues join you in this, add data, poke at it, plot new
data, switch the axes, and things like that.
The old one that I just have to show you -- this is the one I used for our car picture. This is again
data from Sloan on stars, galaxies, and quasars, where the XYZ axes represent three colors, sizes
represent [inaudible], shapes represent the kinds of objects, and this sample has actually been heavily biased
towards high-redshift quasars. That's what these guys are. And you can see that you can really
find where the outliers are in a much more effective and intuitive fashion, and you can
immerse yourself in the data with your colleagues and interact with them and so on.
So, Ciro, do you want to demonstrate the DAME player now?
>> Ciro Donalek: Yeah.
>> George Djorgovski: This is done on a standalone platform, but also more importantly
through the Web browser. So that's what Ciro is showing.
>> Ciro Donalek: [inaudible] but we want to do something -- also something different: to be
able to have a collaborative environment through the browser or even on your
tablet. Yes, we can also make an app for the iPhone or iPad.
So we tested a few game engines like ShiVa, but we ended up using Unity 3D
because we really like the platform and we can program in JavaScript, C++, C# [inaudible]. So
now this is the same dataset, and we're going to visualize [inaudible].
>> George Djorgovski: We should point out that in the old platform we can do thousands of
data points and with some effort maybe 150,000.
>> Ciro Donalek: Yeah [inaudible] 100,000 data points.
>> George Djorgovski: Easily.
>> Ciro Donalek: Yeah.
>> George Djorgovski: Very fast. And we're working towards a million.
>> Ciro Donalek: Yeah. If this takes -- okay. And -- okay. Don't pay attention to
these [inaudible], because this is actually the fifth time we are showing it, and this is a
development [inaudible]. I wasn't even aware [inaudible] to show this.
So you can do a bunch of stuff. Okay. [inaudible] the data. Okay. [inaudible] you can click
[inaudible] Web page. In our case each name is associated [inaudible]. And then we can go
back. And we can choose to remove like clusters from the cylinders or tubes, cylinders and
things like that.
>> George Djorgovski: [inaudible] solo mode, but this is written to be a shared visualization
with different partners at different locations.
>> Ciro Donalek: Yeah. Actually I can give you the [inaudible] if you have a Mac or Windows,
and you can connect to mine just through the IP address. So now you can connect, and I can run
in broadcast mode right here, and whatever I do, you will see. And we can also track. So I
can remove my points and you will -- now, there are some limitations in this. That's about points
[inaudible].
>>: [inaudible].
>> Ciro Donalek: Huh?
>>: [inaudible].
>> Ciro Donalek: No. Whoever is the -- whoever is [inaudible]. And so I can release the button
and you can press the button and send them -- you can send the [inaudible].
>>: [inaudible] on the data [inaudible] I'm not allowed to make [inaudible]?
>> Ciro Donalek: Yeah. Yeah. I mean, if I am broadcasting and you want to broadcast, you just
tell me. I push the button again, which releases my broadcast, and you can press your button
and then you're going to see what [inaudible] --
>> George Djorgovski: So collaborators can agree who is going to be changing the data. It has
to be done synchronously for all. And you can choose, however, to have your vantage point be the
same as the person driving the visualization, or you can detach yourself and look from a different
angle.
>> Ciro Donalek: [inaudible] more things that give you [inaudible] you can see some
information here, and it can be whatever you have stored. Right now I'm just plotting the
coordinates, but it could be [inaudible]. So we can make some good stuff, like for making
video games. You can shoot [inaudible].
>> George Djorgovski: Oh, come on.
>> Ciro Donalek: So you can shoot that.
>> George Djorgovski: So there is a --
>> Ciro Donalek: [inaudible] the free version of the Unity platform.
>> George Djorgovski: So the idea is to have a Web-based 3D data browser that you don't have
to do anything for -- it would be just available to you. Then there would be a little more
sophisticated version later.
>>: So are color selections hardwired by the input data?
>> George Djorgovski: No, no, no.
>>: Or can that be --
>> George Djorgovski: There is a -- there is a standard format specifying which column will be
represented how: as X, Y, Z, and so on. We can switch that internally in the graphics, essentially.
>> Ciro Donalek: Yeah. And the clusters are also encoded with color. So you have a column that
you call cluster, it takes a number 1, 2, 3, and these are encoded with shapes and colors.
>>: For example, could you select a subset of these points and color them yellow?
>> George Djorgovski: You have to interact with the data file. There is no such option in
browser yet.
>> Ciro Donalek: No. But that's possible.
>> George Djorgovski: So there's a fine line between how much you want the data browser to do,
becoming more clunky, versus keeping it streamlined and having a separate data program that you
can scale however you wish.
But the coordinates that are available to you are X, Y, Z, size, shape, color -- which can be one or
two -- brightness, transparency, and then beyond that we're talking about other shapes for the points,
as I said, [inaudible] on the data points; or, if they're not spherically symmetric, they can spin or they
can be oriented in a different way. They can also pulsate at different rates. Now, at that point
you have a raging headache.
>>: So -- sorry.
>> George Djorgovski: Yeah.
>>: Maybe you just said this. How do you get the data in there? Is this being set up by you, or
is this something --
>> Ciro Donalek: No. I just loaded the data before.
>> George Djorgovski: We now use plain ASCII or CSV files; we will have VOTable input
as well.
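Purely as a hypothetical illustration of what such a plain CSV input might look like -- the actual column conventions of their format are not specified here -- with positions, two extra parameters, a cluster label encoded as shape and color, and a name used for link-outs:

```
# objects.csv -- hypothetical input format for a 3D data browser
name,x,y,z,param4,param5,cluster
SDSS_J0001+0001,0.42,1.13,-0.27,18.9,0.31,1
SDSS_J0002-0130,0.11,0.87,0.64,20.2,0.75,2
SDSS_J0003+0215,-0.35,1.42,0.08,17.4,0.12,3
```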
>>: The plan would be that this is a cloud service, so you just point something on that Web page at a
local data file and it will read it in and then --
>> George Djorgovski: Yeah. You can read your own local data or data anywhere on the
Internet.
>>: Do you have some [inaudible] and then generate data or pull data out of databases using
some other cloud service?
>> George Djorgovski: Yes. Give us some time. Okay? This is not even alpha version yet.
>> Ciro Donalek: [inaudible] a lot of data. I'm just copying and pasting [inaudible] and can be
[inaudible].
>>: But that's the ultimate -- one of the ultimate goals, right?
>> George Djorgovski: Oh, yeah. Sure.
>> Ciro Donalek: Yeah, yeah.
>>: [inaudible].
>> George Djorgovski: Absolutely.
So to conclude here, and then we can take some more questions: in addition to this, I mean, there
is a whole number of technologies that are coming up -- 3D displays, of course, but also
haptic interfaces, and Kinect is the most prominent of them.
I'm of the opinion that Kinect may be the single most important thing that Microsoft has ever done,
because it's opening up a whole new paradigm of how to interact with cyberspace -- not just
playing games, but interacting with other people, interacting with data, and so on.
And there will be a choice of whether people want to be represented as their own video version, a
3D video version, or as avatars or whatever. We're still learning how to use these technologies
properly. But this is essentially science fiction coming to life, right -- like the displays of
data that are manipulated by hand. We can already start to do this with Kinect. In fact,
we're working on Kinect thanks to input from Jonathan and others.
And Jim [inaudible] pointed out this great little new piece of equipment called Oculus. We have
invested in that, and we expect to get the SDK for it in December. So this would be a lightweight,
easy way to get a complete immersive sense of 3D VR.
Eventually maybe we'll all have Google glasses or something equivalent with which we can overlay
[inaudible] or fully immersive VR on our vision.
So that's where we're going. So the key points are here. I mean, visualization is probably the single
most difficult problem we have now in massive data science, because you don't really understand
things intuitively unless you can visualize them somehow. And we have problems not just with the
quantity but with the complexity of the data, its dimensionality, and so on.
Another thing -- if there is one thing that I learned from playing with VR, it is that we are totally
meant to interact in 3D. We're optimized to interact in 3D. And so the whole two-dimensional
display thing is an obsolete artifact. And so this could conceivably lead to greater insights as we
explore the data.
Even so, I mean, there are still limitations to what the brain can process. I said up to a
dozen dimensions; that really is the absolute maximum. Effectively maybe six, seven, or eight dimensions are
what you can comfortably deal with in a display.
So we now already have working tools that are freely available to anyone. You can do this on
virtual world platforms, and we're developing this one that's Web-browser based, which I think
will also be zero cost until we get into the goggles and stuff, in which case you have to pay
money for those.
But the goal is to be able to do fully immersive 3D visualization anywhere -- on your laptop,
desktop, at home, the airport -- without any special equipment.
So that's that. Thank you.
[applause]