>> Yan Xu: Welcome to the third day’s lecture and it’s my great pleasure to
start off by introducing Steven Drucker from Microsoft. He is going to tell
us about interface exploration for managing complexity.
>> Steven Drucker: Thank you. So I tell my students not to begin talks with
disclaimers, but I am going to do that right now. I am not an astronomer. I
am not a statistician. My background is in computer graphics and more
recently in interfaces and information visualization.
So I am going to talk a little bit about this, and really I wanted to show a
couple of demos, actually three demos, of things that we are working on right
now. So first, a kind of brief intro. It probably is not necessary for this
audience, but I can never quite tell. What is information visualization
about? Really, it's about how we understand data, and how we understand data
by taking advantage of the human perceptual system.
And the way we go about doing this, and again I am going quickly because I
assume most of you know this, is we convert this information in some way to a
graphical form, a graphical representation, so that we can use our perception
of patterns, colors and other aspects to see patterns. And there are lots of
questions: how do you go about doing that, and how do you do a better job
than other methods? That is what information visualization is about.
Really, specifically, as I said before, it's about making these large data
sets coherent: how to summarize this information and present it compactly.
Hey, I have got a talk right now, how nice.
It's also about presenting information from various viewpoints and showing
information at different levels of detail. Marti Hearst is a researcher at
Berkeley and she has a great intro to information visualization that you can
find on the web. Ben Shneiderman at the University of Maryland also has a
fine book about it. And he coined the visualization mantra: overview first,
then zoom and filter, then details on demand. That's the pattern of the talk
I am giving: overview first, and then I am going to zoom in on a couple of
interesting problems.
First of all, a lot of people come to me when I say, "I do information
visualization". They say, "Oh, well I have got large data and I am really
skeptical that your visual system is going to be able to do anything about
that". And I like to point them to Anscombe's quartet. Francis Anscombe
came up with this nice example in 1973. This is a small data set and you
say, "Okay, here are four things; if you actually look at the statistics of
these four things they all have the same means, the same variances, the same
regression lines. Can anybody see the patterns in this from looking at
these tables?"
Well, you might be able to; you might be good at that. But if you look at a
very simple visual representation, you can see what these patterns are. And
to me this is one of the most profound simple examples. Look, we have got
trends going downwards, we have got outliers, we have got everything the
same with a single outlier, we have got scatters. So from this you can
immediately say, "Everything is not going to be that simple, but this is to
me the essence and core of why it is important to visualize your data".
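To make the example concrete, here is a minimal sketch in Python (assuming
numpy and matplotlib are available; the numbers are the published quartet
values): the printed statistics come out nearly identical for all four data
sets, while the four plots look completely different.

```python
# A sketch only: plot Anscombe's quartet and print its summary statistics.
# Assumes numpy and matplotlib; the values are the published 1973 data.
import numpy as np
import matplotlib.pyplot as plt

x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
quartet = [
    (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])),
    (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])),
    (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])),
    (np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float),
     np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])),
]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, (x, y) in zip(axes.flat, quartet):
    slope, intercept = np.polyfit(x, y, 1)      # least-squares regression
    print(f"mean(y)={y.mean():.2f}  var(y)={y.var(ddof=1):.2f}  "
          f"r={np.corrcoef(x, y)[0, 1]:.2f}  y={slope:.2f}x+{intercept:.2f}")
    ax.scatter(x, y)                            # the shape only appears here
    ax.plot(x, slope * x + intercept)           # nearly the same line, 4 times
plt.show()
```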
Other people are realizing this. There is a company called Tableau that is
trying to make this turnkey, so that end users can visualize their
information without programming in something like R, in a very drag-and-drop
way. And that is, in some ways, the current state of the art in the industry
in visualizing information. But rather than show something like Tableau I
wanted to show some recent directions that we are looking in.
So really I am going to talk about two kinds of things. I am going to have
to log onto this in a moment. One of them is how do we make it more natural
to interact with data. How do we use a tablet? Tablets are starting to be
pervasive. Is there some way of taking something like Tableau, putting it
into someone's hands, and can they use that? Then I want to talk about
something probably more relevant to you guys, which is how do we start
scaling this up? How do we make it so that we are dealing with petabyte
databases, or even just looking at large amounts of data all at once?
So for this tablet thing I am going to have to take a little moment and log
onto my tablet with this stupid keyboard here, because Microsoft security
makes it log out every 15 minutes if untouched. And that was just long
enough for it to log out, so one moment here.
>>: You should worry when it logs out just as you are typing.
>> Steven Drucker: No, it's when it installs the updates.
Okay, so this is a project we are doing right now; in fact it's going to be
submitted to a conference next week. From an interface standpoint, you have
got two possible ways of going about putting an application on a tablet
device.
One is that you can follow the paradigms that you have gotten familiar with
on the desktop and just make it tablet enabled. Just make it so that the
buttons are bigger, the menus are bigger, you can tap everything, you can
operate all the controls, and there is no hover. That's one way of doing it,
and it's got some real advantages because people are familiar with it; they
have built up paradigms over years of how to use a desktop-like application.
The other way is, let's re-think it. Let's see if we can design it touch
first, with direct manipulation, where you are really touching the data. And
will people find the benefit of that? That's the fairly simple exploration
that we did.
Basically we built two full prototypes of this, and I wanted to just show
you these prototypes. This is mostly business data, but hopefully you will
get the idea. I will first show you the prototype. Let me switch over to
here.
Okay, good. So I will first show you the standard prototype. This is a
business data set of coffee sales over a year, and you can see, okay, look,
in 2011 I did better than in 2010. We can do things like, let's actually
view this by the regions that we are in, and you can very quickly see that
the west region was slightly better than the central region. If I want to
view by something like month and find the best sales month, you might
actually want to sort your data.
So I can't tell if you can see where I am clicking, but everything on this
interface I am touching on this control panel. It's a very standard way of
interacting with this information. Now if I want to actually see, you know,
what sold in the month of July, I can go down here and say, let me turn off
everything except for, well, let's say January. So that's the month of
January; again, very simple to do that sort of thing. Let me turn everything
back on here.
Likewise, if I want to drill down and focus on how much coffee sold in each
of the months, I can see that. And if I want to I can split this out by,
you know, region. Let me actually reset here. Break it apart by region and
then drill down by something like caffeine type and see, did caffeinated
beverages sell better than the non-caffeinated, and in what region? It seems
like in the South they sold about equal, but in the West, where we need
coffee, we usually don't get weather like this, we sell a lot more
caffeinated than decaf.
So this is a fairly simple interface to understand. What we are trying to
do now is compare this with a kind of re-thought interface, which gets rid
of that chrome. We are calling it the chromeless version. Everything is on
the graph; you are actually interacting with the graph, in some ways as if
the elements were things that you could touch.
So first of all we can simply view it by month. And now that I have got it
by month, I can simply drag on it and sort it this way, or I can drag it and
sort it that way, or we can alphabetize it. So dragging goes like that, and
you can sort just by touching the axis and dragging. Even more important is
that we can throw things out. So I am actually just touching the data and
throwing it out. If I want to throw out a bunch of things I can throw them
all out, or if I want to focus on one individual item I can focus on that
item. And then if I want to I can switch to different product types and say,
"Okay, I just want to see espresso sales in a particular region", and be
able to do that very quickly. So, fairly simple different approaches to the
same problem.
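As a rough illustration of the chromeless idea, here is a minimal sketch, in
Python for brevity (the prototype itself, as noted later, is written in C#),
of how direct-manipulation gestures might map onto the same query operations
that a control-panel UI exposes. All of the names and gesture events here
are hypothetical.

```python
# A sketch only, in Python for brevity (the prototype itself is C#).
# Hypothetical mapping of chromeless touch gestures onto query state.

class ChartQuery:
    """Current view state of the chart: grouping, sort order, filters."""
    def __init__(self, rows):
        self.rows_all = rows        # list of dicts, one per record
        self.group_by = "month"
        self.sort_descending = None
        self.excluded = set()       # category values thrown off the chart

    def on_axis_tap(self, field):
        self.group_by = field       # tap an axis label: regroup by it

    def on_axis_drag(self, direction):
        self.sort_descending = (direction == "down")   # drag axis: sort

    def on_bar_flick(self, category):
        self.excluded.add(category)  # flick a bar away: filter it out

    def visible_rows(self):
        return [r for r in self.rows_all
                if r[self.group_by] not in self.excluded]
```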
And we genuinely did not know which was going to be better. The paradigm
that we work in is that we build these interfaces, we show them to people,
and we try to have them solve real problems with them. So we actually asked
users who had some experience doing charts and business visualization to
come in and use both of these interfaces.
And by the way, I don't know if you realize it, but this second interface is
actually slightly crippled in comparison to the other one. You can do
everything that you can in the other interface, except that if you want to
filter something out you need to be viewing the data in that form. So if I
want to filter out coffee I need to be seeing coffee, tea and the other
product types in order to be able to filter it out.
And so anyway, it's like, okay, are people going to have problems with this?
We were actually really surprised, which is why I am telling this story. We
just sampled 17 people, so certainly not a huge sample, although the result
was statistically significant: 14 of the people really liked the chromeless
version. Now it's hard to know why, and we are delving into why that's the
case. It might be that it was new and novel. It might be that the problems
we were asking them to do, since they could be done on both these
interfaces, suited the chromeless interface. It might be because the screen
was bigger; they had more real estate to work with.
But really, what people said is that they felt that their hands were on the
data. They were touching it, they were feeling it; they really felt that it
matched the flow of how they were solving the problems. And as I say, this
is work that we are still doing. It is going to be submitted soon, so we are
still analyzing all the data and timing, but in most categories 13 of these
people really liked this better and 4 liked the other better, so we know it
wasn't simply that one was crippled; some people liked the other interface
better, and you could do everything. So we tried to make this as fair as
possible.
So I think this is pretty interesting and intriguing, especially if you look
at how we actually start porting these applications. Does it make sense to
build an application specifically for a different UI? And yes, these people
weren't switching back and forth, and there are all sorts of other
considerations. But clearly, at least here, people liked a re-thought
interface that matched the device that they were using. So that's kind of
the conclusion that we are coming to on this.
Now let's switch back here. Okay, so again, just to summarize a little bit:
13 preferred chromeless, 4 preferred this, and these are some ratings on
subjective measures. The scores on the right were essentially for the
chromeless interface, and the questions were: how easy was it to use? How
easy was it to learn? How quickly could you do it? And all these other
things. And you can see that yes, there were some outliers, but there was
clearly a huge difference in user preference on this.
Okay, let's actually move on to the next area. The next area that I want to
talk about is scaling up data visualization, and scaling up actually
interacting with data. So the idea right now is, this is an environment that
we built that is essentially all about scripting. If you are familiar with
R, and I see someone has got a poster about R here, this is very similar to
that kind of environment, where you type something. One of the differences
of this environment is that we are always trying to provide some
visualization feedback as you type. So if you type something and we can do a
histogram of the results, it will do that histogram.
We are looking at automatically inferring the best visualization to do based
upon the data you are seeing. So again, you can see this experiment, and
part of this is that we are trying to create an entire environment where
people can be online, evaluate data, share it with other people, and
experience what they are doing along with what you are doing.
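As a rough illustration of that inference idea, here is a toy heuristic; the
rules below are illustrative assumptions, not the actual system's logic.

```python
# A toy heuristic only; these rules are illustrative assumptions,
# not the actual system's inference logic.
def infer_chart(column_types):
    """Pick a default chart from the types of the selected columns."""
    kinds = sorted(column_types)
    if kinds == ["categorical"]:
        return "bar chart of counts"
    if kinds == ["numeric"]:
        return "histogram"
    if kinds == ["categorical", "numeric"]:
        return "bar chart of an aggregate per category"
    if kinds == ["numeric", "numeric"]:
        return "scatter plot"
    if kinds == ["datetime", "numeric"]:
        return "line chart over time"
    return "plain table"           # fall back when nothing obvious fits
```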
Now this doesn't help you deal with petabyte data sources. What does help
you deal with petabyte data sources is this next notion of progressive
queries. So the idea here is that we issue a query, and I might want to stop
this video for a moment. We issue a query, and instead of waiting for the
day that it might take to calculate the entire query, we actually start
getting results back from the query right away. And looking at the
incremental results allows us to, one, see if we made a really big mistake
in our query, and this happens a lot of the time. This is actually some
flight data and we are looking at delays per week, and when I first did this
query I ended up computing the sum of delays by week instead of the average
delays by week. And if you look at that it's completely nonsensical.
And you can find that out immediately, as opposed to issuing a command as in
the batch days of computing: you issue a job overnight, you wait for it to
come back, and then you find out that you made a mistake. So the idea here
is to start looking at the results right away; whoops, let me go back here.
So the idea here is, you issue a command, you start looking at your results
as they come back, and you also get confidence intervals based upon what you
have seen so far: what are the means, what are the variances of what you
have seen so far? So you can estimate fairly quickly how well the answer you
have so far represents the final result.
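A minimal sketch of that running-estimate idea, assuming Python and
Welford's online algorithm for the mean and variance (the real system,
described shortly, is built on C# and StreamInsight):

```python
# A sketch of the progressive-query estimate, assuming Python.
# Welford's online algorithm keeps a running mean and variance, and a
# normal approximation gives the confidence interval; this tracks mean
# trends well but says nothing reliable about outliers.
import math

def progressive_mean(stream, z=1.96, report_every=10_000):
    """Yield (count, mean, ci_half_width) as values arrive from a stream."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0 and n > 1:
            stderr = math.sqrt(m2 / (n - 1)) / math.sqrt(n)
            yield n, mean, z * stderr        # e.g. 95% CI: mean +/- width
```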
Now, this is not great for outliers and other problems, but it is very good
for mean trends. So you can see here that after I have looked at only a
fraction, in fact only about 0.2% of the data, we already know that Thursday
and Friday are the worst days of the week to travel if you care about being
delayed. And I can stop it right there. So those of you that are going to
be traveling out --.
>>: Then send people to the airport today.
>> Steven Drucker: Yeah, exactly.
>>: Guess I am leaving then.
>> Steven Drucker: Sorry, I will just, if I can get that there, yeah. So you
can see very quickly, and that is statistically significant at that point.
So the idea is to actually combine both of these techniques into one
application. So here we are issuing queries against a database. We start
the query and the results start streaming back right away. Here we are
looking at, essentially, what other words were typed in along with "weather"
in a query log. And we can start looking at those results right away. In
this case we are just looking at the word lengths, which is not really
important. But the important thing here is that you can type queries against
this large database, start streaming those results back, and begin
evaluating and going further.
And we are seeing, you know, huge amounts of improvement. And a lot of the
trick to this is how we actually structure our databases in order to make
incremental queries possible. And the first exploration was simply: would
people find it useful? Because maybe people would say, "Oh, I want the exact
results". But we found that people make so many errors when they are doing
queries, and they really want this sort of incremental feedback that they
are on the right track. So people investigated further, trying different
queries and different things, when they had this facility available.
Okay, so some technical notes. Right now this is written in C#, so you are
typing in sort of incremental C#. We are using something called
StreamInsight, which is a streaming database back end that allows us to
write regular SQL queries but essentially stream the results back
incrementally. And we are using an internal MSR toolkit that lets us do
visualizations in HTML.
Okay, so the last demo that I want to do was actually finished at about
9:00am this morning, no, actually it was 8:30am this morning. So this is
about how we actually visualize data, there you go, if you have got every
single point. Right now we are looking at 50,000 points here, and these are
points from a census data set. Right now it's just being presented up here
randomly, but I can actually start to see what's going on here as it
resolves into shape.
So this is using the graphics processing unit to render every single point
in real time. And it allows me now to start exploring this data and looking
at the transitions between different views of it. So this data is just
census data from different regions. And if we look at longitude, or
latitude, actually it's latitude, but let me just look at longitude here,
you can see that there are actually far more counties in the East than in
the West, because the counties in the West are bigger.
If we switch back to the map view you can actually see that's the case: that
there are big regions where there aren't too many counties. That's not all
that interesting, but if we start looking at some other patterns here, we
can really start investigating. So I am going to look at per capita income
in different areas. And you can see in this sort of heat map result that
here is New York City where people are making a lot of money, you have got
Silicon Valley, a little of the Seattle area, and you can see the pockets
and the cities that start having more income.
Now, that's something any visualization can show, but it's nice to be able
to do it interactively. And what's also nice is to be able to change what
you are looking at. It's not just a map; let's actually look at how this
works if we are looking at things like unemployment rate. So that's
unemployment rate across the country, and you can see in different areas
where there are higher peaks of unemployment. These are actual counties
with high unemployment, so you are seeing a real problem in this area.
And actually, let's change this and not just look at it geographically;
let's look at this based upon things like the percent with a bachelor's
degree or higher. And now we see some clear trends: the counties that have
more educated folks tend to be making more money, and there tends to be a
lower unemployment rate.
>>: Are there any [indiscernible]?
>> Steven Drucker: Yeah, like I said, this was done at 8:00 this morning. I
have an older version that is not quite as stable, but essentially the axis
that you are looking at along here is the percent with a bachelor's degree
or higher. So it pretty much goes from 0 to, I am not sure what the maximum
is. And then the axis along here is the percent unemployment. I think if we
look at one of these points here, sorry, I should be able to click on it and
find that. Okay, well, this is not working. I should be able to find out
what the county is for each of those areas. So I will go back to here:
average household size, unemployment, per capita income. So yes, as is true
of any visualization demo, I feel embarrassed that I am unable to do that.
So the point here, though, is to be able to very quickly surface outliers.
And being able to see those outliers is one thing, but being able to
actually, let me just scale this down here, well, it reset there. As I said,
I just got this done. This is unemployment rate, boom. This is bachelor's
degree or higher, and I should have another visualization. These
visualizations are actually linked together. Again, I am not used to
showing this on a little tiny screen here, but if we look at this data --.
So these are two web pages that I am looking at simultaneously, and they are
linked together. So you could be on your computer, I could be on my
computer, and we can take a look at these kinds of outliers and see where
they show up in the other visualization. So you can actually see, yeah,
let's go here. You should be able to see very quickly that Flint, Michigan
and a couple of other places, Newport, Rhode Island was there before when I
was looking at this, are outliers in this data.
And again, the point of this is to say: let's link these two data sets very
quickly, sorry, so let's link these. Let's be able to make selections in one
and see the correspondence in the other. Let's be able to connect the data.
And the reason why these histograms that I am showing over here are useful
is because you actually see where this data is coming from. It gives you a
visceral feel for how many there are and where these individual points come
from. So again, this is experimental, but what we are really trying to do
is let people look at outliers, and let people see in multiple corresponding
views what things they are selecting.
So let me get back and summarize, because I have only got a few moments
left. Right now this is about multiple linked views of data and layout. I
didn't show filtering because in this version filtering is not working, but
motion should also reveal these patterns to users. Right now it deals with
an arbitrary database. And the way filtering works in the other prototype
is that we can filter based on any selection. So you can select in one
view, filter to focus on just that data, and then re-lay that out.
And in order to deal with 50,000, 100,000, and actually we have gone up to
300,000 points, and still maintain interactive rates, we are using GPU-based
acceleration. Right now it's implemented in WebGL because that gives us the
benefit of being able to put it on anybody's desktop. So we have tied 4 or
5 of these together so people have had joint analysis sessions, at least on
simple data on this prototype, so that they can all be talking together and
saying, "Oh, what about over here, let me see this". And right now, anybody
who makes a selection overrules everybody else's selection; we will look at
the collaboration protocols next.
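A minimal sketch of that last-selection-wins rule, in Python with
hypothetical names (the real prototype is WebGL and HTML, so this only
shows the shape of the protocol):

```python
# A sketch only, with hypothetical names: the last-writer-wins selection
# rule described above, where the newest selection overrules all others.
import time

class SharedSelection:
    def __init__(self):
        self.indices = frozenset()   # selected data-point indices
        self.stamp = 0.0
        self.owner = None

    def select(self, user, indices, stamp=None):
        stamp = time.time() if stamp is None else stamp
        if stamp >= self.stamp:      # the later selection wins
            self.indices = frozenset(indices)
            self.stamp, self.owner = stamp, user

    def broadcast(self, views):
        for view in views:           # every linked view re-highlights
            view.highlight(self.indices)
```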
So as I say, it scales easily to 50K-100K points, we have gone up to
300K-400K, and it supports collaborative analysis. So let me just say some
quick final words. The fact that you have got so much data is not unique.
There is more data showing up all over the place: data.gov is making
government data available, we have sensors collecting all this data, and
really this domain is about looking for ways of analyzing and presenting
the data. And we have some other projects that are looking at more
compelling ways for you to tell stories about the data, for you to be able
to build a guided tour that people can stop and interact with.
But really, in some ways I have been trying to show these pictures very
quickly, but the purpose of visualization is insight, and not the pretty
pictures themselves. And we are looking for different ways of giving
insight. Ben Shneiderman has a great quote about how visualization gives
you answers to questions you didn't even know you had. What was going on in
this outlier? I didn't even know that something was going on there; let me
look into that some more. If you actually know the question you are trying
to answer, there might be a better way: just mine the data and get that
specific analysis. But if you are trying to discover patterns, this is a
promising way to do so. So I ended up just about exactly on time and will
be happy to answer some questions.
>>: Okay questions.
>>: So when you started you first showed at least one-dimensional data,
presented in the style of traditional Excel business graphics, and much
more interesting two-dimensional data here. But our problems are in
effectively visualizing highly dimensional data sets, way more than two
dimensions. How many can you squeeze in? Are there any plans for going in
that direction?
>> Steven Drucker: So, the way that I have been looking at dealing with
multi-dimensional data is the sort of divide-and-conquer approach. I am not
convinced that 3D right now gives much benefit at all. I mean, again, since
you --.
>>: I am sorry, why are you not convinced?
>> Steven Drucker: Partly we are mostly using 2D displays, and on the 2D
display we are already looking at some sort of projection from 3D onto 2D.
>>: You have actually experimented with 3D?
>> Steven Drucker: Yes, quite a bit. I have about 10 years' worth of
experimentation, and in fact the entire information visualization field is
littered with people who have felt that 3D was the way to go, and yet we
have not managed to actually make it useful when tested. Now that doesn't
mean that someone's not going to come up with a breakthrough, but the fact
that we are projecting down from N dimensions into 3D and then down into 2D
means you have got occlusion issues, and you have got size and relationship
issues that don't become immediately apparent.
And again, there are specific domains where I think it can be useful. And
astronomy might be one of them because you have a strong spatial component of
your data. But when you are dealing with abstract --.
What’s that?
>>: That's not important [indiscernible]. And what you said is exactly
contrary to our experiments.
>> Steven Drucker: Oh, that's great. I would love to talk to you deeply
about that, because at least in the information visualization community
people have felt this and tried this, but have never been really successful
in effectively using 3D. Now, I have used GPU acceleration quite a bit;
that's really important. And I have also looked at 3D in another way:
temporal data, and using the temporal dimension as another dimension is
also important.
So there are a lot of different ways. Maybe we can break out afterwards and
talk about that because at least to date there has not been effective stuff
in our field.
>>: So there is a difference in time and space that you are looking at.
>> Steven Drucker: Yeah, exactly.
>>: Yeah, maybe, maybe not.
>> Steven Drucker: Okay.
>>: So when you are doing user evaluation of these interfaces, how do you
normalize the demographics of your users? I mean, 20-somethings are going
to react differently than 50-somethings.
>> Steven Drucker: Yeah, I mean, we try to essentially ensure that we have
got a good sample and that we use age as a variable in the analysis. At
least in the tablet study the ages were actually 32 to 64, so they skewed
older. We actually expected them to be a little more "let's stay with what
we know", and we were kind of surprised that they said, "oh no, let's try
this new way". And maybe it's because lots of them already have phones and
they are doing these gestures on those already, so it's not completely new.
But you are absolutely right; people have a very different facility
depending on their exposure to video games and other things. So it's an
important thing that we do try to take into account. It's hard, because
it's hard enough to get, you know, 20 people who are experienced at
analyzing data to come in and use the product, much less to also control
for age as a variable.
>>: The most [indiscernible] that you showed were filtering; what about
brushing, especially with [indiscernible]?
>> Steven Drucker: Yes, again, I showed linking in that last demo. I think
brushing and linking are really very important, especially with touch. I
give some other talks about a whole bunch of other things. The general
approach I tend to take is: extract some salient features that are going to
be interesting, figure out a layout of those salient features, and then use
sort of divide and conquer to focus in on those regions, plus linking and
brushing. Those are the techniques that I use over and over and over again
when I am trying to pull apart and use data. So I think that's pretty
important.
>>: Does it already animate a third dimension? Time is the obvious one;
then others?
>> Steven Drucker: I mean, it's actually very easy for us to put in other
dimensions and do that. So time is already animated; you can certainly play
it over time. We can also use the third dimension. We are using a graphics
[indiscernible] and we have got x, y, and z positions on any of our data,
so we can present it that way. So yes, absolutely, and we can also map size
and shape onto any of these points and get even more dimensions in. Part of
the history of information visualization is finding out which aspects are
perceptually effective. Layout was the first thing, size is the next, and a
couple of other things.
So, temporal: there is a lot of discussion and controversy about how
important it is to be able to play data over time. Maybe a lot of you have
seen the great demo Hans Rosling gave with Gapminder, looking at the UN
data and the history of that, and he plays this wonderful time-changing
animation. What we have actually found is that with him guiding you where
to look, it's great. If you just play it by itself it's actually not so
great, because too many things are moving.
Now, if you are the one interacting, it helps a lot. So it goes back and
forth how useful animation is, except when you are interacting or when you
are guided in what you are looking at.
>>: Last question. So, your approach to big data, I thought that was very
interesting, to sort of stream it back, get interim results, and keep
showing them. Do you see any hope for dealing with large data where you
want to display quantities that you can't just keep a running total of as
the stream is coming by? You know, like you were doing all sorts of
cross-cutting displays with the census data, different axes and so on. Do
you see any hope for generating that sort of thing in any sort of timely
way with very large data sets?
>> Steven Drucker: It's an interesting question, in that I can see
harnessing a Hadoop cluster and condensing some set of the results on the
fly, interacting with those, issuing queries and seeing those results. I
am not sure how interactive or how expensive this will be in terms of cost.
So, I mean, certainly some of the motivation behind this was to try to
avoid pre-processing. You know, you can build an OLAP cube and be able to
do sums and other things very quickly, but you can't do things outside of
that domain.
So really what we are doing is taking a sample of the data and operating on
the sample, but a priori we don't know how big the sample of that data
should be to be significant. So it's an ever-growing sample. Especially if
you have got a filter on the data, it's really important that you grow, and
grow, and grow, and grow. So that's at least the technique that we are
using right now. Putting computation in the loop would be really
interesting, and I am not sure.
>>: Yeah, that would be very useful.
>> Steven Drucker: Absolutely.
>>: [indiscernible] so, thanks very much again.
>> Yan Xu: So it’s my great pleasure to welcome Kirk Borne who’s going to
talk about conquering the astronomical data flood through machine learning
and citizen science.
>> Kirk Borne: Okay, thank you very much. I just added one sentence to my
slide in the last two seconds there, which caught me off guard.
So, part of this talk is sort of an extension of some of the concepts we
just heard in the previous talk, about how the visual inspection of data
provides a lot of insight, or at least provides an opportunity for insight.
But at the very least it provides an opportunity for people to say, "Ah ha,
I see this in the image". And that's sort of how citizen science was born,
in this sense. I will say a few things in this talk which are very familiar
to the astronomers and extremely familiar to people who are already doing
this stuff, but I assume that there are some people who haven't done this.
So citizen science is essentially volunteer involvement in the science
process. And Galaxy Zoo was a project in which 800 to 900 thousand galaxies
from one digital sky survey were presented to a community of users who
volunteered to classify those galaxies.
So with classification, that step was really just characterizing what they
looked like: were they elliptical in shape, were they spiral in shape, were
they mergers or something else? That's a very short summary, but there's
lots to be said about it which I won't say in this talk.
So of course our problem is big data. Another thing which we have mentioned
a few times this week is the Large Synoptic Survey Telescope. I am not
going to give a summary of what that will hopefully be 10 years from now,
but just mention a couple of data challenges associated with this
telescope, which has been proposed and hopefully will be funded in the
coming years.
So, one of those data challenges is that the LSST will acquire
[indiscernible] images, about 20 terabytes of data, and in these images
there will be roughly 100 million sources per image, taken every 40 seconds
throughout the night. This quantity of data is about the equivalent of what
you can cram onto 40,000 CDs. So from my perspective, asking a student to
mine this data, to analyze this data, to even [indiscernible] this data, to
do something with this data, it's just a completely different paradigm.
Currently I might hand my student a CD of data, or a couple of CDs of data,
and ask them to do something with it. Now it's 40,000 new CDs of data every
single day for 10 years.
So over the life of the survey, 10 years, this corresponds roughly to a
football stadium filled with CDs. This is qualitatively different from
anything imaginable in astronomy so far. So the real challenge for us is:
how do we make the best scientific use of this? How do we make the
surprising discoveries that are waiting in there? How do we find the
unknowns? It goes to this idea that more data isn't just more data; it's
qualitatively different.
The second data challenge is different from the data volume: it's the event
volume. Each night, as a time-domain survey, it repeats observations of the
sky, and each night it will find roughly 1 to 10 million, so let's just say
2 million, new events in the night sky. And an event is anything that has
changed since the last time we looked at that spot.
And the real challenge therefore is: what are those things? Are they really
scientifically imperative to follow up on? Are they more of the same kinds
of things we have seen before? Are they totally new objects that need some
kind of follow-up observation to figure out what they are, and so on? So
the real challenge is to understand how they are behaving before we try to
put a label on them.
So the way I say this is: characterize first, classify later. In the
language of data mining and machine learning, that would be: apply the
unsupervised learning techniques first and then apply the supervised
learning. That is, don't try to put a label on it; that's not the point. We
want to describe what it is. Okay, so we have heard talks about this
already yesterday, and we will hear more I guess today when [indiscernible]
talks.
That is, if you characterize a variable object that appears in your image,
and you say it's increased by 5 magnitudes since the last time we looked,
which was a day ago, and it's, you know, one arcminute from a galaxy, or
it's in the spiral arm of a galaxy, you might have a pretty good guess of
what it might be: it is a supernova. But if you don't know all those extra
pieces of information, then all you have is a single data point, which is a
flux with an error bar.
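As a toy illustration of that example, a characterize-first rule might look
like the following sketch; the field names and thresholds here are made up
for illustration, not anyone's pipeline.

```python
# A toy rule only; field names and thresholds are made up to illustrate
# "characterize first, classify later" for the supernova example above.
def guess_label(event):
    # delta_mag is new minus old magnitude, so brightening is negative
    brightened = event.get("delta_mag", 0.0) <= -5.0
    recent = event.get("days_since_last_obs", float("inf")) <= 1.0
    near_galaxy = event.get("arcmin_to_galaxy", float("inf")) <= 1.0
    if brightened and recent and near_galaxy:
        return "supernova candidate"
    return "unknown: characterize further before labeling"
```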
And we want to characterize it first and then curate these
characterizations, curate these descriptors of what's happened or what we
see in the data, and then allow the scientist to proceed with the
understanding of it, the labeling of it, the classifying of it. And this is
where citizen scientists can come in really handy, because they may not
know the language of astronomy. Well, some of them are very smart people
and do know the language, but in general the volunteers are not required to
know the language. They are not required to know modern astrophysics, but
they can certainly use their own human cognition to see a pattern, to see a
trend, to see an anomaly in an image; if they see something they have never
seen before, if they are trained to look for a certain thing and then they
find things that deviate from it, and so on.
So characterization includes this feature extraction, or first detection
and then extraction: identifying and describing these features. This is
where human inspection comes in very handy. And the end goal is not to
have humans doing all of this, because that would defeat the purpose of
having NSF build a data management system for us. But no, seriously, it's
to train the automatic classifiers which we will build into the pipeline.
So most of, well, hopefully all of the known types of events and objects in
astronomy will have algorithms already in the pipeline, and as we discover
new things through visual inspection, whether by citizen scientists or
science team members, we can re-train the pipeline algorithms to find those
objects. So then the focus goes onto the ones that are unknown, the unknown
"unknowns": the ones that are more outlying with respect to the known
behaviors that we would expect to see.
So in a way this is how we deal with the data flood: you have this sort of
pyramid where you have all these things that you already know about, and
they are already encoded in the pipeline algorithms. And as you move up
this pyramid to the more extreme, more unusual, and rarer types of objects,
you get help with interpreting, analyzing, and characterizing those, and
then that pushes more of those discoveries into the pipeline. And that
opens up the opportunity for, again, visual inspection of the very rare
things.
So what we put in front of the end user, and the end user might be a member
of the science team or a member of a citizen science project, the things we
want to put in front of those people are the set of things that are
different, peculiar, unusual, unknown. When the volumes of data get this
large we try to automate as much as possible, so we need to train as much
as possible.
So once we have these features and have extracted them from the data, which
is what I would call a nice level 3 product, as it's known in LSST land, or
a value-added catalog if you want to think of it that way, we have a
curated set of these things. So somebody in a university or some research
team may curate features of galaxies, or curate features of time series
from these variable stars.
And then, essentially, it's a database of characterizations which are
completely descriptive of the data and not descriptive of the astronomer's
opinion of what's in the data. So this is the characterization step: not
the labeling, but the measuring and detecting of features; then curating
that set and making it searchable and minable by others, to look for
patterns, trends and relationships to known astrophysical phenomena, and
maybe even discover new astrophysical phenomena.
Okay, and so in the language of unsupervised learning: clustering, or class
discovery; principal component analysis, which is of course dimension
reduction; outlier detection, which I prefer to call surprise discovery,
because an outlier is basically something that is not behaving like the
rest of the data, so it's a surprise; it's behaving in a manner
inconsistent with the normal behavior of the data distribution. So, finding
those unusual behaviors in the data. And link analysis, association
analysis, or network analysis: basically building a sort of network of
these curated features to find strong associations and strong links. And
hopefully, as you find those things, they actually imply some kind of
astrophysical process behind them.
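A minimal sketch of that unsupervised pass, assuming Python with
scikit-learn as a stand-in for whatever tooling a real pipeline would use;
the feature table, component count, cluster count, and surprise threshold
are all illustrative assumptions.

```python
# A sketch only, assuming scikit-learn as a stand-in for real pipeline
# tooling; the component count, cluster count, and surprise threshold
# are illustrative choices, not anyone's production settings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def characterize(features, n_components=3, n_clusters=5, sigma=3.0):
    """features: array of shape (n_objects, n_descriptors).
    Returns cluster labels (class discovery) plus indices of 'surprises'
    (objects unusually far from every cluster center)."""
    reduced = PCA(n_components=n_components).fit_transform(features)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(reduced)
    dist = km.transform(reduced).min(axis=1)    # distance to nearest center
    cut = dist.mean() + sigma * dist.std()
    surprises = np.flatnonzero(dist > cut)      # candidates for human eyes
    return km.labels_, surprises
```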
So the discovery of these links and these features in the characterization
space hopefully leads to better insights into the astrophysical processes
that are at work. I mean, that's what astronomers do all the time; we are
just doing it at a much larger scale.
All right. So the promise: big data leads to big insights and new
discoveries. We thought this was kind of fun: the KDD conference starts
today in Beijing, so get on the plane quick, Alex. All those who just came
back from Beijing, head back.
Okay, the scary news is that big data is taking us to this tipping point.
It's coming down at us and the old tools are not going to work, like that
guy there. The good news is that big data is sexy, and by that I mean we
can really attract great minds and great thinking people, as evidenced by
the people in the room today. But we can also attract these citizen
scientists. We can attract people to our problems because they see it's
really exciting and interesting, and it's pretty cool to work on this
stuff.
So if you can't read the cartoon, she says to Dilbert, "So what do you do
for a living?" And Dilbert says, "I am working on a framework to allow
construction of large-scale analytical queries on unstructured data". And
she says, "I'm a little turned on by that". And he says, "Settle down.
It's just a framework".
>>: [indiscernible].
>>: This is Dilbert, he is very sexist, and it’s not my fault.
>>: [indiscernible].
>> Kirk Borne: So there are many technologies associated with big data,
including approaches that are computational science, approaches that are
data science, and, as we are now saying, approaches that are citizen
science. Okay, so crowdsourcing the data.
So a colleague of mine put a slide together somewhat similar to this, and I
sort of enhanced it a little bit. Some of you have seen this slide before,
and I know George has presented things like this before: modes of
computing, sorted historically; computational and numeric computation, in
silico computing.
Okay, that's our high-performance-computing, computational-science
paradigm. I like to say to my students who are just learning how to do
science: when you build a model for something, first you have to
parameterize the problem. And by parameterizing the problem you have
immediately injected subjectivity into what you are doing.
So, I collided pairs of galaxies; I used to do this a lot in my younger
days. And I would make all kinds of assumptions about the properties of
the galaxy: the distribution of dark matter, the ratio of luminous to dark
matter in the galaxy, the recipe for star formation. All these things were
knobs, and they were characterized by some parameter in the model.
So basically I was parameterizing my ignorance of how it all really worked.
And so the model in that sense was very subjective: if I didn't have a good
understanding or representation of a certain astrophysical behavior to put
into that numerical code, I was going to get a garbage-in, garbage-out
situation. So even though it's very powerful, and you can do an enormous
number of things with numerical computation, and I spent half my career
doing this stuff so I am not dissing it in any way, I am just saying it has
subjectivity associated with it, that's all.
In the realm of data science, or computational intelligence, the ideal is
that it becomes objective and data driven. And this is where I like to
focus mostly, for myself, on unsupervised techniques, where we are not
trying to apply a previously learned label, because there might be
something new going on. In the surprise discovery space people sometimes
call that semi-supervised learning: you try to classify an object into a
known class, to put a known label on it, but its behavior in feature space
is so distinctly different from everything else that you need to create a
new class, or a new cluster, out of that data point or cloud of data
points. So you try to run a supervised algorithm, that is, classification,
but you end up having to do something unsupervised with it.
So it's data driven; it's objective in the sense that it's the evidence
itself. It's a forensics-based approach to the science. All right. So
this is great, but it only works as well as the algorithm that you have
working on it. And again, if you are applying the wrong algorithm, or you
don't know the right algorithm to apply, then you might be missing
something. So human computation fills that gap, where we haven't quite
figured out what to ask the data yet, you know, which feature extraction
algorithm to apply to the data yet; and then we have people.
And so when I say human computation, it's not just citizen science but
also members of the science team. So when I say scientist, it can be any
science team member now: people who look at the data and exploit the
capabilities of human cognition to recognize patterns, to recognize
anomalies and outliers in data. And this is the power that they bring.
So you think about the discovery of Hanny's Voorwerp. Some of you know the
story about this blue blob next to this galaxy and sort of a traditional
algorithm for galaxy morphology --.
Oh, okay, the building is not on fire.
So the traditional algorithm in a data pipeline for galaxy morphology is
that you scan the image until you find an extended source, then you scan
those pixels until you find the peak of the distribution of that extended
source, and then you measure the brightness in the matrix of pixels around
it until you reach the sky brightness. Then you stop; now all those pixels
make a galaxy, and you start measuring shape, color, orientation,
asymmetry, and all kinds of other things. But if something is outside of
that box, it's no longer considered part of the galaxy.
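A rough sketch of that pipeline box, in Python with scipy as an assumed
stand-in for a real source-extraction pipeline; the thresholding rule is
simplified for illustration.

```python
# A rough sketch, assuming scipy; real source extraction is far more
# careful, but this shows the "box around the galaxy" logic described.
import numpy as np
from scipy import ndimage

def extract_sources(image, nsigma=3.0):
    """Threshold above the sky, group connected pixels into sources, and
    measure each. Pixels not connected to a source (like the Voorwerp
    next to its galaxy) are never attributed to that galaxy."""
    sky, noise = np.median(image), np.std(image)
    mask = image > sky + nsigma * noise
    labels, n = ndimage.label(mask)             # connected components
    sources = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        sources.append({"centroid": (ys.mean(), xs.mean()),
                        "npix": ys.size,
                        "flux": float(image[ys, xs].sum())})
    return sources
```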
So Hanny's Voorwerp was one of these things that were outside the box of
pixels around that galaxy. So when Hanny was asked to classify this galaxy
one morning, a nice-looking small galaxy, she, being human, wasn't trained
as the algorithm was to look only in that matrix of pixels; she looked and
said, "What is this?" So she was providing context to the data. And this
is what the human can do for us. Again, whether the human is a trained PhD
astronomer or one of our volunteer citizen scientists, the human naturally
will look at the context of what is being presented in front of them to
understand what it means.
So I have this problem when I come to conferences. I go to too many things
and I will see someone, and of course I know their name in real life, but
at the instant they arrive in front of my face: who is this person? And I
say, "Oh yeah, I am at an astroinformatics conference, it's so and so". I
had that happen to me Monday morning a couple of times. People walked up
to me and I had to look at their name tags, and it was very embarrassing.
Sorry, Joe.
But it's just like, once you have the context, then it sort of fits. Okay,
so Hanny was providing context to understand this anomaly. Well, not to
understand it, but to actually be the first one to say, "Hey, look at this,
there is something different here".
All right. So Galaxy Zoo is an example of crowdsourcing. And I just want
to mention that we have this project, of which [indiscernible] is the
leader, and there are many, many citizen scientist projects within this
universe.
>>: I re-launched yesterday.
>> Kirk Borne: Re-launched yesterday, yeah, yeah, yeah. There is a whole
bunch of new galaxies; a couple hundred thousand from Hubble or something
in there.
>>: Oh and [indiscernible].
>>: And from [indiscernible].
>> Kirk Borne: Oh yeah, from [indiscernible].
>>: Now is it the same galaxy platform for the biology?
>>: Yeah, it is.
>> Kirk Borne: Yeah, there are lots of different things there.
>>: Because there is a platform for [indiscernible] visualization and
processing which is called Galaxy [indiscernible]. I don't know
[indiscernible].
>> Kirk Borne: No, no, no, no, no, that’s something else. We are talking
about real galaxies, not a software package called Galaxies.
All right. So just a brief statement which will lead up to a more specific
thing I will show you. Of course there are, normally, two types of galaxies
in the universe: there are the spirals and the ellipticals. Here are some
ellipticals, here are some spirals, but there are also lots of peculiar
galaxies, things that don't fit.
Okay, again, coming back to this power of human cognition to discern
anomalies and things that don't fit the normal pattern: this is where
discovery becomes possible when you have people looking at the data,
because the algorithm may want to claim that something like this is an
elliptical, or this is a spiral, or this is a spiral, but it's really quite
a bit more complex than that.
Okay, so there are lots of things you can do with peculiar galaxies. For
example, one of the other things, as Galaxy Zoo announced recently, is that
you can write out any phrase you want with galaxies; so I wrote my name
last night. You just put in a phrase and it finds the galaxy alphabet to
spell out whatever you wish. Well, you can also do real scientific things.
Okay, so, galaxies gone wild. That is what I spent the first part of my
career doing: understanding the astrophysical process of two galaxies
passing each other in space over the age of the universe, the transference
of their orbital energy into internal energies, and those two galaxies
merging and becoming one.
And this merger process, this assembly process of galaxies, may explain why
we have two generic types of galaxies, because the spirals may merge to
become the ellipticals. So the study of this has developed very
sophisticated theories, like this equation at the top of the slide here,
1+1=1, the only equation in my talk: the merger of two galaxies to become
one.
And so way back in the day, so I am really dating myself, back in the late
70s and early 80s, I personally developed a computer algorithm, a numerical
simulation algorithm, that would collide two galaxies, and I would then
explore the shapes of those two galaxies and their merger product as time
went on during the simulation, to see if it matched one of these observed
things.
And I would tweak the orbit parameters, and the viewing parameters, and the
mass ratios, and all kinds of things to find the set of galaxies in my
simulation that best matched an observed pair. And that search process
took quite a bit of time, so over the course of the four years I worked on
my thesis I probably ran roughly a thousand simulations, and solved the
orbit and mass ratios, and internal shape parameters, the complete solution
in some sense, for two pairs of galaxies. Okay, two interacting pairs of
galaxies in four years.
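As a rough illustration of this kind of simulation, here is a minimal
test-particle sketch in Python of the restricted three-body idea (which
comes up again in the Q&A later): the two galaxy centers attract each other
and massless star particles respond to both. Units, softening, and the
integrator are arbitrary choices; this is not the thesis code or the Zoo
code.

```python
# A toy test-particle sketch of a restricted three-body galaxy collision:
# two point-mass centers attract each other and the massless "stars",
# which exert no force back. Units, softening, and the integrator are
# arbitrary; this illustrates the scheme, not the actual code.
import numpy as np

def accel(p, src, m, G=1.0, eps2=0.01):
    d = src - p                                  # softened inverse-square
    return G * m * d / (np.sum(d * d, axis=-1, keepdims=True) + eps2) ** 1.5

def restricted_merger(stars, g1, g2, m1, m2, dt=0.01, steps=2000):
    """stars, g1, g2 are [position, velocity] pairs of numpy arrays;
    stars holds N test particles, g1 and g2 the two galaxy centers."""
    for _ in range(steps):
        g1[1] += accel(g1[0], g2[0], m2) * dt    # centers pull on each other
        g2[1] += accel(g2[0], g1[0], m1) * dt
        stars[1] += (accel(stars[0], g1[0], m1)
                     + accel(stars[0], g2[0], m2)) * dt
        for body in (g1, g2, stars):
            body[0] += body[1] * dt              # advance positions
    return stars[0]                              # final star positions
```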
So along comes Galaxy Mergers Zoo. The idea was, colleagues of mine at
George Mason University, led by John Wallin, put together one of the tasks
on the Zooniverse site, Galaxy Mergers Zoo, where we present to volunteers
essentially a Las Vegas slot machine sort of user interface. So there is a
3x3 array of galaxies. The one in the middle is the actual Sloan image, an
actual image of a pair of colliding galaxies from Sloan. And in the other
eight boxes are eight independent simulations, which you can watch run,
like you watch the oranges and apples spin on a slot machine.
So you push go and you see eight new simulations. And if you see one that
looks close you click on it, and if not you push go again and everything
spins and you see more. After the first day I think we had 20,000
simulations viewed. So my thesis was done 20 times over in the first day.
So what people were doing is clicking on and discovering the simulations
that looked most like the pair there. And there was --. Thank you. There
was a lot of randomness in the parameter selection, but it actually
improved with time, and there are ways in which we changed the interface to
enable people to start selecting their own parameter ranges and so on, but
that's another talk for another day. I just wanted to show some examples in
the next few slides.
So they will all have this sort of pattern in the next few slides. There
will be the Sloan image, which is reproduced in gray scale in this corner,
and then just three examples of simulations that people found just by this
inspection. And I should say that we have done, I have already lost track
of the number, like 60 or 80 galaxy pairs now, and about 10 million
simulations viewed.
So again, I spent four years looking at a thousand simulations, and our
volunteers, 20,000 or so of them, have now viewed roughly 10 million
simulations and found really good matching pairs. There is another feature
of this site, which again I could talk about over the break, called Galaxy
Wars, where we take some of the best ones that people have found and pit
those against each other. So basically we say, "Okay, here's the Sloan
image, and here's what some people thought was a good fit, and here's what
other people thought was a good fit. Which one do you think is the best?"
And so we did all these [indiscernible] tests, one against the other; we
pitted these simulations one against the other. And the ones that were
promoted to this slide were those that won all of their Galaxy Wars
competitions. That is, every time a particular simulation was compared
with another simulation that people thought was a good match, this
particular one always won.
All right. So there are so many of these that there is a [indiscernible]
problem with getting them all to compete with one another. And so there
are many of them which have unanimous votes, but not too many, like a
handful. In other words, there really are three that won every time in
their competitions.
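A minimal sketch of that promotion rule, with hypothetical inputs: given a
list of head-to-head votes, keep only the simulations that competed and
never lost.

```python
# A sketch only: given hypothetical (winner, loser) head-to-head votes,
# keep the simulations that competed and never lost a Galaxy War.
def unanimous_winners(matches):
    played, lost = set(), set()
    for winner, loser in matches:
        played.update((winner, loser))
        lost.add(loser)
    return played - lost
```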
So anyway, here are some examples --.
Yeah?
>>: Are these snapshots of the simulations, or are they running in real
time?
>> Kirk Borne: Oh, you actually can see the galaxies move; we are doing it
in real time. It's not full N-body; it's sort of a restricted three-body
simulation where the force field is calculated from the actual galaxy star
distribution. So people are actually watching the simulation take place.
And it goes quickly.
>>: Yeah, but that might actually make it more difficult for them to --.
>> Kirk Borne: No, people are having fun with this. I don't know if it's
--. No, the simulation runs and then it stops at a certain point, because
we know the projected separation. So it will stop at the point where it
reaches that. But anyway, we can talk about that separately. The
simulation doesn't just keep running so that they have to find the right
moment; it runs to the point where the projected separation and the tilts
of the galaxies, as best as we can tell, look something like the final
output.
And so there is a lot of pre-processing that the science team does before
one of these even goes up on the website; that is, figuring out what those
end points of the simulation are, so that when it's presented to the user
it stops at a point where the separations and orientations correspond to
what they are looking for.
Okay. So just to show you some examples, to show that simulations, just
like the real universe, can produce a wide variety of outcomes; and it's
really remarkable that people can find simulations this way that actually
match a whole range of peculiar morphologies.
So again, the goal is that we want to parameterize and characterize these
pairs of galaxies: their [indiscernible] of the revolution, the mass
ratios, the likely chance of merging, how soon they will merge, and a
number --. We are actually doing some star formation in the simulations
now, using, I should say, the best-fit models that the humans have
provided for us to do full N-body simulations with star formation, you
know, tree code plus SPH. And we have a graduate student who is actually
doing this for his thesis; he is using these initial parameter models to
feed more sophisticated simulations.
And one of the interesting discoveries from this is that the best fit orbits
fit into a narrow tube in orbit parameter space. If you look at all the
possible orbits, so we show the trajectories of the collision for all
possible orbits that were presented to end users, it just fills this volume.
But then when you turn off all the trajectories that people didn't click on
and you show just the trajectories of those that people thought were good
matches, they tend to follow a tube. They fit into a well-defined tube in
this three-dimensional space.
And then when you pick out those simulations that won the Galaxy Wars, the
head-to-head competitions, they fit into a much narrower, more confined set
of trajectories. So they really are finding, hopefully, a unique solution
there, considering the constraints we have and the number of parameters we
are looking at.
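The filtering itself is simple to picture in code. Below is an illustrative sketch with randomly generated placeholder trajectories, showing the successive cuts: all candidate orbits, then only volunteer-selected ones, then only the Galaxy Wars winners; none of these arrays come from the real project.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_sims, n_steps = 1000, 50
# Placeholder 3D trajectories standing in for the real orbit solutions.
trajectories = rng.normal(size=(n_sims, n_steps, 3)).cumsum(axis=1)
clicked = rng.random(n_sims) < 0.05              # sims volunteers selected
won_wars = clicked & (rng.random(n_sims) < 0.2)  # subset that won head-to-heads

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for traj in trajectories[clicked]:
    ax.plot(*traj.T, color="gray", alpha=0.3)    # the "tube" of good fits
for traj in trajectories[won_wars]:
    ax.plot(*traj.T, color="red")                # narrower tube of winners
plt.show()
```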
So I am supposed to stop. I will just run past some slides. So again, we are
trying to train the automatic classifiers with this inspection. The human
inspection may include members of the science team or millions of citizen
scientists. And at the end of the day we would really like to annotate and
tag things and curate them so that discoveries can be made. And this is
really applicable to the events that LSST will discover, so people can start
describing things they see in the time series.
And all of these words pretty much say the things I have already been saying:
that we want to use this service to actually enable scientists to explore
that parameter space, that feature space, and start discovering anomalies,
outliers, or as I say surprises, plus better characterization of known events
and discovery of the unknown unknown events. And so we are really addressing
these challenges both through data science and through citizen science.
Okay, so human computation, which includes the human providing the tag: we
want to move from this space down here to autonomous tagging, with better and
better algorithms here, so that the data which are shown to the scientists,
the humans, become more focused on those that really need attention and not
those that are obvious.
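A minimal sketch of that triage loop, with a hypothetical `model.predict_with_confidence` interface and an assumed confidence threshold; the point is only the routing logic, not any particular classifier.

```python
def triage(events, model, auto_threshold=0.95):
    """Tag high-confidence events autonomously; route the rest to humans."""
    auto_tagged, needs_human = [], []
    for event in events:
        # `predict_with_confidence` is an assumed interface returning
        # (label, confidence in [0, 1]); substitute any real classifier.
        label, confidence = model.predict_with_confidence(event)
        if confidence >= auto_threshold:
            auto_tagged.append((event, label))   # the pipeline handles it
        else:
            needs_human.append(event)            # shown to scientists/citizens
    return auto_tagged, needs_human
```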
All right, so. Thank you very much.
>> Yan Xu: Okay, so questions?
>> Kirk Borne: Yes, Joe.
>>: You may have already answered it, but when you were doing, I guess, the
nine-panel comparisons with the actual Sloan images, did you look to see if
there were any biases evident, that humans were tending to pick images
preferentially on the right rather than on the left, or toward the bottom?
>> Kirk Borne: No, I have not done that.
>>: There was a documented case of [indiscernible].
>> Kirk Borne: No, no that’s a different question.
>>: But it is a similar bias.
>> Kirk Borne: No, what happened in that case was the buttons were, you know,
is it elliptical, clockwise, or anti-clockwise, and people tend to gravitate
to the middle button when they are confused. So they tended to gravitate to
that middle button, which led to more anti-clockwise galaxies. But in this
case it's symmetric all the way around.
That is, there is no preference where that simulation --. A lot of the same
simulations are presented to multiple people, and there is no preference as
to where we place them in this 3x3 array. So a given simulation will appear
anywhere at any given time. And so if there is some preference to click on
the left all the time, it's going to be washed out.
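A sketch of that randomized placement, assuming a simple row-major numbering of the 3x3 grid; shuffling the cell assignment on every presentation is what washes out any positional click bias over many views.

```python
import random

def place_in_grid(simulation_ids):
    """Assign nine simulations to random cells of the 3x3 grid (row-major 0-8)."""
    assert len(simulation_ids) == 9
    cells = list(range(9))
    random.shuffle(cells)  # new random layout on every presentation
    return dict(zip(simulation_ids, cells))

layout = place_in_grid([f"sim_{i}" for i in range(9)])
```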
>>: What George was referring to is a really interesting thing where people
tended to preferentially select right-handed spirals over left-handed
spirals. And when they dug into the data --.
>>: You are from Australia, right?
>>: Yes, that's right. But no, they found that right-handed people tended to
pick right-handed spirals and left-handed people tended to pick left. So in
fact there is a Galaxy Zoo psychology paper published on this.
>> Kirk Borne: Yeah, right, but in fact the solution was neither
astrophysics, which people initially thought, that the universe had this
handedness, nor psychology; that is, it wasn't a perception thing. It was a
user interface issue, going right back to Ben Shneiderman's work which we
heard about earlier this morning. How you place the buttons on the screen
makes all the difference in where people will click.
Anyway, Ani had his hand up first, I think.
>>: As far as scaling up to the LSST event numbers or even other kinds of
classifications, do you think citizen science will get there to
[indiscernible]?
>> Kirk Borne: Well, if we were to launch a citizen science project today
where we asked people to classify 20 billion galaxies instead of the 200,000
new galaxies, no, it won't scale. If we ask people to look at 2 million
events every night versus what Planet Hunters does today, which is some few
thousand [indiscernible], no, it won't scale. But the goal is that we are
learning from the current citizen science experiments, the light curves, you
know, the time series light curve stuff with Planet Hunters and so on, to
train the algorithms better. So by the time we get to LSST we will
understand a lot of the "anomalous" things well enough by then to
automatically classify those. And what will be left hopefully will be the
things which are still new and different above that.
Okay, so the whole goal is to move the unusual, the anomalous, and the stuff
that doesn't fit our algorithms into the space where we have an algorithm we
can put into the pipeline, and then focus on the things that are left. So
it's that multi-fingered thing we saw the other day: we are trying to move
that envelope of what we know how to label and classify up into that space of
unknown unknowns.
>>: George?
>>: Just a comment to expand on that. I think harvesting human pattern
recognition and domain knowledge during [indiscernible] is exactly what we
ought to be doing, because on that scale there is not enough human time and
attention [indiscernible]. And we have been trying to do this for some time
now. The approach we have been gravitating to is not just open citizen
science, which has its own good uses, but trying to engage communities with a
certain level of expertise, for example amateur astronomers, who can answer
much more sophisticated questions posed to them, and also dynamically
changing the level of the inquiry on the basis of what that particular
citizen scientist has done in the past.
And overall I would say that this is continuing on the path of collaborative
human-computer discovery, where a computer can suggest something with which
the human can agree or disagree, and they evolve towards a much better
solution.
>> Kirk Borne: So, in fact, on the LSST team we are having that very
conversation, because when we discuss our user interface we have the science
user interface people in a room along with the education people. So the idea
is: how do you recognize, if you can, what type of user you have just by
their interaction, and then give them different tasks, so to speak, in this
volunteer space that are appropriate to their skill level?
>>: I think it was kind of interesting, your pyramid, and sort of putting the
hard problems up at the top for citizen science. I think what often happens
in processing is that the stuff that you put up at the top so that people can
look at it will confuse the baseline processing, because of the volume of
data. You know, it's not that the algorithms are just so good at picking out
the anomalous stuff or the outlier stuff; it sort of gets folded into what
they are looking at and confuses the processing. So I think there is
actually a challenge down at the bottom.
And the other thing I just want to say is that it's kind of interesting, and
I know you have got a volume problem and a lot of data coming out, but we
have always worked hard to do just the opposite. When we have to do
validation or quality control we have kept it to one or two people, or a very
tight group, just to get the subjectiveness out of it. But you know, this is
a different scale of things.
>> Kirk Borne: Right.
>>: And I guess there is a concern about that subjective nature; not only in
the computational stuff, but in the citizen science stuff.
>> Kirk Borne: Well, you hit on a very important point there, I think; thank
you for doing that. And that is quality assurance. This type of human
interaction with the data by the science team is all important for, you know,
actual pipeline and detector quality assurance types of issues, finding these
anomalies in the data, sort of QA of the data. And so we are not necessarily
asking these volunteers to do the quality assurance for us, right, because
hopefully it has already gone past that process: we know it's not an image
artifact, we know it's not a glint in the optics; it really is some
astrophysical thing there. So let's put this in front of the end user
community, who is really good at detecting the pattern.
So I was talking earlier about outlier detection and how I like to call that
surprise discovery. I mean, the surprise might be that there is something
wrong with your pipeline, or there is something wrong with your camera, or it
might be a truly astrophysical phenomenon that is causing that six sigma
deviation from the rest of the data.
Again, you sort of hit the nail on the head there. What we are trying to do,
instead of doing outlier removal, which most of our pipelines are doing with
this three sigma clipping or whatever, where we throw that away: no, let's
take a look at what we are throwing away before we assume that it's just some
statistical deviation in the data, when it might be an astrophysical
deviation in the data.
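A minimal sketch of that contrast: instead of discarding the clipped points, set them aside as a review queue. The three-sigma rule and the toy data are assumptions for illustration.

```python
import numpy as np

def clip_but_keep(values, n_sigma=3.0):
    """Split data into (clean, review_queue) instead of discarding outliers."""
    values = np.asarray(values, dtype=float)
    mu, sigma = values.mean(), values.std()
    outlier_mask = np.abs(values - mu) > n_sigma * sigma
    return values[~outlier_mask], values[outlier_mask]

rng = np.random.default_rng(1)
data = np.append(rng.normal(1.0, 0.1, 50), 9.7)  # 50 inliers plus one outlier
clean, review_queue = clip_but_keep(data)
print(review_queue)  # -> [9.7]: inspect it before assuming it is noise
```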
>>: Last question.
>>: So, I have always wondered about a couple of ways that machine learning
might help the citizen science [indiscernible]; one is optimal combination of
[indiscernible] labelers, some highly expert labelers and image labelers.
The images can still be used in [indiscernible] ways, but maybe in different
ways with different ratings or different objects. And like you said, just
choosing which objects to present and to which labelers can be done
[indiscernible].
>> Kirk Borne: Right, well certainly this whole concept of ensemble learning
and multiple weak classifiers comes into play here. In fact, I saw an
interesting title for a paper about this topic recently, called "Good
Learners for Evil Teachers". And the idea is that these individual
classifiers may not have a very high accuracy; you know, they may be only
like 55 percent accurate. But if you have lots of these classifiers voting
and they are all voting in the same direction, then you have a pretty good
idea.
So yeah, the algorithms themselves provide some vote on what we think the
thing is, but the humans are also in the loop. And so again, as you say, it
is sort of that interaction between the two, where we hopefully will find the
right weights for those votes so we come up with the best interpretation at
the end. But again, this is exploratory research, so this is an exciting
field right now, I think.
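A small sketch of that weighted-vote combination, with assumed weights: many weak classifier votes plus a more heavily weighted human vote, resolved by weighted majority.

```python
from collections import Counter

def weighted_vote(votes):
    """votes: (label, weight) pairs from classifiers and humans (weights assumed)."""
    totals = Counter()
    for label, weight in votes:
        totals[label] += weight
    label, score = totals.most_common(1)[0]
    return label, score / sum(totals.values())  # label and normalized support

label, support = weighted_vote([
    ("merger", 0.55), ("merger", 0.55), ("not_merger", 0.55),  # weak classifiers
    ("merger", 2.0),                                           # expert human vote
])
```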
>>: And a proposal like that has been turned down three times in a row.
>> Kirk Borne: Well you know all good proposals are rejected.
>>: Yeah right.
>>: Okay, I am afraid we have got to stop there. So we are running a bit
behind schedule --. Let's thank Kirk again.