>> Yan Xu: Welcome to the third day's lecture and it's my great pleasure to start off by introducing Steven Drucker from Microsoft. He is going to tell us about interface exploration for managing complexity. >> Steven Drucker: Thank you. So I tell my students not to begin talks with disclaimers, but I am going to do that right now. I am not an astronomer. I am not a statistician. I really focus on --. My background is in computer graphics and more recently in interfaces and information visualization. So I am going to talk a little bit about this, and really I wanted to show a couple of demos, actually three demos, of things that we are working on right now. So first a kind of brief intro; it probably is not necessary for this audience, but I can never quite tell. What is information visualization about? Really it's about how do we understand data, and understand data by taking advantage of the human perceptual system. And the way we go about doing this, and again I am going quickly because I assume most of you know this, is we convert this information in some way to a graphical form, a graphical representation, so that we can use our perception of patterns, colors and other aspects to see patterns. And there are lots of questions: how do you go about doing that, how do you do a better job than other methods? And that is kind of what information visualization is about. Really, specifically, as I said before, it's about making these large data sets coherent, and really it's how to summarize this information and present it compactly. Hey, I have got a talk right now, how nice. It's also about presenting information from various viewpoints and showing information at different levels of detail. You know, Marti Hearst is a researcher at Berkeley and she has got a great intro to information visualization that you can find on the web. And Ben Shneiderman at the University of Maryland also has a fine book about it. And he has also coined this visualization mantra, which is to say, "Overview first, then zoom and filter, then details on demand". That's kind of the pattern of the talk I am giving. It's overview first and then I am going to zoom in on a couple of interesting problems. First of all, a lot of people come to me when I say, "I do information visualization". They say, "Oh, well I have got large data and I am really skeptical that your visual system is going to be able to do anything about that". And I like to point them to Anscombe's quartet. Francis Anscombe in 1973 came up with this nice representation. This is a small data set and you say, "Okay, here are four things; if you actually look at the statistics of these four things they all have the same mean, they have all got the same minimum, the same maximum, the same regression lines. Can anybody see the patterns in this from looking at these tables?" Well you might be able to; you might be good at that. But if you look at the visual representation, a very simple visual representation, you can see what these patterns are. And to me this is one of the most sort of profound simple representations. Look, we have got trends going downwards, we have got outliers, we have got, you know, everything the same with a single outlier. We have got scatters. So, you know, from this you can immediately sort of say, "Everything is not going to be that simple, but this is to me the essence and core of why it is important to visualize the data". Other people are realizing this.
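A minimal Python sketch of that point, assuming only numpy; the x and y values below are the published Anscombe (1973) data, not anything taken from the talk:

```python
# Anscombe's quartet: four x/y sets with nearly identical summary
# statistics but completely different shapes when plotted.
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8]*7 + [19] + [8]*3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.asarray(x, float), np.asarray(y, float)
    slope, intercept = np.polyfit(x, y, 1)          # least-squares line
    print(f"{name}: mean_y={y.mean():.2f}  var_y={y.var(ddof=1):.2f}  "
          f"r={np.corrcoef(x, y)[0, 1]:.3f}  fit: y={slope:.2f}x+{intercept:.2f}")
# Every set prints ~mean 7.50, var 4.12, r 0.816, y = 0.50x + 3.00, yet a
# simple scatter plot instantly shows a trend, a curve, an outlier, a lever point.
```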
This is a company called Tableau that is trying to make it easy, turnkey, for end users to be able to visualize this information without programming in something like R, in a very drag-and-drop way. And that's in some ways the current state of the art in the industry in visualizing information. And rather than show something like Tableau I wanted to show some recent directions that we are looking in. So really I am going to talk about two kinds of things. I am going to have to log onto this in a moment. One of them is how do we make it more natural to interact with data. How do we use a tablet? Tablets are starting to be pervasive. Is there some way of taking something like Tableau, putting it into someone's hands, and can they use that? Then I want to talk about something probably more relevant to you guys, which is how do we start scaling this up? How do we make it so that we are dealing with petabyte databases, or even just look at large amounts of data all at once? So for this tablet thing I am going to have to take a little moment and log onto my tablet with this stupid keyboard here, because Microsoft security makes it log out every 15 minutes if untouched. And that was just long enough for it to log out, so one moment here. >>: You should worry when it logs out just as you are typing. >> Steven Drucker: No, it's when it installs the updates. Okay, so this is a project we are doing right now. And in fact it's going to be submitted to a conference next week. From an interface standpoint you have got two possible ways of going about putting an application on a tablet device. One is that you can follow the paradigms that you have gotten familiar with on the desktop and just make it tablet enabled. Just make it so that the buttons are bigger, the menus are bigger, and you can click it, and you can operate all the stuff, there is no hover; that's one way of doing it. And that's got some real advantages because people are familiar with it. And they have built up paradigms for years of how to use a desktop-like application. The other way is, let's re-think it. Let's see if we can design it touch first. Direct manipulation, you are really touching the data. And will people find the benefit of that? That's the fairly simple exploration that we did. Basically we built two full prototypes of this. And I wanted to just kind of show you these prototypes. This is mostly business data, but hopefully you will get the idea. I will first show you the prototype. Let me switch over to here. Okay, good. So I will first show you the sort of standard prototype. This is a business data set of coffee sales over a year. And you can see that, okay, look, I can see in 2001 I did better than 2010. We can do things like, let's actually view this by, oh, the regions that we are in. And you can very quickly see that, oh, the west region was slightly better than the central region; you know, I can very quickly kind of see that. If I want to view by something like month, and we want to find the best sales over the months, then you might actually want to sort your data. So I can't tell if you can see where I am clicking, but everything on this interface, I am touching on this control panel. It's a very standard way of interacting with this information. Now if I want to actually see, you know, what sold in the month of July, I can go down here and I can say, let me turn off everything except for, well, let's say January.
So that's the month of January; again, very simple to do that sort of thing. Let me turn everything back on here. Likewise, if I want to drill down to focus on everything, how much did coffee sell in each of the months, I can tell that. And if I want to I can split this out by, you know, region, and let me actually reset here. Break it apart by region and then drill down by something like caffeine type and see, did caffeinated beverages sell better than the non-caffeinated, and in what region? It seems like in the South they sold about equal, but in the West, where we need coffee, we usually don't get weather like this, we sell a lot more caffeinated than decaf. So this is a fairly simple interface to understand. What we are trying to do now is compare this with a kind of re-thought interface, which is to get rid of that chrome. We are calling it the Chromeless version. Everything is on the graph; you are actually interacting with the graph, in some ways as if the elements were elements that you could touch. So first of all we can simply view it by month. And now if I have got it by month I can simply drag on it and sort that, or I can drag it and sort that way, or we can alphabetize it. So dragging goes like that, and you can sort just by touching the axis and by dragging. Even more important is that we can kind of throw things out. So I am actually just touching the data and throwing it out. If I want to throw out a bunch of things I can throw them all out, or if I want to focus on one individual item I can focus on that item. And then if I want to I can switch to different product types and say, "Okay, I just want to see espresso sales in a particular region", and be able to do that very quickly. So, fairly simple different approaches to the same problem. And we genuinely did not know which was going to be better. And the paradigm that we work in is that we build these interfaces and we show them to people and we try to actually have them solve real problems in doing this. So we actually asked users to come in, who had some experience doing charts and business visualization, to use both of these interfaces. And by the way, I don't know if you realize, but this second interface is actually slightly crippled in comparison to the other interface. You can do everything that you can in the other interface, except that if you want to filter something out you need to be viewing the data in that form. So if I want to filter out coffee I need to be seeing coffee, tea and the other things in order to be able to filter it out. And so anyway, it's like, okay, are people going to have problems with this? We were actually really surprised, which is why I am telling this story: we just sampled 17 people, so certainly not a huge sample, although it was statistically significant; 14 of the people really liked the Chromeless version. Now it's hard to see, and we are delving into why that's the case. It might be that it was new and novel. It might be the problems that we were asking them to do; since they could be done on both of these interfaces they liked this Chromeless interface. It might be because the screen was bigger; they had more real estate to do it. But really what people said is that they felt that their hands were on the data. They were touching it, they were feeling it, they really kind of felt that it matched the flow of what they were doing to solve the problems. And as I say, this is work that we are still doing.
It is going to be sent in soon, so we are still analyzing all the data, but in most categories there were actually 13 of these people who really liked this better and 4 who liked the other better; so we know that it wasn't the crippling, and some people liked the other interface better. You could do everything. So we tried to make this as fair as possible. So I think this is pretty interesting and intriguing, especially if you look at how we actually start porting these applications. Does it make sense to do an application specifically for a different UI? And yes, these people weren't switching back and forth, and there are all sorts of other considerations. But clearly, at least here, people liked a re-thought-out interface that matched the device that they were using. So that's kind of some of the conclusions that we are coming to on this. Now let's switch back here. Okay, so again just to summarize a little bit, 13 preferred Chromeless, 4 preferred this, and this is just some reading and some subjective measures. The scores on the right were essentially for the Chromeless interface, and these questions were: how easy was it to use? How easy was it to learn? How quickly could you do it, and all these other things? And you can actually see that, yes, there were some outliers, but there was clearly a huge difference in user preference on this. Okay, let's actually move on to the next area. So the next area that I want to talk about is scaling up data visualization, and scaling up actually interacting with data. So the idea right now is, this is an environment that we built that is all about essentially scripting. If you are familiar with R, and I see someone has got a poster about R here, this is very similar to that kind of environment where you type something. Some of the differences of this environment are that we are always trying to provide some visualization feedback as you type. So if you type something and you can do a histogram of the results it will do that histogram. We are looking at automatically inferring what's the best visualization to do based upon the data you are seeing. So again, you can see this experiment, and part of this is that we are trying to create an entire environment where people can be online, evaluate data, share it with other people and experience what they are doing along with what you are doing. Now this doesn't help deal with petabyte data sources. What does help you deal with petabyte data sources is this next notion of progressive queries. So the idea here is that we issue a query, and I might want to stop this video for a moment, so we issued a query and instead of waiting for the day that it might take to calculate the entire query we actually start getting results back from the query right away. And looking at the incremental results allows us to, one, see if we made a really big mistake in our query, and this happens a lot of the time. You know, this is actually some flight data and we are looking at delays per week, and when I first did this query I ended up doing the sum of delays by week instead of the average delays by week. And if you look at that it's completely nonsensical. And you can find that out immediately, as opposed to issuing a command; we are back to the batch days of computing. You issue a job overnight, you wait for it to come back and then you find out that you made a mistake. So the idea here is to start looking at the results right away, whoops, let me go back here.
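The progressive loop he is describing can be sketched quickly; this Python version is only an illustration (the talk's actual system is C# over StreamInsight), with `read_chunks` and `redraw` as hypothetical stand-ins:

```python
# Progressive aggregation: consume the table in chunks, keep running
# per-group sums, and re-emit an estimate with a rough confidence
# interval after every chunk, so a wrong query (SUM instead of AVG,
# say) is obvious within seconds instead of after an overnight job.
import math
from collections import defaultdict

def progressive_mean(chunks, group_key, value_key, z=1.96):
    n = defaultdict(int)      # rows seen per group
    s = defaultdict(float)    # running sum per group
    s2 = defaultdict(float)   # running sum of squares (for variance)
    for chunk in chunks:      # each chunk: an iterable of dict-like rows
        for row in chunk:
            g, v = row[group_key], row[value_key]
            n[g] += 1; s[g] += v; s2[g] += v * v
        estimate = {}
        for g in n:
            mean = s[g] / n[g]
            var = max(s2[g] / n[g] - mean * mean, 0.0)
            half = z * math.sqrt(var / n[g])    # ~95% CI half-width
            estimate[g] = (mean - half, mean, mean + half)
        yield estimate        # the caller redraws the chart per chunk

# Hypothetical usage:
# for est in progressive_mean(read_chunks("flights"), "day_of_week", "delay"):
#     redraw(est)             # intervals shrink as more data streams in
```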
So the idea here is, you issue a command, you start looking at your results as they come back, and you also give confidence intervals based upon what you have seen so far. What are the means, what are the variances of what you have seen so far? So you can guess fairly quickly how well the answer you have got will be representing your results. Now this is not great for outliers and other problems, but it is very good for mean trends. So you can see actually here, after I have looked at, you know, some fraction, in fact only about 0.2% of the data, we already know that Thursday and Friday are the worst days of the week to travel if you care about being delayed. And I can stop it right there. So those of you that are going to be traveling out --. >>: Then send people to the airport today. >> Steven Drucker: Yeah, exactly. >>: Guess I am leaving then. >> Steven Drucker: Sorry, I will just, if I can get that there, yeah. So you can see very quickly, and that is statistically significant at that point. So the idea is to actually combine both of these techniques into one application. So here we are issuing queries looking at a database. We start the query and the results start streaming back right away. So here we are looking at, essentially, what words were typed in with weather in a query log; you say, "Weather, what other words were typed in coincidentally with weather?" And we can start looking at those results right away. And in this case we are just looking at the word lengths, which is not really important. But the important thing here is that you can type queries, see interactive results from this large database, and start streaming those results back to you in order to start evaluating further. And we are seeing, you know, huge amounts of improvement. And a lot of the trick to this is how do we actually structure our databases in order to make incremental queries possible? And the first exploration was, would people even find it useful? Because maybe people say, "Oh, maybe I want the exact results". But we found out that people make so many errors when they are doing queries, and they really want this sort of incremental feedback that they are on the right track. So people investigated further on different queries and different things when they had this facility available. Okay, so some technical notes. Right now this is written in C#, so you are typing in sort of incremental C#. We are using something called StreamInsight, which is a streaming database back end that allows us to write regular SQL queries but essentially incrementally stream the results back. And we are using an internal MSR toolkit that lets us do visualizations in HTML. Okay, so the last demo that I want to do was actually finished at about 9:00am this morning, no, actually it was at 8:30am this morning. So this is about how do we actually take the data and visualize this data, there you go, if you have got every single point? Right now we are looking at 50,000 points here. And these are points from a census data set. So right now it's just being presented up here randomly. But I can actually now kind of see what's going on here as it resolves into shape here. So this is using the graphics processing unit to render every single point in real time. And it will allow me now to kind of start exploring this data and looking at the transitions between different views of this data. So this data is just census data from different regions.
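The view-to-view transitions he mentions come down to interpolating every point's position per frame; a toy numpy sketch of the idea (in the real system this lerp would run on the GPU, e.g. in a vertex shader; `draw` is a hypothetical callback):

```python
# Animate 50,000 points from one layout (say, map positions) to
# another (say, income vs. unemployment) with eased interpolation.
import numpy as np

n = 50_000
start = np.random.rand(n, 2)             # layout A: e.g. lon/lat per county
end = np.random.rand(n, 2)               # layout B: e.g. scatter coordinates

for t in np.linspace(0.0, 1.0, 60):      # 60 frames of motion
    e = t * t * (3.0 - 2.0 * t)          # smoothstep easing
    frame = (1.0 - e) * start + e * end  # one lerp per point, per frame
    # draw(frame)                        # hand the positions to the renderer
```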
And you can actually kind of just see, if we actually look at longitude or latitude, actually it's latitude, but let me just look at longitude here. You can see that there are actually far more counties in the East than the West, because the counties in the West are bigger. If we kind of switch back to the map view you can actually see that's the case; that there are big regions where there aren't too many counties. That's not all that interesting, but if we start looking at some other patterns here we can really start investigating some patterns. So I am going to actually look at per capita income in different areas. And you can see in this sort of heat map result that here is New York City, where people are making a lot of money; you have got Silicon Valley, a little Seattle area, and you can see the pockets and the cities that start having more income. Now that's just any visualization, you can do that for any of them, but it's nice to be able to do it interactively. And what's also nice is to be able to change what you are looking at. It's not just a map; let's actually look at how this works if we are looking at things like unemployment rate. So that's unemployment rate sort of across the country. And you can see in different areas where there are higher peaks of unemployment rates. These are actually counties with high unemployment. So you are seeing a real problem in this area. And actually, let's change this and not just look at it geographically, but let's look at this based upon things like the percent with a bachelor's degree or higher. And now we see some clear trends that we are looking at, which are of course the counties that have more educated folks. They tend to be making more money; there tends to be a lower unemployment rate. >>: Are there any [indiscernible]? >> Steven Drucker: Yeah, like I said, this was done at 8:00 this morning. I have an older version that is not quite as stable, but essentially the axis that you are looking at along here is the percent with a bachelor's degree or higher. So it pretty much goes from 0 to, I am not sure what the maximum is. And then the axis along here is the percent unemployment. I think if we actually kind of look at one of these points here, sorry, I should be able to click on and find that. Okay, well, this is not working. I should be able to find out what the county is for each of those areas. So I will go back to here: average household size, so unemployment, per capita income. So yes, it's very true, in any visualization, I feel embarrassed that I am unable to do that. So the point here though is to be able to very quickly illustrate outliers from this, and being able to see those outliers is one thing, but being able to actually, let me just scale this down here, well, it reset there. As I said, I just got done. This is unemployment rate, boom. This is bachelor's degree or higher, and I should have another visualization. These visualizations are actually linked together, so, again I am not used to showing this on a little tiny screen here, but if we look at this data --. So these are two web pages that I am looking at simultaneously and they are linked together. So you could be on your computer, I could be on my computer, and we can actually take a look at these kinds of outliers and we can see where they show up in this other visualization. So you can actually see that, yeah, let's go here.
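The linked-views mechanism he is demonstrating — one shared selection, every view re-rendering when it changes, the last selection winning — reduces to a small observer pattern. A toy Python sketch (class names and data invented for illustration):

```python
# One shared selection drives every attached view, as in the demo
# where a brush in the map highlights the same counties in the scatter.
class LinkedSelection:
    def __init__(self):
        self.ids, self.views = set(), []

    def attach(self, view):
        self.views.append(view)

    def select(self, ids):
        self.ids = set(ids)          # last selection wins, as in the talk
        for v in self.views:
            v.render(self.ids)       # every linked view updates together

class View:
    def __init__(self, name, data):  # data: {record_id: (x, y)}
        self.name, self.data = name, data

    def render(self, selected):
        hits = sum(1 for rid in self.data if rid in selected)
        print(f"{self.name}: highlighting {hits} of {len(self.data)} points")

counties = {i: (i * 0.1, i * 0.2) for i in range(100)}   # toy records
shared = LinkedSelection()
shared.attach(View("map", counties))
shared.attach(View("scatter", counties))
shared.select(range(5, 15))          # a brush in either view drives both
```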
You should be able to see very quickly that Flint, Michigan and a couple of places, there was Newport, Rhode Island before when I was looking at this, are outliers in this data. And again, the point of this is to say, let's link these two data sets very quickly, sorry, so let's link these. Let's be able to make selections in one and see correspondences in the other. Let's be able to convert the data. And the reason why these histograms that I am showing over here are useful is because you actually see where this data is coming from. And it gives you a visceral feel for, you know, where these people come from, down to the individual. So again this is experimental data, but what we are really trying to do is let people look at outliers; let people see in multiple corresponding views what things you are selecting. So let me kind of get back and summarize, because I have only got a few moments left. So right now this is about multiple linked views of data, using layout. I didn't show filtering, because in this version filtering is not working, but also motion should sort of reveal these patterns to users. Right now it's dealing with an arbitrary database. And the way filtering works in the other prototype is that we can filter based on any selection. So you can select in one view and filter, and just focus on that data, and then re-lay that out. And in order to deal with 50, 100, and actually we have gone up to 300,000 points, and be able to still maintain interactive rates in dealing with this, we are using GPU-based acceleration. And right now it's implemented in WebGL, because that gives us the benefit of being able to put it on anybody's desktop. So we have tied 4 or 5 of these together so people have had joint analysis sessions, at least on the simple data on this prototype, so that they can all be talking together and be saying, "Oh, what about over here, let me see this". And right now anybody who makes a selection overrules everybody else's selection, and we will look at the collaboration protocols next. So as I say it scales easily from 50K-100K, but we have gone up to 300K-400K, and it supports collaborative analysis. So let me just do some quick final words. The fact that you have got so much data is not unique. There is more data showing up all over the place; data.gov is making government data available, we are getting sensors collecting all this data, and really this domain is about looking at ways of analyzing and presenting the data. And we have some other projects that are looking at more compelling ways for you to tell stories about the data; for you to be able to do a guide that people can stop and interact with. But really, in some ways I have been trying to show these pictures very quickly, but the purpose of visualization is insight and not these pretty pictures. And we are looking for different ways of giving insight, and again, Ben Shneiderman has a great quote about how it gives you answers to questions you didn't even know you had. What was going on in this outlier? I didn't even know that something was going on here. Let me look into that some more. So if you actually know the question you are trying to answer there might be a better way to just mine the data, get that specific analysis. But if you are trying to discover patterns this is a promising way to do so. So I ended up just about exactly on time and will be happy to answer some questions. >>: Okay, questions.
>>: So when you started you showed at first at least one-dimensional data, presented in the style of traditional Excel business graphics, and much more interesting two-dimensional data here, but our problems are in effectively visualizing highly dimensional data sets; way more than two dimensions. How many can you squeeze in? Are there any plans of going in that direction? >> Steven Drucker: So, the way that I have been looking at dealing with multi-dimensional data is the sort of divide and conquer approach. I am not convinced that 3D right now gives much benefit at all. I mean, again since you --. >>: I am sorry, why are you not convinced? >> Steven Drucker: Partly we are mostly using 2D displays, and on the 2D display we are already looking at some sort of projection from 3D onto 2D. >>: Have you actually experimented with 3D? >> Steven Drucker: Yes, quite a bit. I have about 10 years' worth of experimentation, and in fact the entire information visualization field is littered with people who have felt that 3D was the way to go, and yet we have not managed to actually make it, when tested, useful. Now that doesn't mean that someone's not going to come up with a breakthrough, but the fact that we are projecting down from n dimensions into 3D and then down into 2D means you have got occlusion issues, you have got size and relationship issues that don't become immediately apparent. And again, there are specific domains where I think it can be useful. And astronomy might be one of them, because you have a strong spatial component in your data. But when you are dealing with abstract --. What's that? >>: That's an important [indiscernible]. And what you said is exactly contrary to our experiments. >> Steven Drucker: Oh, that's great. I would love to talk to you deeply about that, because at least in the information visualization community people have felt this and tried this, but have never been really successful in effectively using 3D. Now I have used GPU acceleration quite a bit, that's really important, and I have also looked at 3D as yet another dimension in another way; temporal data, and using the temporal data as another dimension, is also important. So there are a lot of different ways. Maybe we can break out afterwards and talk about that, because at least to date there has not been effective stuff in our field. >>: So there is a difference in time and space that you are looking at. >> Steven Drucker: Yeah, exactly. >>: Yeah, maybe, maybe not. >> Steven Drucker: Okay. >>: So when you are doing user evaluation of these interfaces how do you normalize the demographics of your users? I mean, 20-somethings are going to react differently than 50-somethings. >> Steven Drucker: Yeah, I mean we try to essentially ensure that we have got a good sample and that we use that as a variable in the analysis of this. At least on using the tablets, the ages were actually 32 to 64, and so they were older. We actually expected them to be a little more, you know, let's stay with what we know. And we were kind of surprised that they were, oh no, let's try this new way. And maybe it's because lots of them already have phones and they are doing these gestures already on those things, so it's not completely new. But you are absolutely right; people have a much different facility based on their exposure to video games and other things. So it's an important thing that we do try to take into account.
It's hard, because it's hard enough to get, you know, 20 people that are experienced at analyzing data to come in and use the product, much less control for age as a variable. >>: The most [indiscernible] that you showed were filtering, and what about brushing, especially with [indiscernible]? >> Steven Drucker: Yes, again, I showed linking in that last demo; I think brushing and linking are really very important, especially with touch. Really I give some other talks about a whole bunch of other things. I find that the general approach I tend to take is I try to extract some salient features that are going to be interesting, figure out a layout of those salient features, and then use sort of divide and conquer to focus in on those regions, and linking and brushing. And those are the techniques that I use over and over and over again when I am trying to pull apart and use data. So I think that's pretty important. >>: Does it already animate a 3rd dimension? Time is the obvious one, then others? >> Steven Drucker: I mean, it's actually very easy for us to put in other dimensions and do that. So time is already animated; you can certainly play it over time. We can also use the 3rd dimension. We are doing a graphics [indiscernible] and we have got x, y, and z positions on any of our data and we can present it that way. So yes, absolutely, and we can also map size and shape onto any of these points and be able to get even that many more. Part of the history of information visualization is finding out which aspects are perceptual. So again, layout was the first thing and size is the next, and a couple of other things. On the temporal side, there is a lot of discussion and controversy about how important it is to be able to play data over time. You know, maybe a lot of you have seen Hans Rosling, who gave a great demo with Gapminder and looked at the UN data and the history of that. And he plays this wonderful time-changing thing. What we have actually found is that with him guiding you where to look it's great. If you just play it, it's actually not so great, because too many things are moving. Now if you are the one interacting it helps a lot. So it goes back and forth how useful animation is, except when you are interacting or when you are guided in what you are looking at. >>: Last question. So your approach to the big data, I thought that was very interesting, to sort of stream it back and get interim results and keep showing it. Do you see any hope for dealing with large data where you want to display quantities that you can't just keep a running total of as the stream is coming by? You know, like you were doing all sorts of crosscutting displays with the census data, different axes and so on. Do you see any hope for generating that sort of thing in any sort of timely way with very large data sets? >> Steven Drucker: It's an interesting question, in that I can see sort of harnessing a Hadoop cluster and condensing some set of the results on the fly and interacting with those things, issuing queries and seeing those things. I am not sure how interactive or expensive this will be in terms of cost. So, I mean, certainly some of the motivation behind this was to try to prevent pre-processing. And you know, you can do an OLAP cube and be able to kind of do sums and other things very quickly, but you can't do things outside of that domain.
So really what we are kind of doing is taking a sample of the data and operating on the sample, but a priori we don't know how big the sample of that data should be to be significant. So it's an ever-growing sample. Especially if you have got a filter on the data it's really important that you grow, and grow, and grow, and grow. So that's at least the technique that we are doing right now. Putting computation in the loop would be really interesting, and I am not sure. >>: Yeah, that would be very useful. >> Steven Drucker: Absolutely. >>: [indiscernible] so, thanks very much again. >> Yan Xu: So it's my great pleasure to welcome Kirk Borne, who's going to talk about conquering the astronomical data flood through machine learning and citizen science. >> Kirk Borne: Okay, thank you very much. I just added one sentence to my slide in the last two seconds there, which caught me off guard. So, part of this talk is sort of an extension of some of the concepts we just heard in the previous talk about how the visual inspection of data provides a lot of insight, or at least provides an opportunity for insight. But at the very least it provides an opportunity for people to say, "Ah ha, I see this in the image". And that's sort of how citizen science was born in this sense. And I will say a few things in this talk which are very familiar to the astronomers and are extremely familiar to people who are already doing this stuff, but I assume that there are some people who haven't done this. So citizen science is essentially volunteer scientist involvement in the science process. And Galaxy Zoo was a project in which 800 to 900 thousand galaxies in this one digital sky survey were presented to a community of users who volunteered to classify those galaxies. So with classification, that step was really just characterizing what they looked like; were they elliptical in shape, were they spiral in shape? Were they mergers or something else? And so that's a very short summary, but there's lots to be said which I won't say in this talk about that. So of course our problem is big data. So another thing which we have mentioned a few times this week is the Large Synoptic Survey Telescope. I am not going to give a summary of what that will hopefully be 10 years from now, but just mention a couple of data challenges associated with this telescope, which has been proposed and hopefully will be funded in the coming years. So, one of those data challenges is that the LSST will acquire [indiscernible] images of about 20 terabytes of data per night. And in these images there will be roughly 100 million sources per image, taken every 40 seconds, throughout the night. And so this quantity of data represents about an equivalent amount of data that you can cram onto 40,000 CDs. So from my perspective, I am asking a student to mine this data, to analyze this data, to even [indiscernible] this data, to do something with this data. It's just a completely different paradigm, where currently I might hand my student a CD of data or a couple of CDs of data and ask them to do something with it. Now it's 40,000 different, new CDs of data every single day for 10 years. So after the life of the survey, 10 years, this corresponds roughly to a football stadium filled with CDs. So this is qualitatively different than anything imaginable in astronomy so far. So the real challenge for us is how do we make the best scientific use of this? How do we make the surprising discoveries that are waiting in there? So how do we find the unknowns?
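Back-of-envelope arithmetic on those round numbers (the 650 MB CD and the ~10-hour observing night are my assumptions, not LSST specifications):

```python
# Rough scale of the data challenge just described.
TB = 1e12
nightly_bytes = 20 * TB                    # ~20 TB of images per night
cd_bytes = 650e6                           # one CD-ROM, ~650 MB
nights = 10 * 365                          # a 10-year survey, roughly

print(f"CDs per night: {nightly_bytes / cd_bytes:,.0f}")
# ~31,000 CDs, the same order as the "40,000 CDs" quoted in the talk
print(f"Survey total : {nightly_bytes * nights / 1e15:,.0f} PB")  # ~73 PB

images_per_night = 10 * 3600 / 40          # one image every 40 seconds
print(f"Images/night : {images_per_night:.0f}, "
      f"sources/night ~ {images_per_night * 100e6:.1e}")  # ~9e10 detections
```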
So it goes to this idea that more data isn't just more data; it's really qualitatively different. So the second data challenge is different from the data volume. It's the event volume, which is: each night, as a time domain survey, it repeats observations of the sky, and each of these nights it will find roughly 1 to 10 million, so let's just say 2 million, new events in the night sky. And an event is anything that has changed since the last time we looked at that spot. And the real challenge is therefore, what are those things? Is it really scientifically imperative to follow up on them? Are they more of the same kinds of things we have seen before? Are they totally new objects that need some kind of follow-up observation to figure out what they are, and so on? So the real challenge is to understand how they are behaving before we try and put a label on it. So the way I say this is: characterize first, classify later. Okay, so in the language of data mining and machine learning that would be: apply the unsupervised learning techniques first and then apply the supervised learning. And that is, don't try and put a label on it; that's not the point. We want to describe what it is. Okay, so we have heard talks about this already yesterday and we will hear more I guess today when [indiscernible] talks. And that is, if you characterize a variable object that appears in your image and you say it's increased by 5 magnitudes since the last time we looked, which was a day ago, and it's, you know, one arcminute from a galaxy, or it's in the spiral arm of a galaxy, you might have a pretty good guess of what it might be: it is a supernova. But if you don't know all those extra pieces of information then all you have is a single data point, which is a flux with an error bar. And we want to characterize it first and then curate these characterizations, curate these descriptors of what's happened or what we see in the data, and then allow the scientist to proceed with the understanding of it, labeling it, classifying it. And so this is where citizen scientists can come in real handy, because they may not know the language of astronomy. They may not know, well, some of them are very smart people, they do know the language, but in general the volunteers are not required to know the language. They are not required to know modern astrophysics, but they can certainly use their own human cognition to see a pattern, to see a trend, to see an anomaly in an image; if they see something they have never seen before, if they are trained to look for a certain thing and they find those things, and then they find things that deviate from that, and so on. So characterization includes this feature extraction, or first detection and then extraction; so identifying and describing these features. Okay, so this is where the human inspection comes in very handy. And the end goal of this is that we are not going to have humans doing all of this, because that would defeat the purpose of having NSF build a data management system for us. But no, seriously, it's to train the automatic classifiers which we will build into the pipeline. So most of, well hopefully all of, the known types of events and objects in astronomy will have algorithms already in the pipeline, and as we discover new things through visual inspection, whether it's citizen scientists or science team members who discover them, we can re-train the pipeline algorithms to find those objects. So then the focus goes onto the ones that are unknown, the unknown "unknowns".
The ones that are more outlying, that are more outliers with respect to the known behaviors that we would expect to see. So in a way this is the way of dealing with the data flood, and that is, you have this sort of pyramid where you have all these things that you already know about and they are already encoded in the pipeline algorithms. And as you move up this pyramid, to the more extreme and more unusual and more rare types of objects, you get help with interpreting those, analyzing those, characterizing those, and then that pushes more of those discoveries into the pipeline. And that opens up the opportunity for, again, visual inspection of the very rare things. So what we put in front of the end user, and the end user again might be a member of the science team or it might be a member of a citizen science project; the things we want to put in front of those people are the set of things that are different, peculiar, unusual, unknown. So when the volumes of data get this large we try to automate as much as possible. So we need to train as much as possible. So once we have these features and have extracted them from the data, which is what I would say is a nice level 3 product, as it's called in LSST land, or value added catalog if you want to think of it that way, is a curated set of these things. Okay, so somebody in a university or some research team may curate features of galaxies, or curate features of time series from these variable stars. And then people can, essentially it's a database of characterizations, which are completely descriptive of the data and not descriptive of the astronomer's opinion of what's in the data. Okay, so this is the characterization step; not the labeling, but the measuring and detecting of features, and then curating that set and making it searchable, mineable by others, to look for patterns, trends and relationships to known astrophysical phenomena and maybe even discover new astrophysical phenomena. Okay, and so in the language of unsupervised things here: clustering, or class discovery; principal component analysis, which is of course dimension reduction; outlier detection, which I prefer to call surprise discovery, because an outlier is basically something that is not behaving like the rest of the data, so it's a surprise, it's behaving in a manner inconsistent with the normal behavior of the data distribution, okay, so finding those unusual behaviors in the data; link analysis and association analysis or network analysis, basically, you know, building a sort of network of these curated features to find strong associations and strong links. And hopefully as you find those things they are actually implying some kind of astrophysical process behind them. So the discovery of these links and these features in the characterization space hopefully leads to better insights into the astrophysical processes that are at work. I mean, that's what astronomers do all the time. We are just doing it at a much larger scale. All right. So the promise: big data leads to big insights and new discoveries. We thought that this was kind of fun: the KDD conference starts today in Beijing, so get on the plane quick, Alex. All those who just came back from Beijing, head back. Okay, the scary news is that big data is taking us to this tipping point. So it's coming down at us and old tools are not going to work, like that guy there. The good news is that big data is sexy, and by that I mean we can really attract really great minds and great thinking people, as evidenced by the people in the room today.
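That unsupervised toolkit he just listed — clustering, dimension reduction, outlier (surprise) detection — fits in a few lines of scikit-learn; a minimal sketch on synthetic features (the data, cluster count, and thresholds are invented for illustration):

```python
# "Characterize first, classify later": start from a table of measured
# features with no labels, reduce it, cluster it, and rank the surprises.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 12))   # stand-in for curated features
features[:25] += 6.0                     # a few planted "surprises"

X = StandardScaler().fit_transform(features)
Xr = PCA(n_components=3).fit_transform(X)                   # dimension reduction
clusters = KMeans(n_clusters=5, n_init=10).fit_predict(Xr)  # class discovery
score = IsolationForest(random_state=0).fit(Xr).score_samples(Xr)

surprises = np.argsort(score)[:25]       # lowest score = most anomalous
print("candidate surprises:", surprises[:10])
# The ranked outliers, not the bulk, are what goes in front of the humans.
```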
But also we can attract these citizen scientists. We can attract people to our problems because they see it's really exciting and interesting, and it's pretty cool to work on this stuff. So if you can't read the cartoon, she says to Dilbert, "So what do you do for a living?" And Dilbert says, "I am working on a framework to allow construction of large-scale analytical queries on unstructured data". And she says, "I'm a little turned on by that". And he says, "Settle down. It's just a framework". >>: [indiscernible]. >>: This is Dilbert, he is very sexist, and it's not my fault. >>: [indiscernible]. >> Kirk Borne: So there are many technologies associated with big data, including approaches that are computational science, approaches that are data science, and, as we are now saying, approaches which are citizen science; okay, so crowd sourcing data. So a colleague of mine put a slide together somewhat similar to this and I sort of enhanced it a little bit. Some of you have seen this slide before, and I know George has presented things like this before; so, modes of computing sorted historically: computational and numeric computation, in silico computing. Okay, that's our high performance computing, computational science paradigm. I like to say to my students who are just learning how to do science, when you build a model for something, first you have to sort of parameterize the problem. And by parameterizing the problem you immediately inject subjectivity into what you are doing. So I collided pairs of galaxies; I used to do this a lot in my younger days. And I would make all kinds of assumptions about the properties of the galaxy field, the distribution of dark matter, the ratio of luminous to dark matter in the galaxy, the recipe for star formation; all these things were knobs, and they were characterized by some parameter of the model. So basically I was parameterizing my ignorance of how it all really worked, okay. And so the model in that sense was very subjective; if I didn't have a good understanding or representation of a certain astrophysical behavior to put into that numerical code, I was going to get a garbage-in, garbage-out situation. Okay, so even though it's very powerful and you can do an enormous number of things with numerical computation, and I spent half my career doing this stuff so I am not dissing it in any way, I am just saying it has subjectivity associated with it, that's all. In the realm of data science, or computational intelligence, now the ideal situation is that it becomes objective and data driven. And this is where I like to focus, mostly for myself, on unsupervised techniques where we are not trying to apply a label previously learned, because there might be something new going on. So I guess in the surprise discovery space people sometimes call that semi-supervised learning, because you try to classify an object into a known class, to put a known label on it, but its behavior in feature space is so distinctly different from everything else that you need to create a new class, or new cluster, out of that data point or cloud of data points. So you try to do a supervised algorithm on it, that is, classification, but you end up having to do something unsupervised with it. So it's data driven; it's objective in the sense that it's the evidence itself. It's a forensic based approach to the science. All right. So this is great, but it only works as well as the algorithm that you have working on it.
And again, if you are applying the wrong algorithm, or you don't know the right algorithm to apply, then you might be missing something. So human computation sort of fills that gap, where we haven't quite figured out what to ask the data yet, you know, which feature extraction algorithm to apply to the data yet, and then we have people. And so when I say human computation it's not, like, citizen science, but actually members of the science team. So when I say scientist it can be any science team member now: people who look at the data and exploit the capabilities of human cognition to recognize patterns, to recognize anomalies and outliers in data. And this is the power that they bring. So you think about the discovery of Hanny's Voorwerp. Some of you know the story about this blue blob next to this galaxy, and sort of a traditional algorithm for galaxy morphology --. Oh, okay, the building is not on fire. So the traditional algorithm for a data pipeline for galaxy morphology is that you scan the image until you find an extended source, then you scan those pixels until you find the peak of the distribution of that extended source, and then you measure all the brightness in sort of the matrix of pixels until you reach the sky brightness. Then you stop, and now all those pixels make a galaxy, and then you start measuring shape and color, orientation, asymmetry, and all kinds of other things. But if something is outside of that box it's no longer considered part of the galaxy. So Hanny's Voorwerp was one of these things that was outside the box of the pixels around that galaxy. So when Hanny was asked to classify this galaxy one morning, a nice looking small galaxy, her being human, you know, she wasn't trained, as the algorithm was, to look only in that matrix of pixels; she looked and said, "What is this?" So she was providing context to the data. And this is what the human can do for us. Okay, again, so whether the human is a trained PhD astronomer or one of our volunteer citizen scientists, the human naturally will look at the context of what is being presented in front of them to understand what it means, okay. So I have this problem when I come to conferences. I go to too many things and I will see someone, and of course I know their name in real life, but at the instant they arrive in front of my face, who is this person? And I say, "Oh yeah, I am at an astroinformatics conference, it's so and so". I had that happen to me Monday morning a couple of times. People walked up to me and I had to look at their name tag and it was very embarrassing. Sorry, Joe. But it's just that once you have the context then it sort of fits. Okay, so providing context to understand this anomaly; well, not just to understand it, but to actually be the first one to say, "Hey, look at this, it's different here". All right. So Galaxy Zoo is an example of crowd sourcing. And so I just want to mention that we have this project, which [indiscernible] is the leader of, and there are many, many citizen scientist projects within this Zooniverse. >>: We re-launched yesterday. >> Kirk Borne: Re-launched yesterday, yeah, yeah, yeah. There are a whole bunch more new galaxies; a couple hundred thousand from Hubble or something in there. >>: Oh and [indiscernible]. >>: And from [indiscernible]. >> Kirk Borne: Oh yeah, from [indiscernible]. >>: Now is it the same Galaxy platform as for the biology? >>: Yeah, it is.
>> Kirk Borne: Yeah, there are lots of different things there. >>: Because there is a platform for [indiscernible], visualization and processing which is called Galaxy [indiscernible]. I don't know [indiscernible]. >> Kirk Borne: No, no, no, no, no, that's something else. We are talking about real galaxies, not a software package called Galaxy. All right. So just a brief statement which will lead up to a more specific thing I will show you. And that is, of course, there are two types of galaxies normally in the universe. There are the spirals and the ellipticals. Here are some ellipticals, here are some spirals, but there are also lots of peculiar galaxies; things that don't fit. Okay, again, coming back to this power of human cognition to discern anomalies and things that don't fit the normal pattern. So this is where discovery becomes possible when you have people looking at the data, because the algorithm may want to claim that something like this is an elliptical, or this is a spiral, or this is a spiral, but it's really quite a bit more complex than that. Okay, so there are lots of things you can do with peculiar galaxies. For example, one of the other things, you may have seen Galaxy Zoo announced recently that you can write out any phrase you want with galaxies. So I wrote my name last night. All right. So you just put in a phrase and it finds the galaxy alphabet to spell out whatever you wish. Well, you can also do real scientific things. Okay, so galaxies gone wild; and that is what I spent the first part of my career doing, which is understanding the astrophysical process of two galaxies passing each other in space over the age of the universe, the transference of their orbital energy into internal energies, and those two galaxies merging and becoming one. And this merger process, this assembly process of galaxies, may explain why we have two generic types of galaxies, because the spirals may merge to become the ellipticals. So the study of this has developed very sophisticated theories, like this equation at the top of the slide here, 1+1=1; the only equation in my talk. So, the merger of two galaxies to become one. And so way back in the day, so I am really dating myself, back in the late 70s, early 80s, I personally developed a computer algorithm, a numerical simulation algorithm, that would collide two galaxies, and I would then explore the shape of those two galaxies and their merger product as time went on during the simulation to see if it matched one of these observed things. And I would tweak the orbit parameters, and the viewing parameters, and the mass ratios, and all kinds of things to find the set of galaxies in my simulation that best matched an observed pair. And that search process took quite a bit of time, so over the course of the four years I worked on my thesis I probably ran roughly a thousand simulations, and solved the orbit and mass ratios, and internal shape parameters, in some sense the complete solution, for two pairs of galaxies. Okay, two interacting pairs of galaxies in four years. So along comes Galaxy Mergers Zoo. So now the idea was, colleagues of mine at George Mason University, led by John Wallin, we put together one of the tasks on the Zooniverse site, Galaxy Mergers Zoo, where we present to volunteers essentially a Las Vegas sort of slot machine user interface. Okay, so there is a 3x3 array of galaxies. The one in the middle is the actual Sloan image; okay, an actual image of a pair of colliding galaxies from Sloan.
And in the other boxes, the other eight boxes, would be eight independent simulations, which you can watch run, like you watch the oranges and apples spin on the slot machine. So you push go and you see eight new simulations. And if you see one that looks close you click on it, and if not you push go again, and then everything spins and you see more. And so after the first day I think we had 20,000 simulations viewed. Okay, so my thesis was done 20 times over in the first day. So what people were doing is they were clicking on and discovering the simulations that looked most like the pair there. And there was --. Thank you. There was a lot of randomness in the parameter selection, but it actually improved with time, and there are ways in which we changed the interface to enable people to start selecting their own parameter ranges and so on, but that's another talk for another day. But I just wanted to show some examples in the next few slides. So they will all have this sort of pattern in the next few slides. There will be the Sloan image, which is reproduced in gray scale in this corner, and then just three examples of simulations that people found just by this inspection. And I should say that we have done about, I have already lost track of the number, like 60 or 80 galaxy pairs now, and about 10 million simulations viewed. So again, I spent four years looking at a thousand simulations, and our 20,000 or so volunteers have now viewed roughly 10 million simulations and found really good matching pairs. And there is another feature of this site, which again I could talk about over the break, called Galaxy Wars, where we take some of the best ones that people have found and pit those against each other. And so basically we say, "Okay, here's the Sloan image, and here's what some people thought was a good fit. Here's what other people thought was a good fit. Which one do you think is the best?" And so we did all these [indiscernible] tests, one against the other. We pitted these simulations one against the other. And the ones that were promoted to this slide were those that won all of their Galaxy War competitions. That is, every time a particular simulation was compared with another simulation that people thought was a good match, this particular one always won. All right. So there are so many of these, there is a [indiscernible] problem with getting them all to compete with one another. And so there are many of them which have unanimous votes, but not too many, like a handful. In other words, there really are three that won every time in their competition. So anyway, here are some examples --. Yeah? >>: Are these snapshots of the simulations, or are they run in real time? >> Kirk Borne: Oh, you actually can see the galaxies move. We are doing it in real time. It's not full n-body, okay. It's sort of a restricted three-body where the force field is actually calculated from the actual galaxy star distribution. So people are actually watching the simulation take place. And it goes quickly. >>: Yeah, but that might actually make it more difficult for them to --. >> Kirk Borne: No, people are having fun with this. I don't know if it's --. No, the simulation runs and then it stops at a certain point, because we know the projected separation. So it will stop at that point where it reaches that. But anyway, we can talk about that separately. But the simulation doesn't just keep running so that they have to find the moment.
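The "restricted three-body" scheme he describes — massless test particles responding to two moving galaxy cores — takes only a few lines; a toy Python sketch (the units, masses, softening, and kick-drift integrator are my choices, not the actual Galaxy Mergers Zoo code):

```python
import numpy as np

G, EPS, DT = 1.0, 0.1, 0.01          # toy units; EPS softens the force

def accel(pos, core_pos, core_mass):
    """Acceleration on particles at `pos` from softened point-mass cores."""
    a = np.zeros_like(pos)
    for c, m in zip(core_pos, core_mass):
        d = c - pos                  # vector from each particle to a core
        r2 = (d * d).sum(axis=1) + EPS ** 2
        a += G * m * d / r2[:, None] ** 1.5
    return a

core_pos = np.array([[-3.0, 0.0], [3.0, 0.0]])       # two galaxy cores
core_vel = np.array([[0.3, 0.2], [-0.3, -0.2]])      # on a passing orbit
core_mass = np.array([1.0, 1.0])

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
stars = core_pos[0] + np.c_[np.cos(theta), np.sin(theta)]   # disk of core 1
svel = core_vel[0] + np.c_[-np.sin(theta), np.cos(theta)]   # ~circular spin

for step in range(2000):             # kick-drift (symplectic Euler) steps
    svel += accel(stars, core_pos, core_mass) * DT   # stars feel both cores
    stars += svel * DT
    core_vel += accel(core_pos, core_pos, core_mass) * DT   # cores feel
    core_pos += core_vel * DT                               # each other
# `stars` now traces tidal tails and bridges to compare against an image.
```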
It runs to the point where the projected separation and the tilts of the galaxies, as best as we can tell, look something like the final output. And so there is a lot of pre-processing that the science team does before one of these even goes up on the website; that is, figuring out what those end points are of the simulation so that when it's presented to the user it stops at a point where the separations and orientations correspond to what they are looking for. Okay. So just to show you some examples, to show that simulations, just like the real universe, can produce a wide variety of outcomes, and it's really remarkable that people can find simulations this way that actually match a whole range of peculiar morphologies. So again, the goal is we want to parameterize and characterize these pairs of galaxies for their [indiscernible] of the revolution, the mass ratios, the likely chance of merging, how soon they will merge, and a number --. We are actually doing some star formation in the simulations now, and actually using, I should say, the best fit models that the humans have provided for us to do full n-body simulations with star formation, you know, tree code plus SPH. And so we have a graduate student who is actually doing this for his thesis. He is using these initial parameter models to feed more sophisticated simulations. And one of the interesting discoveries from this is that the best fit orbits fit into a narrow tube in orbit parameter space. If you look at all the possible orbits, so we show the trajectories of the collision for all possible orbits that were presented to end users, it just fills this volume. But then when you turn off all the trajectories that people didn't click on, and you show the trajectories of just those that people thought were good matches, they tend to follow a tube. They fit into a well-defined tube in this three-dimensional space. And then when you pick out those simulations that won the Galaxy Wars, the head-to-head competitions, they fit into a much narrower, confined set of trajectories. So they really are finding a unique, or hopefully, considering the constraints we have and the number of parameters we are looking at, a unique solution there. So I am supposed to stop. I will just run past some slides. So again, we are trying to train the automatic classifiers with this inspection. Again, the human inspection may include members of the science team or millions of citizen scientists. And at the end of the day we would really like to annotate and tag things and curate them so that discoveries can be made. And this is really applicable to these events that LSST will discover, so people can start describing things they see in the time series. And all of these words pretty much say the things I have already been saying: that we want to use this service to actually enable scientists to explore that parameter space, that feature space, and start discovering anomalies, outliers, or as I say surprises, and better characterization of known events and discovery of the unknown, unknown events. And so we are really addressing these challenges both through data science and through citizen science. Okay, so human computation, which includes the human providing the tag; we want to move from this space down here to autonomous tagging.
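The train-the-pipeline loop he keeps returning to — humans label what the current classifier finds confusing, the classifier is retrained, the confusing residue shrinks — might look like this sketch (scikit-learn; `ask_human`, the confidence threshold, and the model choice are all assumptions for illustration):

```python
# Triage: auto-label what the model is sure about, queue the rest for
# human (citizen-science) inspection, then fold the new labels back in.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def triage_round(model, X_unlabeled, ask_human, threshold=0.9):
    proba = model.predict_proba(X_unlabeled)
    sure = proba.max(axis=1) >= threshold
    auto = model.classes_[proba.argmax(axis=1)]      # pipeline's own labels
    hard = np.where(~sure)[0]                        # the "unknowns" queue
    human = ask_human(X_unlabeled[hard])             # hypothetical callback
    return auto, hard, human

# Retrain with the human labels folded in, then repeat the round:
# X = np.vstack([X_train, X_unlabeled[hard]])
# y = np.concatenate([y_train, human])
# model = RandomForestClassifier().fit(X, y)
```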
So better and better algorithms here, so that the data which are shown to the scientists, the humans, become more focused on those that really need attention and not those that are obvious. All right, so. Thank you very much. >> Yan Xu: Okay, so questions? >> Kirk Borne: Yes, Joe. >>: You may have already answered it, but when you were doing the, I guess the nine-panel comparisons with the actual Sloan images, did you look to see if there were any biases evident, that humans were tending to pick images preferentially on the right rather than on the left, or toward the bottom? >> Kirk Borne: No, I have not done that. >>: There was a documented case of [indiscernible]. >> Kirk Borne: No, no, that's a different question. >>: But it is a similar bias. >> Kirk Borne: No, what happened in that case was the buttons were, you know, is it elliptical, clockwise, or anticlockwise, and people will tend to gravitate to the middle button when they are confused. So they tended to gravitate to that middle button, which led to more anticlockwise galaxies. But in this case it's symmetric all the way around. That is, there is no preference where that simulation --. So a lot of the same simulations are presented to multiple people. And there is no preference to where we place them in this 3x3 array. So a given simulation can appear anywhere at any given time. And so if there is some preference to click on the left all the time it's going to be washed out. >>: What George was referring to is a really interesting thing where people tended to preferentially select right-handed spirals over left-handed spirals. And when they dug into the data --. >>: You are from Australia, right? >>: Yes, that's right. But no, they found that right-handed people tended to pick right-handed spirals and left-handed people tended to pick left. So in fact there is a Galaxy Zoo psychology paper published on this. >> Kirk Borne: Yeah, right, but in fact the solution was neither astrophysics, which people initially thought, that the universe had this handedness, nor psychology, that is, it wasn't a perception thing; it was the user interface, going right back to Ben Shneiderman's work, which we heard about earlier this morning. And that is, how you place the buttons on the screen makes all the difference in where people will click. Anyway, Ani had his hand up first I think. >>: As far as scaling up to the LSST event numbers or even other kinds of classifications, do you think citizen science will get there to [indiscernible]? >> Kirk Borne: Well, if we were to launch a citizen science project today where we asked people to classify 20 billion galaxies instead of the 200,000 galaxies, no, it won't scale. If we ask people to look at 2 million events every night versus what Planet Hunters does today, which is some few thousands of [indiscernible], no, it won't scale. But the goal is that we are learning from the current citizen science experiments, the light curves, you know, the time-series light curve stuff with Planet Hunters and so on, to train the algorithms better. So by the time we get to LSST we will understand a lot of the "anomalous" things well enough by then to automatically classify those. And what will be left, hopefully, will be the things which are still new and different above that.
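Because each simulation can land in any of the eight simulation slots of the 3x3 grid, the positional bias Joe asks about could be checked with a simple uniformity test on click counts per slot. A minimal sketch, with made-up counts rather than real project data:

```python
# Hypothetical bias check: under random placement, clicks should be
# uniform across the eight simulation slots. A chi-square test against
# the uniform expectation flags any positional preference.
from scipy.stats import chisquare

clicks_per_slot = [1180, 1215, 1190, 1254, 1201, 1187, 1223, 1192]  # 8 slots
stat, p = chisquare(clicks_per_slot)  # null hypothesis: uniform clicking
print(f"chi2={stat:.1f}, p={p:.3f}")  # a small p would indicate position bias
```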
Okay, so the whole goal is to move the unusual, the anomalous, and the stuff that doesn't fit our algorithms into the space where we have an algorithm that we can put into the pipeline, and then focus on the things that are left. So it's that multi-fingered thing we saw the other day. We are trying to move that envelope of what we know how to label and classify up into that space of unknown unknowns. >>: George? >>: Just a comment to expand on that. I think harvesting human pattern recognition and domain knowledge during [indiscernible] is exactly what we ought to be doing, because on that scale there is not enough human time and attention [indiscernible]. And we have been trying to do this for some time now. The approach we have been gravitating to is not just open citizen science, which has its own good uses, but trying to get to communities with a certain level of expertise; for example, amateur astronomers, who can answer much more sophisticated questions posed to them. And also dynamically changing the level of the inquiry on the basis of what that particular citizen scientist has done in the past. And overall I would say that this is continuing on the path of collaborative human-computer discovery, where a computer can suggest something, the human can agree or not agree, and together they evolve toward a much better solution. >> Kirk Borne: So, in fact, on the LSST team we are having that very conversation, because when we discuss our user interface we have the science user interface people in a room along with the education people. So the idea is, how do you recognize, if you can, what type of user you have just by their interaction? And then give them different tasks, so to speak, in this volunteer space that are appropriate to their skill level. >>: I think it was kind of interesting, your pyramid, you know, sort of putting the hard problems up at the top for citizen science. I think that what often happens in processing is that the stuff that you put up at the top so that people can look at it will confuse the baseline processing, because of the volume of data. You know, it's not that the algorithms are just so good at picking out the anomalous stuff or the outlier stuff; it sort of folds it into what it's looking at and confuses the processing. So I think there is actually a challenge down at the bottom. And then the other thing that I just want to say is that it's kind of interesting, and I know you have got a volume problem and I know you have got a lot of data coming out, but we have always worked hard to do just the opposite. When we have to do validation or quality control we have kept it to one or two people, or a very tight group, just to get the subjectiveness out of it. But you know, this is a different scale of things. >> Kirk Borne: Right. >>: And I guess there is a concern about that subjective nature; not only in the computational stuff, but in the citizen science stuff. >> Kirk Borne: Well, you hit on a very important point there I think, thank you for doing that. And that is the quality assurance. So this type of human interaction with the data by the science team is all-important, you know, for actual pipeline and detector quality assurance type issues. You know, for finding these anomalies in the data; sort of QA of the data.
And so we are not necessarily asking these volunteers to do the quality assurance for us, right, because hopefully it has already gone through that process: we know it's not an image artifact, we know it's not a glint in the optics; it really is some astrophysical thing there. You know, let's put this in front of the end user community who is really good at detecting the pattern. So I was talking earlier about outlier detection and how I like to call that surprise discovery. I mean the surprise might be that there is something wrong with your pipeline, or there is something wrong with your camera, or it might be a real, truly astrophysical phenomenon that is causing that six-sigma deviation from the rest of the data. Again, you sort of hit the nail on the head there. What we are trying to do, instead of doing outlier removal, which most of our pipelines are doing, where they do this three-sigma clipping or whatever we do and throw that away: no, let's take a look at what we are throwing away before we assume that it's just some statistical deviation in the data, when it might be an astrophysical deviation in the data. >>: Last question. >>: So, I have always wondered about a couple of ways that machine learning might help the citizen science [indiscernible]; so one is optimal combination of [indiscernible] labelers, some highly expert labelers and image labelers. The images can still be used in [indiscernible] ways, but maybe in different ways, with different ratings or different objects. And like you said, just choosing which objects to present to which labelers can be done [indiscernible]. >> Kirk Borne: Right, well certainly this whole concept of ensemble learning and multiple weak classifiers comes into play here. In fact I saw an interesting title for a paper about this topic recently, called Good Learners for Evil Teachers. And the idea is that these individual labelers may not have very high accuracy; you know, they may only be like 55 percent accurate. But if you have lots of these classifiers voting and they are all voting in the same direction, then you sort of have a pretty good idea. So yeah, the algorithms themselves provide some vote on what we think the thing is, but also the humans are in the loop. And so again, as you say, it is sort of that interaction between the two where we hopefully will get the power of finding the right weights for those votes so we come up with the best interpretation at the end. But again, this is exploratory research, so this is an exciting field right now I think. >>: And a proposal like that has been turned down three times in a row. >> Kirk Borne: Well, you know, all good proposals are rejected. >>: Yeah, right. >>: Okay, I am afraid we have got to stop there. We are running a bit behind schedule --. Let's thank Kirk again.
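As a closing illustration of the ensemble idea in Kirk's last answer, here is a minimal sketch, not the project's code, of how many individually weak votes, each perhaps only 55 percent accurate, can be combined into a more confident label. The labels, weights, and function names are hypothetical.

```python
# Hypothetical sketch of weighted majority voting over weak labelers:
# many unreliable votes pointing the same way give a confident answer.
from collections import Counter

def majority_vote(labels, weights=None):
    """Combine labels from many weak labelers; optional per-labeler weights."""
    weights = weights or [1.0] * len(labels)
    tally = Counter()
    for label, w in zip(labels, weights):
        tally[label] += w
    return tally.most_common(1)[0][0]

# 11 weak votes, each individually unreliable, agreeing in aggregate:
print(majority_vote(["merger"] * 7 + ["not_merger"] * 4))  # -> 'merger'
```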