>>: All right, everyone. I guess we'll go ahead and get started. I'd like to introduce our next speaker, Peter Fox. Peter is Tetherless World Constellation Chair and Professor of Earth and Environmental Science and Computer Science at Rensselaer Polytechnic Institute. Previously he spent 17 years at the High Altitude Observatory of the National Center for Atmospheric Research as chief computational scientist. Fox's research specializes in the fields of solar and solar-terrestrial physics, computational and computer science, information technology, and grid-enabled distributed semantic data frameworks. Fox is currently PI for the Semantic eScience Framework, the Integrated Ecosystem Assessment, and Semantic Provenance Capture in Data Ingest Systems projects. And if you take a look, many, many other varied experiences and background. Quite an impressive bio. I'd like to now hand it over to Peter to tell us about semantics for innovation in visualization and multimedia. Peter. >> Peter Fox: Thank you, Lee. Thank you for the invitation from a variety of people. It's very nice to have an opportunity to come back and talk to the ICSTI crowd. Because I remember the days when we used to get booed out of this room. And you'll remember it too. So what I'd like you to do is please buckle your seatbelt, because I'm going to run you through a series of slides and a series of concepts around this idea of innovation, around visualization for science. And I've got to introduce a few concepts. And I'm going to show you a few visualizations, but I'm actually going to talk more about visualization and some emerging opportunities. And this beautiful little graphic there, which of course is an artist's rendition of how the sun-Earth system works, is really key to some of the things I'm going to say. The artistic element, I think, is one where there are substantial opportunities.
And in these new means of conducting science, I'm going to tell you a little bit about linked open data, the semantic Web for real, and then some new opportunities in open source realtime or near-realtime software development to try and invoke a new way of conducting science. It's a part of data science, which you've heard about, but it digs in there a little bit more. And then, when I wrote the abstract, I called it the semantics of portrayal, and of course I didn't think about it carefully enough. It's really the semiotics of portrayal, and if you don't know what semiotics are, I'll show you what they are and just briefly describe why they're important. And they do include semantics. And then I want to talk about something very specific in science that we seem to be utterly failing at, and that is the representation of things that scientists really care about. And maybe the general public would care about too, if they knew they were there. But they're not. And then a little bit of speculation, to tell you where exactly we're heading. So my working premise -- well, my laptop is old and slow, so these slides will take a little time to come up. This is my working premise. Those of you who have seen me give a talk will have seen this slide many times. And really the top phrasing and the two bullets that appear underneath it are where we should be, because we do have a lot of technology capability, and we do want this access to some of the things we've heard about in this workshop today. Distributed knowledge bases of scientific data. But it's got to appear to be integrated and it has to appear to be locally available. And this is really important. And you'll see this is where the shift of the burden has to come. We have overextended ourselves and pushed a lot of responsibilities out onto users. And we have to make it look as though it's integrated and local, just like it was yours.
And that has a lot of implications in it. But we have this problem. And you can read all that. The really important part about this is the red statement, and this is something I bumped into a long time ago, and that is: really all data and information is created in a form to facilitate the generation and not the use, except by accident. That's why all of us are in business. If it was created to make the use easy, most of us wouldn't be in a job. It would be easy. And we have complications. There's heterogeneity, there's large-scale systems, there's all sorts of complications that go with it. So you might at this point say uh-oh, but really this statement was made around eight years ago. And I still carry it through today because we're making deliberate progress on this particular challenge. Now, because this is a meeting about visualization, I just want to give you a view of something that Jim Hendler and I have been talking about for about two years, since I went to RPI, and now we were invited to give a perspective article for Science called Changing the Equation for Scientific Data Visualization. Now, it's embargoed until Friday, so I can't tell you about it. Especially out -- >>: [inaudible]. >> Peter Fox: But I'm being taped. So I read the fine print. [laughter]. And the fine print, you know what the fine print says. But there are really three important points. And that's unlocked data, which you should be convinced of, and what we call visualization for the masses throughout the lifecycle of data. And I'm going to argue that there's perhaps too much attention being paid to the curation aspects of creating really nice visualizations. And really what they're doing is filling up the majority of the time we take in doing science. We can generate data really quickly. We can actually, because of many of you here, get it published, but it's this middle piece that's consuming more and more time.
And it's getting more and more difficult. The tools are not scaling. And so our premise is to do smarter data and smarter visualizations. We are, however, presented with diagrams like this. And this diagram is in front of really every scientist almost in the world. And so you can start to read it. It says data has lots of audiences, rising up from science to the public, museums, educators, policymakers, decision makers, and it becomes more and more strategic. So there's always this big emphasis on producing visualizations for a wide variety of audiences. And we've seen and heard a lot about that so far today. But scientists are getting crushed by this pyramid of extra use. So I'm really intending to bring it back to visualization earlier in the research lifecycle of data, which will also, if we do it right, allow this scaling up to more strategic uses, with less level of detail, less use of jargon and terminology, more integration if you like, more aggregation. And the only way to do that is to do this. I want to explain that. When I present this slide in classes, I tell my students this is their job security slide. And I'll explain it. So remember the early days of the Web. To make data available we put up static HTML pages, wrote the code ourselves and had listings. Who did that? Yup. When the common gateway interface came along, and scripting languages, we could generate pages on the fly from databases, so it was a little more maintainable. Who has done that? Yup. Now we have Web services, where we have rich additions and annotation and merging of datasets and all sorts of things that you can do. How many of you have done those? You've written a Web service, Walt? Fantastic. So you're pretty representative. He's a renaissance man. I love it. The key for computer science and information technology is that the complexity of the data structures increases as you go that way.
And the level of skill and resources needed to create it goes up this way. And that's usually why, you know, nowadays your graduate student or even you can only sort of sit down here. But the key is there's a decreasing level of resources required to maintain it. So what we're doing is we're shifting the burden -- we have to shift the burden -- away from the users to the providers. Rich services that work together. Now, just to emphasize that we of course have too many diagrams, I'm going to show you another diagram which I also like quite a lot, because initially it might paint this myth of a data lifecycle. Usually you see this diagram and there's this progression from data to information to knowledge. Well, I completely disagree with that. What it really is, is a set of spheres in which sometimes things are data and sometimes information. This green ellipse, a contextual ellipse, is really a great definer of when something becomes information instead of data. So this diagram is actually really interesting, as are the arrows back in this direction. There's the producer community. There's the consumer community. There's an overlap in the middle. And as you all know, just to make it very clear, this is the visualization. This is the information space. This is where, when you get people in the loop, this is the important part of it. So it requires bridging on the knowledge side, understanding of the data and data structures, representation. But really it's about this presentational aspect of data. And this is the bit that's actually pretty hard. All right. So those are the contextual setting slides. I want to flip now to -- so who's heard about linked data, linked open data? Who has heard about data.gov? Okay. You can't really get away from it. So as some of you may know, we've been helping the data.gov team in the US Government bring in linked data.
And this is just a quick screenshot from the linked open government data page at RPI. Jim Hendler is leading this activity. You've heard his name a few times now. And the reason I'm showing this is, with this unlocking of the data I've talked about, you suddenly have the opportunity to bring datasets together in a relatively simple way using really off-the-shelf technology, service-based technologies, not downloading the data but actually accessing it and mashing it up on the fly. And this just happens to be a GDP per capita US and China comparison. But I encourage you, if you haven't seen some of these things, to go to this site, click on the demos, look at the videos, check out the datasets, check out the tools and technology. The idea is to push all of this out into the community. And the reason why it's important is that, underneath, the idea behind linked data is that it's in an interlingua. The interlingua is RDF, the Resource Description Framework, or a variant of RDF. And it's a first-class citizen. It has a URI. It lives on the Web. Very important. And it's either accessible directly or queriable behind a triple store interface. A triple store is just like a regular database except the currency is a triple: a subject, a predicate, and an object. And it looks like this. Data.gov sits here. We convert it to RDF to make it useful. We load it in a triple store. We convert it to JavaScript Object Notation. And we pump it into Google graphics, Pivot, anything you like. That's a very simple curation procedure. But what we didn't anticipate was that the really important thing about data.gov is really none of this. It's all about this. Because it's an information resource. So the graphics part of it became fundamentally important and we were really failing because of this curation problem around getting together decent graphics.
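The pipeline Fox sketches -- data in, triples in a store, JSON out to a charting library -- can be illustrated with a toy triple store in a few lines of Python. This is a minimal sketch of the subject/predicate/object idea, not the actual data.gov tooling; the GDP figures and predicate names are placeholders invented for illustration.

```python
import json

class TinyTripleStore:
    """A toy triple store: every fact is a (subject, predicate, object) triple."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Query by pattern; None acts as a wildcard, like a SPARQL variable."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

store = TinyTripleStore()
# Placeholder figures, purely for illustration.
store.add("us", "gdp_per_capita", 48000)
store.add("us", "label", "United States")
store.add("china", "gdp_per_capita", 4500)
store.add("china", "label", "China")

# "Mash up" the data: pull every gdp_per_capita triple and emit JSON
# in the flat shape a charting library could consume on the fly.
rows = [{"country": s, "value": o}
        for s, _, o in sorted(store.match(p="gdp_per_capita"))]
chart_feed = json.dumps(rows)
```

The point of the pattern-match interface is that the same store answers questions nobody anticipated when the data was loaded -- which is exactly the "accessing it and mashing it up" step.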
And if you look on the demonstration side, you see tons of demonstrations. And I know this is being recorded, but I'm saying it. I really dislike all of them. But they're very easy to put together. So we need some new means. And the new means that came to us was through a creative exchange. We're very lucky to have the Experimental Media and Performing Arts Center. This is just a snapshot. It sort of looks like a big wooden whale suspended in a building overlooking Troy. It's really fantastic. You should go check out the website, which I'll show you at the end. But we bumped into some digital artists. And when talking to them and sort of describing what my ambition and goal for visualization was, I said I need to be able to visualize as quickly as I can think and experiment. And they said, well, that's what we needed for art. You know, when they made the transition from physical art, playing with pens and brushes and paints and building things, to digital art, it nearly killed them because the tools were horrific. So they built a lot of their own tools. They needed good creative visual tools at the speed of creative thought, feeling, intuition, mental representation. And fortunately they loved programming. Otherwise we would be in -- this is a little hard to see. Can we -- just for this purpose. This is probably the only one that needs it. And it's a little hard to see. So the idea behind this activity is that the artists, when they start to create, start in a small creative space but they want to go to a performance space. They want to go to a large space. So one of the key characteristics of the Experimental Media and Performing Arts Center is the intent to scale from your flat screen, which many of you are looking at now, up to a black box studio.
That includes any form of projection you can think of, any form of multi-dimensional sound, 360 theater, 360 projection, a whole range of multimedia capabilities, all completely configurable at your request. Thank you, Lee. So what we have done is we have initiated a collaborative project with a group called the OpenEnded Group, which is one of the cooler names I've ever heard. And they have this very magical piece of software called Field. I think because the primary author lives in Chicago and loves the Field Museum. I think I just worked that out. I haven't asked him yet. And don't worry, it looks boring here. I'm going to show you just a quick video of what Field can do. I encourage you to go to their site. They've got videos on Vimeo where you can go and look at their artistic works. But what we're doing is we're really bringing the artistic realtime development of visual artifacts that can be displayed all the way from laptops into creative spaces, all the way to scale. And there's no sound on this. And it's just intended -- now, this is realtime. We're going to draw some lines, going to write some Python code up here in this little box. But all these interfaces know what each of the other interfaces are doing. So it draws a line. You can go over there and move it around, add a point in the middle, have lines run through it. Now, this is for artistic purposes; the intent is to have the data under this so that you can come in at this pace, bang out a little bit of script code. It gets more impressive towards the end. And really play around. Experiment and see the impacts right away, so you can see colors flipping back and forth, adding lots of lines, and we're going to expand it here.
So you can start to imagine, if this was a seismic wave form, getting in and being able to play around with this at the scripting level. Now, the key difference between this and some of the tools you saw before is that in those you can only do things that were intended to be built into the tool. By giving script-level access with a very large toolbox of services, for which you can plug in and wrap any service or any tool that you have -- for example Matlab can be wrapped in this as well -- you can start creating visualizations like this. And this is all done in the realtime that you saw there and exported to PDF. What we're doing then -- we've just been funded for this exploratory grant from the National Science Foundation. We have a set of tasks. I wish I could have shown you the first visualization but it didn't come out in time. But what we're going to do is we're porting Field to Linux, so it will run in both server and client mode. We're basically linking the linked data with Field. So for example, and I'll show you some diagrams on this in a minute, we can feed the current graphics that come out of the linked data directly into Field for manipulation, distortion, those sorts of things. Then we'll unscrew the Google graphics and the Pivot graphics, unscrew JSON, query and consume raw RDF, which is where the semantics come in. And with the idea of visualizing at the speed of thought, with this idea of scale. And this is really where the semantics start to reenter. And so I'll just run through that. All right. So you remember that previous diagram, the graphics was in here. We're going to unscrew that, put in Field, and it will consume JavaScript Object Notation, which is a nice flexible language. The next version will talk to the triple store directly. And the next version will also be able to access RDF content from anywhere on the Web, so this really semantic-Webifies it.
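The JSON hand-off described here -- graphics data pulled out of the triple store and fed to Field as JavaScript Object Notation -- might look like the following sketch. The triples, field names, and the `to_json_objects` helper are all hypothetical; Field's actual ingestion format is not shown in the talk.

```python
import json

# Hypothetical RDF-style triples, as they might come back from a
# query against the linked-data triple store.
triples = [
    ("dataset1", "title", "Cloud top pressure"),
    ("dataset1", "points", "120"),
    ("dataset2", "title", "Equator crossing time"),
    ("dataset2", "points", "88"),
]

def to_json_objects(triples):
    """Group subject/predicate/object triples into one JSON object per
    subject -- the flat, flexible shape a scripting environment like
    Field could consume and re-render on the fly."""
    objects = {}
    for s, p, o in triples:
        objects.setdefault(s, {"id": s})[p] = o
    return list(objects.values())

payload = json.dumps(to_json_objects(triples), indent=2)
```

The design point is that the grouping is schema-free: any predicate that shows up in the triples simply becomes another key on the object, so new data flows through without the consumer being rewritten.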
This has this open world nature of bringing in related content as well as content that's been populated into a triple store, such as many of the stores that are being made available. And then the other part is to make this dynamic rather than static, so that if the data changes, the RDF is updated automatically, and you now have these tightly coupled graphics, interactive graphics. So that's the plan with the linked data and Field project. Now what I want to turn to, fairly briefly, is changing the means of conducting science. You know, it sounds scary. Whenever I say this to scientific audiences they're saying, what the hell are you talking about? Well, there are really two modes of conducting science. And these are very much ingrained in us. There's the deductive approach and the inductive approach. So: theory, through hypothesis, comparison to observation and confirmation; or observations, patterns, tentative hypothesis and theory. And that's good. We've conducted science that way for a very long time. The problem is that all those means of induction and deduction have been built into our information systems. And so I say, what about abduction? And I don't mean the criminal meaning. This is where semiotics comes in. So Charles Sanders Peirce formulated this idea of abductive reasoning. And it's how we used to do science. You would have a hunch. You sort of have some idea that there's something there you want to explore. But our current information systems guide you either inductively or deductively. They don't allow you to find things that you don't know are there. And it's hard to write the tools to be actually able to, you know, confirm hunches. And so abductive reasoning starts when you've got a set of unrelated facts. That's the worst possible scenario for an information system. But you're armed with an intuition that they're somehow connected. And that's a great job for visualization.
And the unlocking of this possibility of exploration through visualization. But to leverage open world, semantics and the Web. So a really open, open world. And this is where the information theory comes in, and semiotics. So semiotics is this nice encapsulation of the study of signs and the significations of those signs. And it's the superset of syntax, semantics and, as soon as you really start to care about it, pragmatics. Right? You can see the definitions of them, and the slides will be available. But really let's just look at something more concrete. So you have a sign. It has a signifier and something that becomes signified. When you group them together you've got a code. And those codes are executed according to paradigms. And the syntax is important. So just for my Schenectady [colleague]. >>: [inaudible]. >> Peter Fox: It's a little hard to see. So this is 87 going north near the town of Troy, and this is 87 north Montreal, 7 east, Troy to [inaudible] and then there's another thing here that says Schenectady in [inaudible]. This is a semiotic system. This is a set of signs, a system of signs with syntax, combined in a way that conveys meaning, structure, and use. All right? So if you want to go to Montreal, you know you stay on the road. If you want to go to Schenectady, you take that exit. But how do you know how to use the sign? Anybody? Intuition is one. Usually experience. You see someone else use it first. Okay. So this is great, but this is completely an analog system of signs. And one of the big problems we have is that we've tended to sort of think in this analog representation of visual objects instead of in a digital one. And so this is where the semantics of portrayal really come into it.
So we're talking about a digital world, and not just annotation and hard coding; I'm talking about declarative relations between elements of these signs and what they mean. And this goes beyond the separation of content from presentation. So we have means for content semantics, but we also need means for pragmatics, the way in which we actually bring these visualizations together, which is actually manifest when a person sits and writes sets of scripts in Python and changes them around. That's all about use. And we are in the process of coming up with vocabularies for portrayal that really take into account all these different factors, and in addition capture visualization provenance appropriately: the order of what happened, why line choices were made, why colors were chosen, why representations were chosen, and their relation to each other. So that's where semiotics comes in. And how am I doing for time? Lee? >>: [inaudible]. >> Peter Fox: Five. Okay. So I want to give you an example, then turn finally to this example for science visualization. The big problem is that as we started to implement a lot of these more advanced visualization capabilities, more advanced semantics, mashing up data, all of a sudden we're running into the problems that people really care about. Things like data and information quality, data and information uncertainty, particularly bias, and the need to have evidence for when these things start to occur in visualization. So let me give you an example. This is my current favorite. These are a set of four correlations of NASA satellite data. You don't really need to know very much about what they are. So: longitude, latitude, correlation. Red is 1, purple is minus 1, green is zero. And so you can see in this plot two quantities that you would think would be the same. That's what's being chosen here. So this one is choosing the cloud top pressure from the same satellite, measured by two different instruments.
And this one is measuring from the same instrument on two different satellites. And you'll notice these artifacts in these graphics. No annotations, no indication that anything's wrong. So any guesses? International date line. Does the satellite know about the date line? Does the atmosphere know about the date line? No. Similarly, if you know orbital tracks, this is centered around the date line with the shape of the orbit. So there's an explanation for this that's actually a straightforward one. But, you know, if you look at this, you just throw your hands up and walk away and don't tell anyone. So the explanation for this is a combination of how the day is defined, the fact that one satellite is descending while the other is ascending, and that the times they cross the equator are different. So here's that other plot. This is the explanation. This is the difference in time between when those two measurements think they're the same. All right? So blue means there's no time difference. You're comparing reasonably. Red means there's 22 hours difference. So they're almost a day apart. You wouldn't expect them to correspond, all right? So my problem is when we say, under known issues, the difference of equatorial crossing time and daytime node, modulated by the day-to-day definition, causes the included overpass time difference which introduces the artifact. Why are we saying this in words? We get so wrapped up in creating these really authentic representations of the data and forget to include a representation of all the other things that are important. So, you know, why isn't this overlaid with something that's false-color red? Or a big cross on it? You know, we just don't do things like that. And we've got to be able to do it. We've got to be able to understand where to put it and what it should look like and what additional information to include.
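The artifact has a simple arithmetic core: each sun-synchronous satellite crosses the equator at a roughly fixed local solar time, data are binned by calendar day, and near the date line two samples carrying the same date label can be nearly a day apart. The sketch below assumes crossing times of 13:30 and 10:30 local purely for illustration; the actual instruments' times differ, so the specific numbers are not the ones behind the plot in the talk.

```python
def utc_overpass_hour(lon_deg, local_hour):
    """UTC hour at which a satellite with the given local
    equator-crossing time passes longitude lon_deg.
    Local solar time leads UTC by lon/15 hours."""
    return (local_hour - lon_deg / 15.0) % 24.0

def same_day_gap(lon_deg, local_a=13.5, local_b=10.5):
    """Hours separating the two satellites' overpasses when both
    observations are binned into the same UTC calendar day at this
    longitude. Near the date line one overpass wraps past UTC
    midnight, so the gap jumps from a few hours to almost a day."""
    a = utc_overpass_hour(lon_deg, local_a)
    b = utc_overpass_hour(lon_deg, local_b)
    return abs(a - b)
```

With these assumed times, the gap at the Greenwich meridian is the expected 3 hours, but at 170 degrees east it comes out to 21 hours: same date label, almost a day apart, which is exactly the red band around the date line in the difference plot.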
So what I want to do then is really push on this idea of an abductive information system. So what would application tools look like that let you explore your hunches? And the real idea is to allow for abduction before you go on with the more detailed analysis of either induction or deduction. And I'm pushing this idea of open world and integrative information as the way to go. But there are lots of things you have to take into account. I'm running a little short on time. So I've got two last slides to be speculative. But I want to go back to big data and the need to turn things like -- and this is where the artists, you know, just open your eyes. They say, why do you have a wall? It's just a bigger screen. Why don't you have an exhibit that you can look through, walk around -- and this is not -- we don't have one, but a digital exhibit of all different and related and integrative facts around a particular topic? Why don't we have that? It's actually pretty easy to do. We're going to implement one. And I don't mean immersion but experience. And we are really not taking advantage of synesthesia. So with synesthesia you taste a sound, you see a noise. It's the intermixing of senses. And we haven't taken advantage of the multimedia. We tend to use sound for sound and visualization for visualizing but we don't mix them together. And we need to be able to do it rapidly. Second last slide. We need to do this at scale. Scale meaning from when you develop it all the way up to large spaces. Because when you involve the human in it, that's where you can start to explore things. You can look at them from different angles. There's a perspective aspect to doing things at scale that you just don't get sitting in front of a flat screen. Stereo.
I haven't heard this mentioned, but there's this idea that most of what we're looking at in the real world is actually multidimensional, and we collapse it back down into two dimensions and then we teach the computer to try and trick us into thinking that it's multidimensional. Why do we do that? We have to sort of think through that. And so the goal is to restore this idea of abductive reasoning, and that's going to change how science is done, both for specialists and non-specialists. And this has to be an informatics approach. You know, there's a tendency for the techies to get in and try and bang together some cool tools, but it has to be integrative. It's cognitive science, social science, library science, computer science, the science itself and the engineers. So there has to be collaboration. And my view is we're certainly ready to play. And if you'd like some more information, that's my e-mail, our website, the OpenEnded Group, really worth checking out, the link to the open government data and EMPAC. Thank you very much for listening. >>: Thank you, Peter. [applause]. >>: Very thought provoking talk. Are there questions for Peter? >>: When you said in one of your examples not an immersive environment but nonetheless a multidimensional environment, I'm not quite sure what you mean by that. >> Peter Fox: So immersive environments tend to be directly related to an individual getting immersed in something, in an experience, whereas in an exhibit style there's interaction with other people looking at the things at the same time they are. So immersion is really intended to direct your view. Exhibit is meant to broaden your view. And so it lets you see things that you weren't necessarily intending to see, whereas immersion really is intended to go in that other direction. A little bit of a generalization, but that's the way we're looking at it. Uh-oh. >>: [inaudible]. >> Peter Fox: Yeah, here. Job safety. >>: No, no.
I've been thinking a lot about linked data, and using RDF sort of implies you need some sort of ontologies for the words you're putting in there. So are you defining ontologies for each of these government datasets? How much effort does it take to go from a dataset and raw data to make an RDF version? Isn't that the critical thing? Isn't that what you've got to automate? >> Peter Fox: It is. And so actually, because of experience, we started all the way from literally just translating comma-separated values, with headers in there with meanings on each of the columns, into RDF representations -- so basically translating the schema, whatever the schema was, and just reproducing the names of the schema. And then looking at the metadata to establish minimal relationships between the class concepts or property concepts. As you can tell, that only gets you so far. Now, if Jim Hendler was here, he would just keep saying lightweight, lightweight, lightweight -- you know, don't go overboard, don't overdo the semantics. And I would agree. And the reason is that one thing we've done is we've tended to be too rigid in how we define the knowledge relationships. And that goes against what I said about induction and deduction. You want to put in as few relationships as you can and be able to explore. Especially for large data. Once you start to build up the knowledge, then, yes, ontologies can come in, you can get richer integration, you can do way more things. So there's a very broad spectrum we have to be able to tolerate. And the good news is the tools on that low end of the spectrum are really getting pretty good now. >>: So you don't have to start by defining an ontology? >> Peter Fox: No, don't do that. Definitely not. >>: You don't have to rewrite the schemas of the comma-separated values or the relational database in terms of vocabularies? >>: [inaudible].
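The lightweight CSV-to-RDF translation Fox describes -- column headers become predicates, each row becomes a subject, no ontology up front -- can be sketched directly. The namespace URI and the sample columns below are invented for illustration; the real data.gov converters handle much more (typing, linking, metadata).

```python
import csv
import io

BASE = "http://example.org/ds1/"  # hypothetical namespace for this dataset

def csv_to_triples(text):
    """Translate a CSV with a header row into subject/predicate/object
    triples: row i becomes subject BASE+rowi, and each header becomes
    a predicate BASE+header -- just reproducing the schema's names."""
    reader = csv.DictReader(io.StringIO(text))
    triples = []
    for i, row in enumerate(reader):
        subject = f"{BASE}row{i}"
        for header, value in row.items():
            triples.append((subject, BASE + header, value))
    return triples

sample = "agency,year,budget\nEPA,2010,10300\nNASA,2010,18700\n"
triples = csv_to_triples(sample)
```

Richer relationships between the resulting properties -- the "minimal relationships between class concepts or property concepts" step -- would then be layered on afterwards, only as needed.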
>>: You do not have to rewrite the ontologies, but you do have to rewrite the schemas in terms of the RDF vocabularies, right, Dublin Core and [inaudible] and what have you? >> Peter Fox: If at all possible, that gives you lots of leverage. The reason why this point is important is that occasionally people develop schemas to structurally store their data that are actually completely logically inconsistent. And you probably don't want to translate something that's logically inconsistent into another form, because it's still going to be logically inconsistent. So now, back to this -- we have tools for doing very rapid automation of converting CSV into RDF. Those tools are all on the site. We would love you to try them out, so that you can start to investigate exactly these things, run some queries, see if it makes sense. It's up to him, not me. There's not a hand up there, but I'm going to presume that -- >>: My question is along the vein of what she was talking about. Are there going to be efforts -- like at the W3C, are there groups that are working on creating the namespaces and registering them and maintaining them? Is there going to be -- >> Peter Fox: Yeah -- >>: -- cooperation as the project goes forward? >> Peter Fox: Yes. So the W3C is coordinating namespaces, in particular URIs. There's a group in the government coordinating namespace naming at the moment. I encourage everyone in the community, if you're interested in this, to get involved and have a say. Because it actually is a fairly formative time, because we want these URIs, these first-class objects, to be there and continue to be there. And getting the naming as right as possible is actually pretty important. >>: So I'd like now to introduce my [inaudible]. Lin is a senior software developer and test lead on Silverlight, here at Microsoft.
Having spent several years building a variety of data visualization tools within the Live Labs group at Microsoft, the PivotViewer project she leads has recently graduated to become part of the developer division's Silverlight software development kit. Before joining Microsoft, she worked as a developer with IBM's supercomputing division. She holds a BA from Harvard University in biomedical engineering. And I'll turn it over to Lin to tell us more. >> Jennifer Lin: Thanks. Thanks so much, Lee. This is great. All right. It's really fabulous to follow Professor Fox, because so many of the motivations for the research and development that we've done for PivotViewer follow right in line with what he was speaking about: basically being able to translate data from something that's just abstract and unformed into something that's more like knowledge that can be acted upon. So what I'm going to be introducing to you is the PivotViewer control. And let me be totally up front that none of this data is mine. Basically I have some colleagues who have been kind enough to share some fabulous datasets that really highlight the ways you can use PivotViewer with scientific data. All right. And this is a quick screen shot of PivotViewer in its glorious reflective form, and our first dataset. This data is provided by Professor Ilius Sazlaski [phonetic] of UCSD. He is an environmental informatics researcher and has been working with numerous different organizations. This particular dataset is from Conservation International. They're working with the Bill and Melinda Gates Foundation to investigate climate impact on various species. So what we're seeing here is output from an array of camera traps around Tanzania, and they're motion-detector triggered. Basically, data around each of these animal sightings is collated and gathered to look at different trends in what we're seeing around that sensitive part of the world. A quick tour of PivotViewer. 
We have a filter pane here which -- hold on. This is a little bit off screen, unfortunately. There we go. Let's go back; I'm trying not to touch that too much. All right. So continuing our tour. This is the [inaudible]. It contains metadata around the visual representation you see here, and each photograph represents one sighting of one animal on one occurrence. And when you're zoomed in, you can get some metadata around the latitude and longitude of the camera in the array, the time, and information about the animal sighted. They broke down their data according to class; so basically they have various different classes, each of these buttons here. I'll just stick to the first one. What's nice about PivotViewer is that it will re-render your data according to the facets that are provided in the metadata. So the researchers working from Conservation International found some interesting trends in their data. If you sort according to, for example, the temperature at the time of the spotting, there's a pretty reasonable bell curve of what temperatures animals are spotted at. But if you actually get to the genus level -- of species; here we go, it's even better with species. If you look at the granularity of different species, you find that they have distinct distributions according to temperature. So basically the researchers were able to look at this data and say, borrowing from Professor Fox, I have a hunch. You know, there must be something to this. Perhaps the comfort zone of these animals will center around certain very precise temperatures. The fact that it was very precise and specific was surprising, and the fact that it differed between species was surprising. And [inaudible] that will be published in an article in Nature in the future. Similar exercise, but also quite interesting: if you look at each species and the moon phase when they were sighted, there's another fairly strong and surprising correlation. 
For example, this is a relatively even distribution, but still clustering around the end of the moon phase cycle. This one is kind of bimodal. And this one probably would be [inaudible] if you had more data points. So hopefully that gives everyone an introduction to the control and how it can be used when you're looking at a dataset that wasn't in any way, you know, prepped for a specific visualization. Let's talk a little bit more about PivotViewer. Okay. So PivotViewer is hosted inside of Silverlight, which is in turn hosted inside of a webpage or a scripting engine. So all of these visualizations live in the cloud. Some early attempts at similar visualizations were done by the team using WPF and a client app that required installation, and the major feedback we got was: this is not platform agnostic, you know, let's get this into a Web form. All right. Now I'm going to move on to an example that comes from the genetics field. This data is courtesy of my friend Beatrice Dias Acosta, who is in the back of the room. So thanks, Bea. I'm just going to give you some background in genetics, because we're all from many different backgrounds. The human genome has 23 pairs of chromosomes and contains three billion base pairs and about 25,000 distinct genes. To get a sense of the scale between a chromosome and a gene -- or actually a nucleotide, which is on the far left -- right -- left, left for you -- you have to do many, many, many levels of zooming to get from something as, you know, microbiologically large as a chromosome down to the actual base pairs, the A, G, T, and Cs. What's very difficult about this problem, when you're looking at it from a data visualization perspective, is that there's a lot of noise. The signal-to-noise ratio is problematic, and it's one dimensional. So basically you want to be zooming through what could essentially be garbage to a geneticist until you actually get to the jewels of the genes hidden inside. 
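One simple signal often used for finding "the jewels" in such a one-dimensional, noisy base-pair string is local G+C content, which tends to correlate with gene density. A minimal sliding-window sketch -- the window size, threshold, and sequence here are arbitrary illustrations, not anything from the talk:

```python
def gc_content_windows(seq, window=12):
    """Fraction of G/C bases in each sliding window over a base-pair
    string -- a crude signal for flagging gene-likely regions."""
    seq = seq.upper()
    return [
        sum(base in "GC" for base in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]

def hotspots(seq, window=12, threshold=0.6):
    """Start positions whose window meets the (arbitrary) GC threshold."""
    return [i for i, gc in enumerate(gc_content_windows(seq, window))
            if gc >= threshold]

# A toy sequence with a GC-rich stretch buried in AT-rich "noise."
seq = "ATATATATATATGCGCGGCGCCGGATATATATATAT"
print(hotspots(seq))
```

The zooming interfaces described next can be thought of as interactive versions of exactly this kind of scan: pan around, find the high-G hotspots, and drill in.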
So we have an approach that represents the genes using a trade card -- that's what we call the block on the left. It has some information about the name, the location within the chromosome, a description, and a protein sequence -- or a series of protein sequences. So each of those colorful blocks is an amino acid, which is a trio of base pairs: basically interchangeable as far as the information they contain, but giving a more concise visualization. And I just want to give you a sense of the different experiences, from Pivot through another MSR-based tool called Genozoom, to the much more common -- or more commonly used -- UCSC genome browser. Okay. So here we have a representation of the human genome. Each of these items represents one chromosome. So basically you have the numbered chromosomes plus the X and the Y; the sizes give the relative scale between the different chromosomes, and the colors represent the density of genes found in each of them. So I want to show you a few interesting things about chromosome number two. One quick way to drill down into this data is to look at it according to the starting base pair, which is a nice analog to its position within the chromosome. And once you're in this experience you can just do the zooming, panning exploration to see what kind of patterns might emerge -- to look for that kind of intuitive hypothesis discovery. I've done this before, so I know what I'm looking for, but I'm not finding it yet. Okay. So I want to go at it backwards, because it's a little hard to see on this monitor. Okay. So let's say that I am interested in looking at collagens for my research. All right. So in chromosome two, here are some examples of genes that have collagen in their metadata. One thing kind of strikes me about this particular gene, and that is that the sequence of Gs is very diagonal. 
It seems to be following a pretty set pattern. And what's interesting is, you know, just playing around, it looks like the next one has a similar pattern. Perhaps -- oh, yes. And here, this one even has some kind of down in here. So perhaps there's something to this. So out of curiosity, I want to test my hypothesis against this dataset, just to see: is there something about striping patterns and collagen? I tend to like chromosome 9; let's check this one out too. All right. So again you see this pattern of -- in our representation -- stripy white regions. And, in fact, when Beatrice went through this exercise herself, she talked to a geneticist to ask why collagen keeps showing up in this way, and the response was: well, it's a structural protein, so of course there would be some sort of repeating element to it. That made sense and jibed from a genomic perspective. So compared to other visualization tools, just being able to search through metadata is a convenient feature; and the way this data is set up, you can also search for base pair sequences. So let's say I wanted to find some sort of repeating G sequence. Okay. So that's pretty common. Maybe look for an even longer repeating G sequence. And then you can do a certain narrowing down to say, okay, these are basically representing similar characteristics -- can I find something that's also common between these genes? Okay. I wanted to give some context around similar tools that are used for this kind of application. I'll start with the UCSC genome browser. So basically this is based on the same data, the same database, as we were just looking at. But the experiences are widely different. So in order to zoom in -- I didn't mean to zoom in that far -- basically there are many clicks required. So say 10X, 10X, okay. Let's keep going in. Keep going in. Maybe somewhere along this line if I click at something here. 
And then finally I come to some metadata about that region of the gene that I was looking at. I just wanted to give you a representation of what people are using today and how much you lose the context as you're clicking through items. When we were looking at the Pivot genome browser, you could see things laid out according to the metadata you've provided. Here it's just click, click, click -- discrete interactions, much less pleasant. And the user often has to wait, because it takes time to load Web pages. And I also want to show you Genozoom, which is another MSR product. What Genozoom tried to do was to give the same kind of contextual fidelity as we had with Pivot, but with an experience more tuned to the one-dimensional nature of a gene -- basically, you know, chromosomes are just one very, very long string, so how do we take advantage of that fact? So this is an E. coli chromosome. I can pan around a little bit, zoom, and this slice represents what you see in this view. And this view can be moved around to look for something in particular. And down here this is giving you the level of the base pairs we were talking about before. What's nice about this view is we have the percentage of Gs in that area of the genetic material, and that tends to be a marker for a higher likelihood of finding a gene. So, you know, maybe it would be worth just panning around and saying, okay, where are these hotspots with the high G quotient, and investigating whether or not there's opportunity for investigation. So this is giving you a little bit of contextual information about the state of genomic data visualization. Now I want to talk a little about PivotViewer itself. There's an API that provides two-way interaction between the control and your Silverlight application. 
So basically you can learn about the user interaction with the control, and you can also provide data to the control. I will say this: in the version that's currently publicly available, everything is static based. So all the data I was showing in the control earlier is represented in a CXML file, or is built just in time on the server and provided that way. So basically everything is either server intensive or static. What we're going to be doing for the next release, Silverlight 5, is providing an API that makes the entire experience programmatically drivable. So you can add items, remove items, change properties on the fly. This will hopefully lower the barrier to entry for producing experiences in PivotViewer, because we've found that just static data doesn't cut it with the services we're trying to work with. This is a little bit about the collections. I apologize if I went through the demos too quickly and made too many assumptions, so let me just walk through what constitutes a PivotViewer collection. The zooming is made possible by using the Deep Zoom format for the imagery; that's a Silverlight-specific platform piece. The first example I showed you is a simple collection: there are no connections between different datasets, it's just a single dataset. In the second one I showed you, if you remember, there's a chromosome representation, and when you click on it, it brings you to the detailed gene representation. So this is what we call a linked collection. There are lots of possibilities to intermingle datasets using either, but I'm just giving you a frame of reference for how we think about this. And then I just wanted to give you some links for resources to take advantage of PivotViewer now. And I'm sure I'm over time because I had my little snafu with the projector, but I would love to hear questions if anyone has some. >>: Questions for Jen? First a round of applause. [applause]. 
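The collection model described above -- items carrying metadata facets that drive filtering and re-rendering -- can be illustrated without any Silverlight at all. This hypothetical sketch uses made-up camera-trap-style items, not the CXML format itself:

```python
from collections import Counter

# Hypothetical items: each carries metadata facets, as an item in a
# PivotViewer collection does.
items = [
    {"name": "sighting-1", "species": "impala", "temp_c": 24, "moon": "full"},
    {"name": "sighting-2", "species": "impala", "temp_c": 26, "moon": "new"},
    {"name": "sighting-3", "species": "lion",   "temp_c": 31, "moon": "full"},
    {"name": "sighting-4", "species": "lion",   "temp_c": 30, "moon": "full"},
]

def filter_items(items, **facets):
    """Keep items matching every selected facet value (the filter pane)."""
    return [it for it in items if all(it.get(k) == v for k, v in facets.items())]

def facet_histogram(items, facet):
    """Count items per facet value (the re-rendered, grouped view)."""
    return Counter(it[facet] for it in items)

print(filter_items(items, species="lion"))
print(facet_histogram(items, "moon"))
```

Filtering then re-grouping by a different facet is exactly the interaction shown earlier with temperature and moon phase; a dynamic API such as the one planned for Silverlight 5 would let these item lists change on the fly.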
>>: We love Silverlight PivotViewer, so it's great. Great tool. >> Jennifer Lin: I can't take all the credit. There are some people back there who share it too. >>: But we wanted to ask, in terms of -- it's amazing to see that there was such a change of magnitude from the chromosomes all the way down to the genes, the nucleic acids. I was wondering, for an example like a query of a virus genome versus the database, how would it look to do those kinds of queries -- you know, rather than just looking for the repeat patterns, more sophisticated searches? Has anything been discovered through this visualization, or -- what are the advantages when you do that type of work? >> Jennifer Lin: Yeah. I definitely think that once we have the dynamic capabilities it will be fairly seamless to take the metadata in richer sources, integrate it into the experience, and do queries that dynamically, on the fly, provide results. The limitation of this example is that it was statically collated. So basically there was a richness to the data, but it was a static richness. And I think that once we have our next release it will be much easier to provide experiences that can query a database for something specific and come back with, like, here's collagen across different species, here's what they could look like, things like that. So I guess I was looking for examples that showed some serendipity, like here are things you can explore and see almost as an amateur looking at these sciences. But the experience should be richer with an expert eye. >>: Do you happen to know what happened to the getpivot.com site? It's gone. That URL doesn't work. And there were all these terrific collections out there, like the Sports Illustrated and the classic dog breeds and the cars and all those, and they're just gone. And a lot of our demos are broken as a result. >> Jennifer Lin: Okay. I do apologize for the transition. 
This was developed under Live Labs, and then we moved to Silverlight, and in all the enthusiasm some of the Web resources were -- not lost, I won't say lost, but they were moved. I believe that all the collections still exist, but the link is long. And I believe if you go to the download page -- hey, Angela, I'm sorry to bother you during the presentation. If you look at the download page, is there a link to the collections from there at this point? >>: No, but there is the Microsoft.com//silverlight -- I can get that URL. >> Jennifer Lin: I'm sorry. Yeah. >>: We can get the URL to you. >> Jennifer Lin: Yeah. I will get you the URL. That's totally fine. Yeah. Unfortunately with the transition some things got jumbled. But it's definitely all still there -- the collateral is still there. And I apologize about the demos getting broken. >>: I don't know. >> Jennifer Lin: Hello, professor. >>: Thank you. I don't know if you can do this, but do you have any instrumentation capabilities in PivotViewer to trace usability patterns -- to see how people browse and select, and where they pause and where they don't? >> Jennifer Lin: Yeah. >>: And do you have stats on that? >> Jennifer Lin: Yeah. We've done usability studies in house, so we do have some data about how people interact with it. If there's a specific site that has data and you want to see the usage of that site and that data, there are events for when items are clicked, when you filter, when the filter state changes, and basically when the collection view changes. So you should actually be able to do some home-grown infographics or information about that. There's nothing I can think of that's more global that we've produced, but I will take that as a future request for maybe some simple code in the future. >>: Okay. Thanks. >> Jennifer Lin: All right. Thanks. Thanks, everyone. >>: Thanks very much. [applause]. >>: I'd like to introduce Jeff Falgout. 
He's the senior systems administrator for the US Geological Survey Center for Biological Informatics, CBI, in Denver, Colorado. He manages [inaudible] infrastructure that supports five distinct programs and over 100 websites. He's responsible for daily infrastructure operations, including the configuration and support of servers and applications in compliance with government security regulations. In addition, Jeff leads long-term infrastructure planning activities that support the needs of a diverse USGS bioinformatics community. He has over nine years' experience working in government. Prior to joining the USGS in 2007 he worked as a system administrator for both the Bureau of Land Management and Jefferson County Colorado Information and Technical Operations. He holds a bachelor of science in biology from Northwestern State University in Louisiana. >> Jeff Falgout: Okay. Thank you, guys. I am the last thing standing between you and going home, so this is a pretty precarious situation, I guess. But thanks for having me, and I appreciate you guys being here. So, a quick outline of what we're here to talk about: a quick organization background -- most people aren't aware that USGS also handles biological data, ecological data; we're not just earthquakes and water -- then of course biological informatics challenges. We are living some of the challenges Peter was talking about a few minutes ago, with over 100 years of data. And then some of the things we're trying to do to solve these challenges. We're headquartered in Reston, Virginia, outside of DC, responsible for informatics activities within the USGS. We support a diverse group of programs, including the National Biological Information Infrastructure, which I'm sure a lot of you have heard of if you're involved in biology. 
The GAP Analysis Program, which is an effort to keep common species common; in other words, for the species that aren't on any sort of management plan, we want to keep them off management plans like those for endangered species. We do some cooperative work with the National Park Service to map the vegetation characterizations within each of their parks. Some of the other programs we support include the Integrated Taxonomic Information System, which is one of the taxonomic authorities; it's in partnership with the Smithsonian Institution to actually be one of the top authorities for genus-species lookups. We have international partners and global partners, along with DataONE, which is an up-and-coming effort and a pretty big deal for us. We're 60 people across the country. For those who aren't familiar with informatics -- of course this is borrowed from the ecoinformatics world -- we are the intersection between the modeling, analysis, and synthesis of ecological data, the raw ecology or biology science, and information technology. And I'm looking at this from the information side. As for the tasks within biological informatics, I'm not going to read through this, but it gives you an idea that there are so many different parts to bioinformatics. What is the National Biological Information Infrastructure? Marketing stuff. Contributors and users: we go all the way from federal governments and international governments down to private citizens sometimes. We run the gamut. And of course that leaves disparate datasets, as you can imagine. We are a distributed network. Some nodes are based on regions, geographically; some are national themes -- fish, birds. And then we do have some infrastructure nodes, including places that provide computing infrastructure for us. One of our big ones is the Oak Ridge National Lab in Oak Ridge, Tennessee. Some example projects you can see -- we have a ton of stuff that comes down our pipe. I'll get into that. 
Here's a diagram of what we try to do. As you can see, we have a lot of data holdings -- not necessarily holdings within our infrastructure, but with remote partners. We try to provide some data access and geospatial services, and then some visualization services on top of this. If you look at other visualizations, they would be inserted into this between the top two layers there. And I can show some modeling results towards the end which aren't our work, but some stuff from DataONE. We're trying to move more towards expanding that distributed services model. Of course we all have challenges, and ours are not small. Metadata, metadata, metadata: everybody for the most part throughout the day has been talking about metadata, and we really, really, really rely on metadata. We do a lot of work there. We have a clearinghouse with approximately 100,000 records in there pointing to datasets. The tools you see there -- sometimes there are way too many, and deciding on them is a challenge; you end up going down a path and have to rerun that path because you come to a point where a certain tool just didn't quite work out for you, or the tools aren't adequate. So sometimes it's too many, sometimes not enough. The culture issues: biologists are known for not wanting to do anything outside the field. So we have to change that culture. And that's becoming a real big challenge, and it's bigger than we ever thought it would be, simply because they're not focused on dealing with data and data management; they are out there doing research, and money's tight, and they want to get that money for the research and the data collection and not necessarily deal with data management. So we have to change the culture, and part of that is providing tools for them. Standards -- of course everybody relies on standards. Infrastructure. And lots of other stuff. And then data silos: sometimes these silos are made of titanium and you can't get through them. 
So we are trying to bust through some of those. And then of course our nemesis is large numbers of small datasets, because each one of these datasets has its own schema. Sometimes they're in an Excel spreadsheet with three different sheets in it; sometimes they're a little bit bigger than that, but they're as unique as personalities. So we struggle with that a lot. Some examples of the projects and the ways we're trying to address these challenges: species mash-ups. I think of mashed potatoes every time I see this, but essentially that's the point -- we're taking information from different data sources and trying to put it all into one view. Here you can see lookups on -- I can't even remember what this is; I think it's a bullfrog or something -- and then drilling down to information on that taxonomy or that species. Some of this stuff is pulled from our data holdings within NBII -- CBI -- and some is pulled from other places like GBIF. Species of concern by state: each state typically has to file a state wildlife action plan for species of concern. And what we found is that as we visualized and presented what states were doing for their wildlife action plans, other states didn't realize that the state next to them was doing the same thing. So we kind of exposed that to them, and the states adjusted what they were doing. Because, you know, animals don't pay attention to political boundaries, so sometimes those action plans overlapped each other. And of course we can find even more information in what's called GBIF, the Global Biodiversity Information Facility. That's a framework that has over 200 million specimen records. It's a massive data store, so we certainly don't want to take on the task of holding that information, but we do link out to the Web services so we can direct you to the information. And these are specimen records from museums, not necessarily observations in the field. 
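Linking out to GBIF's occurrence Web services, as described above, amounts to building a parameterized query against its public API. This sketch only constructs the request URL rather than calling it; the endpoint path and parameter names are from memory and should be verified against GBIF's API documentation:

```python
from urllib.parse import urlencode

def gbif_occurrence_url(scientific_name, limit=20,
                        base="https://api.gbif.org/v1/occurrence/search"):
    """Build a GBIF occurrence-search URL. The endpoint and parameter
    names here are assumptions -- check GBIF's API docs before relying
    on them."""
    return base + "?" + urlencode({"scientificName": scientific_name,
                                   "limit": limit})

# Hypothetical lookup for the American bullfrog.
url = gbif_occurrence_url("Lithobates catesbeianus")
print(url)
```

A mash-up page would fetch that URL, parse the JSON response, and merge the records with local holdings; keeping the 200-million-record store remote, as Falgout says, is exactly what this kind of service call makes possible.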
Going further into these species mash-ups, you can see the ecology of the animal and then what other datasets are referenced to it. I mentioned the clearinghouse before: this is that metadata information that we're pretty hot on, and it's as a result of decent metadata records that we can find some of those datasets. In addition to what we showed here, we're working on pictures for this stuff, and you can see the Google results also. Another big deal for us is the Ocean Biogeographic Information System, and that's trying to get views into datasets from ocean -- or marine -- work. We have information from the west coast and the east coast. Just as an example, we have data going back to Woods Hole from 1903 to 1909. We have just large numbers of datasets. So what this is trying to do is dissect the data so it's easy to wrap your head around, instead of just looking at spreadsheets or links to data; this gives you an idea of where things are. Another view of that -- and then you can drill down to taxon, or -- yeah, you can just drill down to species if you need to. And this gives you a geographical view of data observations for a single species. So you can see there's a lot of information there, and graphically represented seems to be the most efficient way to do it. We run something called EKey, too. What this does is give people an easy way to identify a fish they found or a fish they've caught -- sometimes they look the same. And then once you identify it, that opens the world for further research on what you've got, and you can find links to other datasets. A geographically based representation: this one happens to deal with exotic versus native species, the number of observations, and the location on a map. Again, graphically representing information seems to make a lot more sense to a lot of people, and especially to decision makers who don't have time to dig into this stuff. This really comes in handy. 
And of course we don't forget our metadata people, because they are critical to us. So what we've done is create visualization tools for the metadata QA/QC process. You can see density maps on the top left and top right: you can see where our metadata record providers are, and the data providers also. There's a Many Eyes reference to what our clearinghouse holds. We also have a dashboard for the metadata QA/QC people. They can see broken links, they can see what searches are popular at a certain time, they can see missing fields. That's really increased our QA/QC capability -- making sure that a metadata record is better than it was when it came in. They can go back to the initial submitter and say, you need to provide this information for me before we put it out there, because nothing's more frustrating than an incomplete metadata record. And of course the other advantage of the clearinghouse is that once you've done that search for a metadata record, you can go off and find all kinds of information, linked back, of course, to GBIF, the Global Biodiversity Information Facility. We've also looked at revamping the way data mining is done. We've decided to bring some visualization into it, as far as geography and pictures, and also to allow dynamic refining of results. If you see what are called clusters up there, we integrate a thesaurus Web service in there to help you: say you searched on ecology -- did you want ecosystems also? So that provides some suggestions and gives you the capability to refine. A geographic search is also sometimes very helpful. A different view into that: you can also drill down into who is providing the data, who has published it, and then of course link off to more records and more information. Of course, at USGS everything is a map. And what this does is give you a visual -- a map representation -- of the species information we have. 
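The QA/QC dashboard's checks -- missing fields, broken or suspect links -- boil down to simple record validation. A minimal sketch; the required-field names below are illustrative, not the clearinghouse's actual schema:

```python
# Illustrative required fields, not the clearinghouse's real schema.
REQUIRED_FIELDS = ["title", "abstract", "originator", "online_linkage"]

def qa_report(record):
    """Return the problems a QA/QC reviewer would bounce back to the
    initial submitter: empty required fields and malformed links."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field, "").strip():
            problems.append(f"missing field: {field}")
    link = record.get("online_linkage", "")
    if link and not link.startswith(("http://", "https://", "ftp://")):
        problems.append(f"suspect link: {link!r}")
    return problems

record = {"title": "Camera-trap sightings, Tanzania", "abstract": "",
          "originator": "CBI", "online_linkage": "htp://example.org/data"}
print(qa_report(record))
```

Aggregating these reports per provider is all the dashboard's density views need; an empty report means the record is ready to publish.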
So we've also gone in and looked at IBM's Many Eyes to see what we could do there. It's still considered a research project and not really for production use, but we've done several things with it, just to give you an idea. And these are some of the comments we have on visualization: we try to enable visualization more than actually produce it, but we can't get away from producing it ourselves. My next three slides are some work done at DataONE which actually shows the visualization process and the data integration process. eBird is a site that allows citizen scientists to submit individual bird sightings, and I believe right now they're collecting about a million records or more a month. So that gets the end user -- the citizen -- involved in the data collection process. And they've also taken land cover imagery, meteorological information -- boy, I can't say that -- and MODIS information. I believe the MODIS information they had to process was 200 terabytes alone. And they've taken the intersection of all that data, with the help of the TeraGrid, which is basically a cloud-based supercomputer -- a lot of that comes from ORNL and some of the other supercomputer providers -- and they've built this model. This model is a representation of a prediction of the distribution of the indigo bunting, and you can see the potential uses there. What they've done is try to ground-truth this model, and they've found that it's highly accurate in predicting where things should be. And this is important because now they can play with values: what happens when green-up occurs two weeks earlier, or two weeks later? What happens in the event of a drought? As far as the ground-truthing goes, you see the traditional distribution map that had been accepted previously, and this is the estimate based on that previous work, eBird, the supercomputer, MODIS, and the meteorology work. And you can see how accurate that is. 
And it's almost even more accurate, because it gives you the center of the distribution, or the highest density of the occurrences of the thrush. Now, as they run these models, the first thing you see is: why is that happening? Why is the bird showing up earlier in the year in central California than in the rest of the country? So of course with new discoveries come new questions. And they've gone and researched this, and they believe it's a change in agricultural land use in Central and South America that's allowing these birds to be closer to North America. So without that visualization, that data, how would you really spot that and make it apparent? Here it's obvious in two seconds. So that's pretty much all I've got, if you have any questions. [applause]. >>: Questions for Jeff? >>: The eBird project is part of the DataONE NSF project. Are you talking to them about -- they have a plan for looking after bio and ecological databases? >> Jeff Falgout: So eBird is actually a project of Cornell, but the principal investigator, Steve Cowling, is on the DataONE team. And Mike Frame is a principal investigator with the DataONE team; he's also on the leadership. So we are intimately involved in the DataONE project right now, uh-huh. >>: Any other questions? All right. Well, then I will hand it over to Roberta to wrap up. And thank you again very much, Jeff. [applause]. >> Roberta Shaffer: Well, I want to start with thank-yous. I want to thank all of the presenters today. I think it was an excellent day, and we all learned a lot. I'd like to thank our wonderful host and inspiration, Tony Hey. And of course what we would have done without Lee, I don't know. So we're very grateful. And then of course the ICSTI team: Brian and Tony and Herbert and Bernard and Elizabeth and everyone. So I really want to express our gratitude to everyone, and particularly all of you. 
I was quite impressed that people were up and attentive to this very moment, watching the birds. It's been a fantastic day. I think of what I know they're now calling techno-tourism, where you travel to different places but actually stay in one spot. And just to make a quick review for you: we have, in the short span of this day, been to the moon and beyond. We've been several hundred feet underground. We've been bird watching. We've been dissecting human beings. We've been in the operating theater. We've really been all over. We have been speaking many languages, and not only in the sense of foreign languages but in the sense of disciplinary languages. And we've been wowed, we've been inspired, and I have a sense, though I'm not quite sure, that we may have been abducted to Troy, New York. [laughter]. So now the time has come to travel home, and I want to wish you all very, very safe travels and say that I hope to see all of you again in the near term in Beijing. Before we actually leave, though, I'd like to open the floor for any closing comments. So anything you'd like to say about what you've seen, any announcements you'd like to quickly make about things that could connect people to what they've learned today, this is the opportunity, and I open the floor to you. I think Robert has something to share with us. So, Robert. >>: Is this being recorded too, or are we off the air? >>: Probably. [laughter]. >>: Well, let me thank you all, and all the other speakers too. I mean, I had honestly never heard of this organization before a few months ago, and it's really great to meet many of you. And I just made a connection this morning that many of you may not be aware of, and I wanted to share it with you. There is an organization called the Gordon Research Conferences. These are typically in chemistry and math and physics and biology. But there's a very interesting conference that a lot of people don't know about. 
And I've been going to it for maybe the last two cycles. It's an every-other-year conference, in the summer -- July, the 10th through the 15th. It alternates between Oxford in England and Rhode Island; this cycle around it's in Bryant, Rhode Island. Its title is the Gordon Research Conference on Visualization in Science and Education. It's quite a unique conference. It's not a huge number of people -- about 100 to 120 participants -- so it's very small. Great format. You have conference sessions in the morning, then the afternoon is all free to discuss and meet people. And then there's a happy hour and a poster session, and in the evening more talks. So it's just a wonderful week. And I would encourage you to check into it if that interests you at all, with a broad, wide variety of people: museum curators, graphic designers, chemists, biologists, all sorts of people interested in visualization. I just get the sense that there are some people in this room who might be interested in that. So thank you very much. >> Roberta Shaffer: Tony, I'm sorry. Oh, I'm sorry. >>: Just a little more housekeeping, too. We're going to be posting as much of this material as we can on the ICSTI.org website, so we encourage you to go back to that. And just a reminder to all the speakers: if you have files that you haven't sent already, send them to Lee -- or links, in case the files are too big or there were videos in some cases. >>: [inaudible]. >>: Okay. That's good. Find Lee today. And there's an agreement, I think, between Microsoft and the University of Washington where these kinds of proceedings can be streamed, and you can watch them and share them with your friends. And I know ICSTI will be putting out some communications about this, and [inaudible] organization will obviously be putting out some press releases about our launch of ScienceCinema. 
And Microsoft will be tweeting and putting out some blog posts of their own. So I encourage everyone to use social media to spread the word about this as much as possible. Thank you all for coming. >> Roberta Shaffer: Anyone else? Safe travels, and thank you again. [applause]