>>: Good morning, everybody. Welcome to the Third International Conference on AstroInformatics. I expect this will be a fun meeting, as these usually are, with a lot of good discussion, which is, of course, the most important part. First, let me introduce the co-organizers. They're Pepe Longo, sitting there; Dan Fay, over there having coffee; Ashish Mahabal, who will join us tomorrow; and, of course, our host, Yan Xu from Microsoft Research. I will yield to her to introduce the Microsoft Research logistics and our first speaker. She will chair the first session. Let me remind everybody, we have a Facebook page, which is intended to be an online forum for any questions, discussion and all of that that you don't do in person. And since we've yielded to the decline of western civilization, we have a Twitter account now. So we do that too. Feel free to interact, and at this point I'll just yield to Yan.

>> Yan Xu: So I'm Yan Xu. On behalf of Microsoft Research, a warm welcome to all of you. Unfortunately, both Ashish Mahabal and Robin Inaba, who have been working on the logistics, are not available today. But I think we've got pretty much everything under control. I'm not sure everybody's here; I feel like we're missing some of the people from the bus, but we'll figure it out. So anyway, I have been working with you, the astronomical community, for a few years, and I think some of you have heard me talk about Microsoft Research multiple times, probably, because I have one slide that I permanently keep in all my presentations. But today I don't have to do that, because I have the honor and the great pleasure of introducing someone who has the vision and the capacity to talk to you about Microsoft Research at a totally different level: our corporate vice president of Microsoft Research Redmond, Dr. Peter Lee.

>> Peter Lee: Thank you, Yan, and thank you all for coming to our campus for this conference. It's really a great honor to host all of you. What I was asked to do is to speak for just a few minutes, for ten minutes, about Microsoft Research and why we are interested in astroinformatics and in science in general. I thought I would do that, and if there are questions, I would be happy to take some of those.

So the first thing to explain is why on earth Microsoft invests in basic research. And it's a hard thing to explain. It's a hard thing to explain even within Microsoft, and sometimes even within Microsoft Research; our own researchers sometimes wonder, why do we do this? The best explanation I can give actually goes back to a lecture I used to give to freshmen when I was a professor at Carnegie Mellon University. In that lecture, I used this image, and this is the kind of image that you use to try to keep early-morning freshmen awake. This was a lecture on the classic computer science problem of what's called prefix reversal, sorting by prefix reversal. More popularly called, if you're 18 years old, the pancake flipping problem. If you're not familiar with this, there are many variations of the pancake flipping problem. But a simple form of it is to imagine you have a stack of pancakes, and imagine that they're flawed, that they're burned on one side, so you don't want to stack the pancakes showing the burned side up. So the burned side should always be down.
We have one operation, which is to stick a spatula anywhere you want within the stack of pancakes and then flip over all the pancakes that are on the spatula. So with that one operation, the challenge is how many pancake flips it takes to reverse the order of the pancakes and reach a final state where the pancakes are reversed but the burned sides are still down. That's a really classic problem, and it was the subject, actually, of a whole week's worth of lectures to freshmen at Carnegie Mellon. And it's quite difficult. There are actually now closed-form solutions for stacks of pancakes up to 13, but beyond 13 it's still an open problem, and it turns out to be quite interesting in terms of pedagogy for young computer science students. And, you know, with an image like this, you can keep them interested at least for the first 15 minutes of the lecture.

One thing that's remarkable here is that the seminal work on this problem was actually done by our very own Bill Gates. It was published in a joint paper with Christos Papadimitriou, who some of you may know at Berkeley and who we still collaborate with actively here at Microsoft Research, back when Microsoft was still a very small company, still in Albuquerque, New Mexico. And so the point here is that from the very beginning, from the very origins of Microsoft as a company and its founder, there was a deep belief in the value of basic research and in participation in the scholarly traditions of academic research and open publication. And in this case, of course, the problems were directly applicable to some scheduling and resource allocation issues in the early MS-DOS.

So from that start, I think there was always, I'm guessing, an ambition in Microsoft to be engaged in basic research. And, of course, it took some time, until about 1991. That was about the time that Microsoft finally exceeded the $1 billion mark in annual revenues. Still a small company, certainly by today's IT industry standards. But it took until then for a formal step to be taken to create a basic research lab. And so where you're sitting today is Building 99, which is sort of the mothership of a global basic research organization of about 850 Ph.D. researchers, about 300 of them housed here in this building.

Now, from that start, this laboratory does a really substantial amount of basic research, primarily in computer science and computer engineering, but we are also deeply engaged in and committed to advancing our understanding, and uncovering new truths, in all human endeavors of science and engineering, including mathematics, biology, chemistry and physics, as I'm showing here from a recent paper in Science from two months ago, and, of course, in astronomy, cosmology, and astroinformatics. It is not just our own desire to be involved directly in research in these areas; I think we're also very proud to be supporters of external research. And, in fact, while we in Microsoft Research have a fairly substantial team doing theoretical physics, particularly in the condensed matter and quantum domains, the experimental work is actually not done by Microsoft Research but is the direct outcome of our funding of external experimental physics research in the field. And so by providing both infrastructure and computing standards as well as direct support, we really seek to support the most important new frontiers in science and engineering.
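To make the pancake flipping problem described earlier a bit more concrete, here is a minimal sketch of the burnt-pancake variant. It uses a simple greedy strategy (bring the largest out-of-place pancake to the top, orient it burned side up, then flip it down into place), which produces a valid but generally non-minimal sequence of flips; the representation and function names are only illustrative, not anything from the original lecture or paper.

```python
def flip(stack, k):
    """Flip the top k pancakes: reverse their order and toggle their burned sides."""
    top = [(size, not burned_up) for size, burned_up in reversed(stack[:k])]
    return top + stack[k:]

def sort_burnt_pancakes(stack):
    """Greedy (non-optimal) sort: largest pancakes settle at the bottom, burned
    side down. Returns the final stack and the list of flip sizes used."""
    stack = list(stack)
    flips = []
    for target in range(len(stack), 0, -1):          # place pancake of this size
        pos = next(i for i, (size, _) in enumerate(stack) if size == target)
        if pos == target - 1 and not stack[pos][1]:
            continue                                  # already in place, burned side down
        if pos != 0:                                  # bring it to the top
            flips.append(pos + 1)
            stack = flip(stack, pos + 1)
        if not stack[0][1]:                           # make sure burned side is up on top
            flips.append(1)
            stack = flip(stack, 1)
        flips.append(target)                          # flip it down into position
        stack = flip(stack, target)
    return stack, flips

# Example: four pancakes, sizes 1..4, listed top to bottom; True = burned side up.
stack = [(3, False), (1, True), (4, True), (2, False)]
sorted_stack, flips = sort_burnt_pancakes(stack)
print(flips)          # sequence of prefix flips used
print(sorted_stack)   # [(1, False), (2, False), (3, False), (4, False)]
```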
Now, when I'm talking to senior business leaders in the company, sometimes, especially with people who are new (I don't have to do this with Steve Ballmer), there are executives who wonder: you're doing quantum physics? You're doing astronomy? You're doing medicine? Why? What is the purpose here? And for that, I draw a map of our research investments in the lab. This was something that Yan asked for specifically here, so bear with me.

If you imagine this blank screen being the space of all research activities in the lab, we can actually look at the dimensions of it. Along the X axis, we have research activities that span from short-term, managed-risk activities, research activities that are only sensible if we can assume we can get some type of answer relatively quickly, out to research activities that demand, or at least require, some patience. And then on the Y dimension, we have choices of problems. Near the origin, we have what we call reactive problems: classical societal challenges, or challenges from the scientific community or our product groups coming to us directly with problems that they need help to solve. And as we move up the Y dimension, we approach the more classical open-ended search for truth and understanding and beauty that marks classical basic research.

And so then in this space, I cut this into quadrants and attach names to them, because names are very useful things for senior business leaders to hang on to. In the upper right quadrant, we have the classic kind of long-term, open-ended research, which I've called here, for the benefit of our business executives, Blue Sky Research. In the lower left, we have very mission-focused research activities. In the lower right, we have what you can think of as our drive to do the best at what we do: continuous improvement, often marked by a fairly deep, long-term commitment to grand challenges such as computer vision or machine translation of natural languages and so on. And then in the upper left, our desire to be very surprising and disruptive and produce game-changing new ideas. And so in the lab, what we try to do is embrace the diversity of different approaches to research, have a very open attitude, and try to reward researchers who play a role in any or all of these four quadrants. And then for me, when I work with our management team, I try to challenge our managers to tell me, strategically, how they are investing in and supporting research across all four quadrants.

Now, this is important because you can then justify a lot of what we do by looking across these four quadrants. And these little pictures are just meant to show one little story, but there are hundreds of these stories. In the upper right quadrant, dating back to 1999, we had some early work on MVDR beamformers, adaptive beamformers for audio array processing, just a purely theoretical exercise for this laboratory. At some point, a project was stood up called Project Natal that was attempting to bring beamforming into living rooms; that eventually got green-lighted as a product effort, which resulted in the microphone array in Kinect that allows you to talk to Kinect even in a very noisy room. And then today, we're looking at more and more refinements and extensions of the technology. And so the idea here is that there is a pipeline where all four quadrants are really essential for making good things happen.
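As a rough illustration of the MVDR (minimum variance distortionless response) beamforming idea mentioned in that story, here is a minimal sketch assuming a simple narrowband model with a known steering vector. It is not the Kinect implementation; the array geometry, frequencies, and the interferer are made-up numbers for illustration only.

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR weights: minimize array output power while keeping unit gain
    toward the look direction (w^H d = 1)."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

def steering_vector(n_mics, spacing_m, angle_rad, freq_hz, c=343.0):
    """Far-field steering vector for a uniform linear microphone array."""
    delays = np.arange(n_mics) * spacing_m * np.sin(angle_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

# Toy setup: 4 mics, 4 cm apart, look direction 20 degrees, 1 kHz band.
d = steering_vector(4, 0.04, np.deg2rad(20.0), 1000.0)

# Stand-in spatial covariance: diffuse noise plus a strong interferer at -40 degrees.
interferer = steering_vector(4, 0.04, np.deg2rad(-40.0), 1000.0)
R = np.eye(4) + 10.0 * np.outer(interferer, interferer.conj())

w = mvdr_weights(R, d)
print("gain toward target     :", abs(w.conj() @ d))           # ~1.0 by construction
print("gain toward interferer :", abs(w.conj() @ interferer))   # strongly suppressed
```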
And so there's sort of a philosophy there that's important to us. And for things like astroinformatics, this plays out all of the time. A lot of the infrastructure, the database technology, the visualization technologies, even the programming support that we develop in support of the astroinformatics community, in support of basic science in this area, ends up having direct impact in many other ways. And, in fact, as some of you might know, even in things as mundane as the new version of Excel that will be hitting the market in a few months, concepts from the WorldWide Telescope have had a direct impact and will be realized as new features, even in something as mundane as Office. So the whole thing is a nice cycle.

So that's about all I have to say. If you have questions, great. If not, if you want to get on with your conference, I'm happy to let you go off and do that. Thank you again for being here. It's really, really great for us to see you all.

>> Yan Xu: Thank you.

>>: What was the disruptive technology in the upper left? You had four pictures.

>> Peter Lee: Oh, yeah. So this is a test rig. There were several of these made, three or four of them, with nine microphones. And the nine microphones were used to develop adaptive calibration in arbitrary living rooms, up to four meters in depth, using pleasant tones. And so actually, if you buy an Xbox with a Kinect and you turn it on for the first time, you hear a music tone. It's very uplifting and inspirational. That is actually a calibration tone. And then this was meant to test whether MVDR adaptive beamforming could actually be used at four meters' distance to shut out all of the surround sound noise and all of the other voices in the room and just focus on the one voice that is speaking. And that was very successful. This was tested in about 500 homes in the Puget Sound area. Now, of course, the challenge in moving from there to the Kinect was that there was a 30-cent manufacturing budget, and that meant only four microphones. And, of course, the device is much smaller, and so the distance between the microphones is much smaller. And so it really, physically, can't possibly work. And, in fact, when we submitted the early academic papers, the reviewers uniformly rejected them on that ground. But if you actually own a Kinect, it works remarkably well. And so we're pretty proud of it.

>>: [inaudible].

>> Peter Lee: Right. And thanks for asking the question, because we love to brag about the Kinect, so it's good. All right. Well, enjoy the conference. Really looking forward to seeing what you all do. Thanks.

>> Yan Xu: Thanks again.

>> Dan Fay: A couple of things. Again, I also would like to thank all of you for coming out here, coming out to our nice weather in Redmond. Unfortunately, you came on the one day in the last 50 that we actually had rain, so you got to enjoy that. We're all depressed because we actually wanted to go a couple more days so we could set the new record of consecutive days without rain in the Puget Sound area. But my lawn's happy for the rain, I should say. So as Yan mentioned, I work here at Microsoft Research in our Research Connections group, and I head up what we do around what's called earth, energy and environment. And I joke sometimes that we cover everything that doesn't deal with humans. So we're not dealing with health and well-being or bioinformatics or anything like that. So it's a fun area to play with.
One of the things I wanted to do is go back about eight years, almost nine years ago, when we held an eScience workshop here, the original one that we did, called the Data Intensive Computing Workshop, and there was a keynote by Jim about 20 questions to a better application. It was a really interesting look back to see what we did at that time in this cramped little Sheraton room down in Bellevue. So it was kind of an interesting thing to see, as we started getting people across different disciplines, what actually happens in this. And some of the stuff that came out of Jim's talk and others was around online science and how to deal with computational science. There was also a nice talk by Alex at that time about astrophysics with terabytes of data. So it was interesting even seeing what was happening at that time with some of the work that Alex was doing with Jim and others, and the stuff around the SDSS at the time and SkyQuery and some of the other pieces that changed how we do some of the science, and also had an impact on the way we here at Microsoft and Microsoft Research work with different science communities as well.

So I took these slides, which are actually originally from Jim's deck from that talk, and part of what I did, which is kind of interesting, is look at some of the challenges that were talked about back then; they're actually still challenges right now. We still haven't solved the information avalanche or the data tsunami or the -- what is it, the data flood, or whatever the new term is that we've assigned to this issue of more data as it keeps coming. In fact, I'll also say that we won't, because the speed of the commoditization of some of these sensors and devices and the computation is always increasing faster than we can actually process it and/or even store it. And in the astronomy space, you guys see that in spades, more than any of the other disciplines. Some of the other pieces are still very [indiscernible], even down to the publishing of the data. How do we do it, what are the right ways to do it, keeping the provenance of the information in the datasets. And then, you know, how do we refer back to it. Can we even keep all of it.

As well, one of the key things we talked about a lot back then was the global federation of different datasets, which is still the case, especially with more collaboration not only across universities and research centers, but also across the disciplines, as we go into more multidisciplinary science. And this was something Jim had talked about, which was actually really interesting at the time, really breaking out the roles of different types of scientists within these spaces. At the time, you know, there was still a lot of thinking, especially in other domains, of oh, no, we do it all. We're doing everything. But as you look at where computation has come and the use of informatics in a lot of these areas, there are folks that are very good at collecting data and analyzing within their space. There are folks that create new types of algorithms and new statistical methods to actually analyze their dataset. How do you bring all those pieces together? How do you keep track of all those technological advances in those other domains as well?
And then there are folks that actually understand the plumbing aspect: how do you get things to disk, what are the types of disks, and the networking portion. So it's been kind of interesting to see that this is still the case. We're seeing this breakout happening a little bit more within the areas. The challenge that we see, especially in a lot of the academic spaces, is how do you reward folks in these different areas for their expertise, and how do they get credit for this ongoing work.

So Peter covered a lot of this, which is overall research. Down in the bottom corner, we have a little map of all the different labs as well, so we're kind of spread out: here in Redmond, but also different locations around the world. And so we in our group, Research Connections, pull from a lot of these different areas, and also domains of research, that we can use for scientific interaction. As I mentioned, on the earth, energy and environment side we cover three main areas, I'll say three main focuses that we look at. The first one is around visualizing and experiencing the data and the information. And really, what that is about is how you make a connection with scientific information in the same way that people have a connection with a great piece of artwork, or maybe a red Ferrari or, in the case of my wife, a nice pair of shoes or handbags. How do you actually have that emotional, gut-level connection with that information? So that's one of the things we look at. And a lot of this work came out of the work on the WorldWide Telescope: how can you make something beautiful that people can actually interact with and that actually kind of mesmerizes them. And why can't we do that with scientific information, when we can do it with gaming and with all this other information?

The other key thing we look at a lot is what we call accessible data. There are key pieces in there around the discoverability of the information, the consumability, and also the accessibility of the data and information. So do you even know where anything's at? If you do, can you actually get to it, through the FTP site or through some sort of password, or is it hidden on some sort of share? And then once you can get to it, can you actually do anything with it? And as you look at the different science spaces, and based on the technology acumen of the different scientists, this needs to be done in different ways. How can I quickly get up to speed and have science-ready information and data to process? And the last one we look at a lot is around enabling scientific collaborations. How do you do that in a way that's just part of the overall process?

So this is the Fourth Paradigm part. We've talked about this a lot, but it's always good to revisit that this is where we're at. All of these datasets are in here, more data being captured and, in fact, too much. Then there's this corollary problem and idea as well, that we need to capture all the data and we need to save it all. And when you start looking at whether we're actually going to be able to do that, we see on some of the telescopes that you actually can't: you need to throw it away, you need to process it as it's going by. How do we handle the information just in time, and what is the data that we should be keeping? Part of that is that it's not just about doing more and more brute-force processing.
It's about thinking about the problem earlier. And one of the challenges has been, as we move to this kind of instant access to a lot of information and data, this idea of, oh, we can just run it anytime. And sometimes people forget to think about the problem, what it is they're really trying to do, what the long-term benefit is of the data they're trying to capture, and how it will be used. So, setting up the problem correctly.

So overall, the book's available. We always do a promotional for this since it's something we helped edit and author. It's available for free; you can go to fourthparadigm.org, and it's under a Creative Commons license as well, so it can be reused for other uses. So overall, on the eScience problem, this is still going on as well. How do you ask questions of the information and get answers? How do you pull it from all these different data sources and deal with the different levels of precision, from historical information on down? And how can you do this, hopefully in a smart and intelligent manner, and get to the information query? And the ideal is, you know, just being able to sit, as we kind of joke, in your lounge chair, able to ask questions, having the results come back to you, then being able to write up your paper, submit your next grant request and actually get funded.

One thing just to highlight on the Fourth Paradigm part: Jeff Dozier, who is here, wrote a good article for Eos a year ago, year and a half ago, highlighting the use of the overall concepts of the Fourth Paradigm and the work he does around snow hydrology, and this idea of using both remote sensing and local information, combining all those together. It's actually a quick, three-page read, but it's a very good overview.

One of the other things that's also really key, and that has come out of some of the work we've been doing with different groups, is the overall value of information: this idea of the value of it versus the amount of time spent processing it versus who's doing the work on it, and how the value increases as more work's been applied to it. And that's kind of obvious, but one of the things we sometimes forget is how to also reward those people along the way who are doing that hard work to get the data to the science-ready point. And a lot of times, especially in some of the ecological areas and outside of large institutions, there are a lot of people doing this painstakingly on their own, or these are hard-won datasets because they actually had to go out in the field and get them and then process them, and you have other people aggregating them together and doing the work. But as it's moving up, how do you add that value to it, or make sure that that value's usable for different folks? And so this is something we also look at, to say, are there ways we can help the overall academic community provide ways of rewarding different groups in this space for their activity.

One thing I want to talk about real briefly is something that we've looked at when we were dealing with some of the environmental spaces, and that we kind of positioned as the environmental ecosystem. But it actually can be applied to a lot of different spaces as well.
And it's this idea that you have the knowledge of the earth, in this case, the real scientific data and information, and based on that, you also want to have some sort of action being taken. So you want the knowledge that's gained from the scientific understanding of that real-life area to actually inform policy-making decisions or whatever use cases you want to have changed. Well, as we looked at it more and broke it down into more areas, it evolved into these two pieces. One part of it is there on the left-hand side, which is the traditional, say, scientific side: collecting data, processing data, running models, actually producing output of the information and the data, and doing more analysis on it. And then you have this really interesting part, which is the part the humans come into, which is the insight from that data and information. And then that insight gets written up, maybe in a publication, and gets submitted or made available to others, and then it goes back to producing more data, and we kind of keep cycling in that area. And maybe somewhere in there the data gets made available to others, or the publication does.

But for any of that information to actually have an effect on, I'll say, the action side, which could be policy decision making or the general public or other areas, it has to be communicated in a way that they can actually consume and actually understand. And a lot of times that can't be done in the normal publication mechanism of a paper; are there other methods that could be used? And on the right-hand side, the action side, there's also that decision-making portion. So they look at the information, they make some policy decision, or even a consumer might make a change in behavior based off some information they've learned, and then they will hopefully actually change their behavior and implement it. Well, how do you actually track that those implementations are being done? You actually have to go back and do more tracking of the information. And this is especially the case with policy decisions; for some of those, the last thing you want is a policy written up and implemented and no one knows if it works. And we really look at this as an area where technology can hopefully help a lot.

But one of the other things we found was that making that data and information available to somebody on the right-hand side, so they can consume it in different ways, increases the credibility of that data and information, whatever you're trying to communicate. Even if they can't really understand the data, the fact that they can get back to it, actually see it and go play with it, increases the overall credibility of it. So finding ways to connect between those two areas, and communicating in ways that make sense to people. It doesn't always have to be dumbed down or things like that, but put in a context that they can consume. That's a key thing to think about. And it gets even more complex when you look at the overall picture. So if the boxes in this case were policymakers, they're getting bombarded by different messages from many different people and constituents and things like that.
So how do you make sure that the message, or the information itself, that you're trying to communicate to the policymakers or the general public actually gets through? Again, it's got to be in a way that they can consume it and understand it.

So let me cover a couple of other things. We talked about some of these issues around the ecological data flood, and more and more data that's coming; we're seeing this happening, and we need more algorithms helping with some of these and processing them earlier. And part of the challenge that we look at is how you help this across the domains. So what can we take from one domain and utilize in another one? You're trying to deal with everything from field-based data, manual measurements without the precision of some of the digital instruments that we all deal with, all the way up to satellite and model outputs, counting of information, and even back to the historical photographs in some of these ecological cases. So you have a lot of different types of data and different datasets within here. So dealing with all the challenges related to those, about combining them together, is one of the key things that we try to look at: finding where we can make that easier and have tooling to help as well.

So why is this important? Well, because traditionally you want to understand where the data came from, how it got processed, and what the right ways are for it to be accessed or made available. What's the uncertainty of the different datasets once you start combining them together, and does that propagate all the way up? Also, the part on data sharing. Sometimes we deal with environmental datasets from organizations that may not want them totally public, or at least not the exact location of where all the information is. If you have information on where all the teak forests are, do you really want to have that published on the internet for everybody to find, so somebody can go log it and clean out the remaining teak forests? The same is the case with tracking of different animals. Do you want everyone to know where all the koalas are at every moment? Other endangered species, even. So there are a lot of challenges in there. How do you actually handle that data and make it available for others to use? The same goes for some of the human datasets: how do you do that in a way that's maybe anonymized, or not giving out all the exact information?

But one of the things that's interesting is this part down here: we see that the science really happens when you start bringing together these multiple datasets of information. And part of this we learned when we were looking at some of the astronomy information and what was happening there, which was that it wasn't just about a single telescope or a single band or type of dataset you were getting down, but it started moving into combining across all the different wavelengths and the information you were getting. So if it was visible or infrared or microwave, you get information from each one of those, and how do you actually find the signal through all that noise of signals, right? Combining all those together is the key part. And you're seeing this even more now happening in these environmental areas as well. So we look a lot to things that are happening in these other disciplines that are maybe a little farther ahead. How do we learn from those as well and apply them?
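Combining catalogs from different instruments or wavebands usually comes down to some form of cross-matching. As a rough, generic sketch of that idea (not any particular SkyQuery or Microsoft tool), here is a positional cross-match between two small catalogs using a k-d tree; the column values and the matching radius are made up for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def cross_match(ra1, dec1, ra2, dec2, radius_arcsec):
    """Match each source in catalog 1 to its nearest neighbor in catalog 2,
    keeping pairs closer than the given radius. Works on unit vectors on the
    sphere, so it behaves sensibly near the poles and across RA wrap-around."""
    def to_xyz(ra_deg, dec_deg):
        ra, dec = np.radians(ra_deg), np.radians(dec_deg)
        return np.column_stack([np.cos(dec) * np.cos(ra),
                                np.cos(dec) * np.sin(ra),
                                np.sin(dec)])
    xyz1, xyz2 = to_xyz(ra1, dec1), to_xyz(ra2, dec2)
    tree = cKDTree(xyz2)
    # Chord length corresponding to the angular matching radius.
    max_chord = 2.0 * np.sin(np.radians(radius_arcsec / 3600.0) / 2.0)
    dist, idx = tree.query(xyz1, k=1)
    matched = dist <= max_chord
    return np.flatnonzero(matched), idx[matched]

# Tiny illustrative catalogs (RA/Dec in degrees) from two hypothetical surveys.
ra_a, dec_a = np.array([10.001, 45.2, 180.0]), np.array([-5.0, 20.1, 33.3])
ra_b, dec_b = np.array([10.0008, 300.0, 45.2001]), np.array([-5.0001, 12.0, 20.1002])
i_a, i_b = cross_match(ra_a, dec_a, ra_b, dec_b, radius_arcsec=2.0)
print(list(zip(i_a, i_b)))   # pairs of indices matched within 2 arcseconds
```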
And this was just one of the projects we ended up doing early on, a number of years ago. We were using some of our cloud computing early at the time, Azure, to process some of the datasets from, in this case, MODIS, doing some of the reprojection and some of those pieces, and the calculation on it to get the evapotranspiration. We were doing this in conjunction with UC Berkeley, and Youngryel at the time was trying to do this on one machine, and would never have been able to expand it out to do the 30 years of datasets at the 1-kilometer resolution that we're looking at. And so we were looking at how we could bridge that gap, and then what we can learn from that to also make it easier for others to do this. And what technologies need to be in place to allow that to happen. And so it's kind of an interesting one to look at: what is the real process that they're doing, what are they really trying to have as an outcome, and how can we help, and how can the technology help in that way? But actually make it about, again, the science driving the work, or about the insights coming from the science, and not about what the technology could do. So not trying to overexpose the technology, but how do you actually make it do what the scientist wants.

A couple of other things we look at now with some of the stuff that's happening: we have some work going on in our SQL team as well on the Azure side, trying out new ideas. So we have things we try out here; they're actually trying out some ideas, testing them now. These are available to use even right now. They're putting some numerics out there now, so you can call those remotely and utilize some of the numerical engine and the libraries that we have available. There is also some work that they have going on looking at exploring data. So can you help people mash up the information in the datasets together? Again, one of the things we look at is whether we can span everything from the lower end of folks on the commodity -- on the technology curve, to folks that do programming. And where is it that they need some of that help? So are there visual ways to help clean and organize it?

One of the things that's also interesting is a piece they're doing around what's called Data Hub, which allows organizations to have their own marketplace for information and data. So it's a way to actually publish it within a community, or maybe within an organization as a whole or an enterprise, and share that information just within them. So you can really restrict access and do some of those pieces as well, but not have to worry about people finding it and what the access points are. And then there's some work also on trust information, to be able to encrypt some of the data as well. So there's a bunch of work also going on on their side.

So on our end, we focus, as I mentioned earlier, a lot on this overall piece of discoverability, accessibility and consumability of the data. And how can we enable that both for the user, the person wanting to consume the information, but also for the person that wants to publish the information? We find this a lot in the space where you have users that have small amounts of data, and they want to make it available because it's very useful information that others could use. How do they make it so that it's available and discoverable? And what are the right formats? And how do you move that so it's earlier in their collection cycle?
So they're not actually having to go back and add metadata to their description of the information and blah, blah, blah, all those things that no one wants to do after the fact. So how can you move it earlier, and are there ways to tag the data and information? One thing, just as an aside, that's really interesting: we'd like to actually have this happen at the same point where you're teaching people how to deal with digital data at the beginning, the same way we do it when teaching people how to collect physical samples. In the geological areas and some of those different areas, we teach people how to deal with physical specimens, how to collect them, how to make sure you're getting all the information, and how you would do that over time. How do we do this for digital data as well?

So we also have a bunch of tools coming from different groups. This is an example of one from our group in Cambridge; we have a computational ecology and environment group there. In this case, we've been looking at FetchClimate, a project around climate information and how we make that available to others. It brings together all of the different datasets that people have used, a lot of the different records of them, and then applies, behind the scenes, behind the interface, ideas of how to deal with the uncertainty. So if you're asking for information on a certain area, how do I provide that to you in a way that limits the uncertainty across the different datasets that it's coming from? And then you're able to very quickly look at the information, maybe some images online, and then also be able to download it. So, looking back at that discoverability and accessibility, and then really quick consumability: a quick, easy way to do it for a lot of the folks. And you can also use this, in their case, in a programmatic way as well. So you can either do it in an interface or programmatically. We're trying to find ways that you can make, again, this easier.

We've also been looking at things around what we call the visual informatics framework. This is something Yan's been doing a lot of work on: looking at how we can help make access to a lot of these datasets quicker and easier. This is really a lot of the time the case that we find in environmental and other areas: you end up having tons of different applications on many different platforms, and you have access to different services and different types of datasets. And so one of the things we looked at was this protocol that's been developed called the Open Data Protocol, which is a way to send the data in an almost self-describing way across the internet, across the network, so you can actually do queries remotely and access different datasets without having to have a library installed locally. And it's available now; it's gone into OASIS for standardization. One of the reasons it's interesting to us is that when we look at the history of how a lot of the protocols have come across within the internet, the web-based ones, SMTP, other ones, having simple ones that could be implemented on any platform and be usable right away is really key. And OData is based off of Atom feeds and RSS feeds, so it allows you to subscribe to information and data; you can refresh the data later on when you want a new update, without having to go back out and figure it all out again, because the subscription stays with the data.
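As a rough sketch of what a remote OData query can look like from a client's point of view, here is a minimal example using standard OData query options ($filter, $select, $top) against a hypothetical service URL; the endpoint, entity set, and field names are made up for illustration and are not any actual Microsoft service.

```python
import requests

# Hypothetical OData service exposing a "Stations" entity set.
SERVICE = "https://example.org/odata/Stations"

params = {
    "$filter": "Country eq 'US' and Elevation gt 1000",   # server-side filtering
    "$select": "StationId,Name,Latitude,Longitude",        # only the fields we need
    "$top": "50",                                           # first 50 results
    "$format": "json",                                      # ask for a JSON payload
}
resp = requests.get(SERVICE, params=params, timeout=30)
resp.raise_for_status()

# Older OData services wrap results in "d"/"results"; newer ones use "value".
payload = resp.json()
rows = payload.get("value") or payload.get("d", {}).get("results", [])
for row in rows:
    print(row["StationId"], row["Name"], row["Latitude"], row["Longitude"])
```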
There are some other nice pieces in there, and they've added, in the case of the environmental space, geospatial data support. And while it doesn't solve everything, one of the things it does is move things forward, beyond people sending CSV files or comma-delimited files that have no type information when you send them, just the data. This at least sends type information with it, so you can do some sort of reasoning about the information as well. So again, as you look at how protocols build up, this was interesting to look at right away, so you can quickly start getting to it. We have a bunch of products and technologies that are building on that as well, but it's just an interesting space.

The other thing I just wanted to cover was looking at new ways to analyze and communicate data. Everyone here should know about the work on the Sloan Digital Sky Survey, and some of the work coming out of that as a new way to publish information and datasets. But one of the things that we actually found very interesting about it, once we started looking at it, was that not only was it a way to publish information out for many different communities, but it could then be reutilized for things like Galaxy Zoo as well. So you could actually use the same exact datasets, position them a little bit differently, put a few boundaries around them, and make them available for others to utilize in another type of activity, bringing more, in this case, citizen science users and others into it. So not so much to do the deep research, but to actually get them interested and also have them participate in it. So are there other areas we can do that in? We look at this in other environmental areas and datasets, applying some of these same ideas and saying, you don't need to have a completely different dataset. You can use the same dataset, make queries against it differently, and provide it in a way that can be consumed by that constituency, in a way that they can actually handle.

And then there was the work through Jonathan and Curtis and many others looking at bringing together the work for the WorldWide Telescope: how do we bring these datasets into one area, and how can you experience the data? So when I was talking earlier about our group looking at experiencing and visualizing the information and data, this was part of what we look at: how do you experience it, through that visualization of it? So there's been a bunch of work that's gone on on the integration of the datasets within there. Easy access, a quick overview of being able to look at the data, having it accessible at your fingertips and not having to worry about where it's at. We've done a bunch of work on adding a new API for the extensibility of it. We've even added an Excel add-in so that a lot of folks could very easily publish data out. It wasn't originally developed so much for the astronomical side as we were looking at it, but we see that it could be used there. We were also looking at how people could take information and put it directly onto the earth from data points that they have. And one of the key things that really came out of it was this idea of the tours, for sharing the information, the data. But it's really about telling a story about the data, and a way to share that data.
The experience that you have as a scientist or a user of that information, and sharing it with others. And that is really powerful, because it gives an explanation of the information without somebody having to read a long paper or other material; they can use many different types of media to get to it. And then there's a lot of work that's been done lately using it in planetariums. Jonathan's done a lot of work making it easy to use within the planetarium space. And, in fact, the Cal Academy has a project going on around earthquakes, a show going on, and it's being used in there as well for their planetariums.

Just a couple of the visualizations we've been playing with lately, to see how we can make it easy for folks to share this data and the visualizations of it. On the lower left, this is a slab model that we were working on with the folks from the USGS. They create these slab models of where they think the slabs are, and then you can visualize them with the earthquakes as well, to see whether they're actually in range or not. And these are ways that they hadn't been able to visualize or deal with the data before. We also, this last year, did some work around this thing called Layerscape, which was a way to actually publish the tours, so anyone can publish these tours in the space and share their interpretation of their data and their datasets. We wanted it to be a quick, easy way to share this information. But what's also interesting about it is that when you do it that way, you're not only sharing the tour and the interpretation of it; by sending the tour to someone else, you're actually sharing the handle to the data as well. Because the tour has the data within it, you can go into the layer manager, right-click on the data, and actually copy the data out or get to the information. You can also put your data in with it, so you can see your data in conjunction with the other datasets. And so for us, it's an interesting way not only of sharing your interpretation, but also of sharing the data with others.

I talked about the Excel portion, and then the other piece we played with as well was kind of this -- hopefully this will play. There we go. You know, looking at how we can use things like Kinect to interact with the WorldWide Telescope, in this case, but really to interact with scientific information and data. How can you take advantage of something like Kinect to do that? It's something we really wanted to look at, to say, okay, we have these devices, as Peter mentioned, something we did; it's being used in gaming and in living rooms. Can we use it for scientific information and data, and actually allow someone to very quickly navigate through some of these datasets? So it's something we want to keep playing with and looking at, and we actually had some interns over the summer playing with it as well in different ways, to look at other datasets.

In a lot of this area, some of the next steps we're looking at: we're adding in more functionality for some of the earth-based datasets, NetCDF and other datasets that folks have been asking us for. We're looking at new clients, new implementations and some new interfaces, both for making it easier to consume and to create tours. And then we're also doing some stuff around looking at Azure for different pieces.
So one of the things that we're really interested in is this platform as a service, and some of the work especially around using things like Python directly through Azure. There's a lot more of the work going on in Python and some of these codes; can we take advantage of those as well? One plug real quickly: we have an eScience workshop coming up in Chicago, and Ian will be there for the overall one as well; it's in conjunction with the IEEE eScience conference. If folks want to come join us, we'll be there to hopefully do some of the cross-domain discussions.

A couple of things I just want to cover in conclusion, which I always think make an interesting discussion point. With all this digital data, it can be open, but who ends up paying the cost for all the spinning disks and the bandwidth and the cooling and things? And so what are some of the mechanisms that can be used, and should we be looking at other areas, like the tolling that you see on roads or, in the case of other countries in Europe and elsewhere, the licensing of TV signals, where you pay for the broadcasting infrastructure? It's just an interesting concept to see where this goes, because if you keep looking at the size of the data increasing, yes, we'll be able to build off more of the commoditization of some of these things and get more and more per drive. But you still have all these other costs going on, and how do all those get covered, and how does that happen if everything's kind of, you know, online somewhere? It's just a thing we look at, to see whether there are interesting economic models that can be brought into it as well, not only to cover those costs, but also to help with the economics of the different groups who are creating these datasets.

So, again, going back to the original slides looking back eight years: still some of these same challenges on algorithms that will scale across many different datasets, especially as the datasets increase. We still think there are areas for thinking about the overall data and the retention of it. Do we keep everything, how long do we keep it, where do we keep it? And dealing with the visualization of information. One of the key things, I think, in any of these is working across the domains and sharing a lot of these best practices, which is not something you get in the traditional conferences for the domains. You just don't get that. And the other thing is, where do you find places to not only share the information, but not have people just talk about what I call the chest beating: this is how great we were, this is how everything worked, everything worked perfectly. And then you go later on, well, you know, we had these little hitches and gotchas. So where do you find the place where it's okay for people to actually have those discussions, where it's not going to be about the paper or those things, but really about, here are the challenges that are still happening, we need to solve these, does somebody else have an idea, or what's worked in other domains?

The last slide, just covering some of the other things we're looking at: the balancing of these things, everything from the data to the bandwidth to the storage and processing. It's like a three-legged stool that's out there, and they need to be in alignment for it all to work.
And the challenge is that they're never going to be in alignment, because you're always going to have more data coming in from more and more sensors, so you have to increase the bandwidth speed; oh, and then can you deal with the processing and the storage at that time? How do you handle all this in a way that makes sense? One thing we look at as well is whether we can push a lot more of that computation back towards the sources. How can you actually do that more directly at the sensor, so you're not just filtering information and data, but maybe processing and aggregating it closer to the source, so that you're not just stuffing it into data storage and hoping someone later on will process it? And then there are the challenges of how you create some of these new types of scientists to actually deal with the different datasets in different ways and process them, coming up with new ways of applying algorithms and other methods to them. And the other one we look at is how you continue to ride the commodity curve. If you look at what datacenters and cloud computing are really about, it's about riding those commodity curves of disks and processing and networking. And so how do you do that in the scientific area, both for sensors and for the other datasets that are there, so you can get the benefit from that?

So that's kind of my quick overview of some of the stuff that we look at here within our area, and there will be other talks later on about the WorldWide Telescope and some of the other ones. If there are any questions or anything, I'd be glad to answer those. Otherwise, thanks very much, and thanks so much for being here as well. So thanks.

>> Yan Xu: Questions?

>>: Well, an important part of your presentation was obviously visualization of [indiscernible] data. Have you developed anything like that handbook of principles, something like [indiscernible] work, only expanded for the data challenges we have now?

>> Dan Fay: No, we haven't. There are many of them out there, from, like, the visualization communities or the large visual analytics community and some of those that have done a lot more within those spaces. What we're also looking at is just how we're doing it in some of the areas we're in, and whether we can write it up in quick little blog articles or some of those. Maybe putting those together in a more expansive way, on here are some of the things we have just learned, would be good as well. The other thing we do look at, though, and Jonathan's really good about this, is how we take advantage of and make the most of some of the GPUs, and how we take advantage of that technology that everyone has on their devices. Because that's really where the magic kind of happens to give you that experience. It's still not the thing you can get completely from doing it through a browser yet, or remotely. So there are things like that that we should probably also communicate. Actually, Jonathan does, because he does some of that through, like, the AMD conference and some of the other ones. He's actually talked about some of those, how to use it, what they've learned. But yeah, it's good.

>>: There's nothing like [indiscernible] book.

>> Dan Fay: No, no.

>>: You mentioned cross-disciplinary work. So in your opinion, which are the areas of eScience most closely aligned with astronomy that astronomers can work with to push forward together?

>> Dan Fay: You know, so that question comes up a lot, especially when I talk to the astronomy community. And it's --

>>: [inaudible].
>> Dan Fay: I know, and I'm not sure why. It's actually interesting, because there are a couple of things that are kind of unique that we see when we look at the astronomy community, and kind of physics as well, just in general. A lot of the work revolves around, let's say, a small number of big sensors. And so there's a nice piece about that, about having all the focus on those and how to deal with them at one time. So you have a lot of people dealing with them at one time. One of the challenges with a lot of these other areas in the environmental space, except for, again, some of the ones that do satellite imaging and some of those, is that it's lots of small sensors. But the techniques that you guys have come up with, like the SkyQuery idea, being able to query across multiple datasets, using and doing cross-joins and cross-matches in a unique way, are the types of things that we look at to say, hey, this could be replicated in other areas as well. One of the things we also look at is that you want the people who actually deal with the data to curate it and actually to own it, to keep updating it. There's something about having that ability. So you don't want it all going to one big, huge repository where it just sits. And so, to get back to your question, there aren't very many others that are doing that similar piece that you guys are doing. So I think that's helped. And we actually use a lot of you guys, the astronomy community, as examples of how you can do it: the multispectral type of analysis, bringing those together, having distributed datasets, having inclusion of the data with the papers as well, and being able to get to both of those. So I don't have a good one that you could -- maybe someone else does. Jonathan?

>>: I think part of it depends on whether it's data, visualization, analysis. There are different analogies. So other folks have put together, like, medical imaging, some sort of visualization analysis. Other people have looked at data with physics projects and cross-correlation. The other thing that's also interesting is, whenever you're dealing with any cross-discipline, the grass is always greener. Here are all these people: oh, yeah, the astronomy people, they have it all together. You know, oftentimes they see the work that's done, and they often see the good parts of it that seem successful, but they don't always know how much pain was involved in it, or maybe how much still needed to be worked on to make it work.

>>: There are some ways in which the data is simpler. Also, because it's always been large scale, the argument that you have to spend money looking at it has been won [indiscernible], and that's a radical thing in some other areas.

>>: You have to really think about it.

>> Dan Fay: And there was something else about that, the samples. Oh, well, they just sit there. They sit on the shelves, yeah. But the challenge, I think, with a lot of these is that you want to have those long, longitudinal studies, and how do you deal with those and keep them for a long time? And I think all the sciences are still struggling with that.

>> Yan Xu: Last question.

>> Dan Fay: We have a discussion on this later.

>>: I just want to make a comment. Don't feel like you have to respond to this. When you were talking about the sustainability issue, it reminded me of a paper I saw last year on data furnaces, and I just looked it up again.
It has two researchers from the University of Virginia and four from Microsoft Research. And the idea is that instead of worrying about cooling your data storage devices, you install them in people's homes and use them to heat people's homes. So they're not spending additional energy to heat, and you're not spending additional energy to cool; the heat that's generated is naturally used in that environment. And people get some kind of a discount on their taxes or gas bills or something. I don't know how it works, but it was a pretty interesting little idea. If you want a fun thing to read, look up data furnaces.

>> Dan Fay: Good.

>> Yan Xu: Thanks again, Dan.

>> David Reiss: All right. Well, my name is David Reiss. I am a research scientist at the Institute for Systems Biology here in Seattle, and I'm excited to be here. Maybe just for a little street cred: I got my Ph.D. in astrophysics back in 1999, working on supernova searches. So hopefully I can offer a little bit of fodder for discussion about the things that I think biologists can really learn from you guys, as well as potentially the other way around.

So when I was first asked to give this talk, naturally the first thing I did was to look at Wikipedia to see what bioinformatics actually meant. And as you can see, it's got a pretty broad description, and I decided that I wasn't going to be able to talk about it all in the 20 minutes of time. I actually added a couple of things that I think are important at the bottom there. So given my particular area of expertise, I thought I'd focus on one thing, what I consider to be an important integrative aspect of bioinformatics, or computational biology, as I often refer to it. But it's also going to cover a wide range of the other specific areas of informatics that are dealt with in biology.

So these are the challenges that I came up with that I thought were particularly in contrast to the way I think of astroinformatics or astrophysics data analysis. In particular, the increase in data size isn't necessarily different, but in biology you're typically dealing with a very wide range of different types of data that need to be dealt with in different ways. And often, for the different types of data, there are different informatics experts who have come up with methods for modeling them, understanding the noise, understanding the experiments that we use to create that data. And it's important to understand, whenever you're looking at biological data, biological information, that oftentimes you're confounded by things that you're not aware are going on in your biological samples. That what you're observing is often not as important as what you're not observing.

So to start with a specific story, I thought I'd go over some basic biology. Most of you probably remember the central dogma from your high school biology classes. That was about as much biology as I knew when I started at the Institute for Systems Biology. It basically describes the basic way that the cell uses the information in its genome to drive the creation of the molecules that do all of the rest of the work for the cell. So as you all know, the information is arranged in a linear sequence of letters along the genome, and these contain coding regions, or genes, and non-coding regions.
Those regions are transcribed into messenger RNAs, which are molecules that carry the message in order to do additional information processing. And the standard theory basically then says that those messenger RNAs are then translated into proteins, which are basically the machines that make up the activity of the cell. These are receptors; they do signaling, they do additional information processing and regulation. So that's the central dogma, and, of course, this is a very, very simple overview of what actually goes on in the cell. There's actually a whole lot more going on. These proteins go back and bind to the genome to regulate which genes are then transcribed and which genes are turned on or off, essentially. Additional proteins bind in combinations, and it's important to recognize that each of these combinations occurs many thousands of times across the genome, and they're confounded by different cell types, different environments, different experiences that the cell has received. And so bioinformatics is basically trying to use data in which all of these processes are observed to integrate it and make sense of what's going on in the cell.

>>: Is there a reason you're calling it dogma?

>> David Reiss: I think it's a pretty apropos phrase. And currently, most biologists think of it as kind of a historical paradigm. So as I'll talk about, there's a lot more going on, and biologists recognize that. But the difficulty is trying to actually observe it, trying to make sense of it. So one of the reasons that I brought this up is because these messenger RNAs -- I don't know if there's a laser pointer -- the messenger RNAs are one of the things that we've actually been able to measure reasonably accurately for a longer period of time. Thanks to the sequencing of the genome, we have these things called microarrays, or gene chips, which you may have heard of, where you can essentially measure the relative levels of all of the messenger RNAs in the cells in a sample. And you can do this across varied conditions or varied cell types, and typically these are converted into a matrix like what I show here. And then once that matrix is built, that's where the information processing occurs. So a typical analysis involves cluster analysis or support vector machines or other types of more sophisticated analyses that I won't have time to go into. But this has been going on for about ten to twelve years, and it has really driven the informatics and data-processing breakthroughs that have been going on in biology for the past ten years. So as I said before, this is the standard model, but there's a lot more going on, and in many cases, those additional bits of processing, or those additional types of interactions, are basically ignored, more or less because we don't have adequate data observation technologies to observe them. Microarrays are basically our best tool at this point. So we can observe these messenger RNAs, we can measure them by the thousands, and we can use them to infer what's typically referred to as a transcriptional regulatory network, which is a network of interactions by which these regulator proteins turn on or off other genes at the messenger RNA level. And basically, that's done by just assuming that by measuring their messenger RNA levels, you're measuring their protein levels.
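As a rough illustration of the microarray workflow just described -- a genes-by-conditions expression matrix followed by cluster analysis -- here is a minimal sketch in Python. The random matrix, the correlation distance, and the choice of 20 clusters are all placeholder assumptions; the point is only the shape of the computation.

    # Minimal sketch: cluster genes by the similarity of their expression profiles.
    # The matrix is random, standing in for real log-ratio measurements.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    n_genes, n_conditions = 500, 40
    expression = rng.normal(size=(n_genes, n_conditions))  # rows: genes, cols: conditions

    # Hierarchical clustering with 1 - Pearson correlation as the distance,
    # average linkage; then cut the tree into at most 20 clusters.
    tree = linkage(expression, method="average", metric="correlation")
    labels = fcluster(tree, t=20, criterion="maxclust")

    print("genes in each of the first 5 clusters:",
          [int(np.sum(labels == k)) for k in range(1, 6)])

In practice the matrix would hold measured expression values, and the clusters would then be inspected for shared regulation or function.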
That's a whole other issue, because there's a whole lot of processing that goes on between the levels of messenger RNA and protein, and these are all things that are basically ignored. But surprisingly -- and this is a recent publication that just came out in Nature Methods -- there have been hundreds of publications from people just taking this type of data and trying to infer the networks of interactions that are going on in the cell. And what's been recognized now is that different types of analysis methods do a better or worse job at different aspects of this problem. And so ensemble approaches are really starting to gain traction in biology. What I show here is an example network that was inferred by an ensemble of methods developed by over 70 different teams of bioinformaticians to create this network of associations in E. coli. This is the best network that has been inferred to date. And depending on your perspective, it's either great or pretty sad, because it only infers interactions for about one third of the genes in this relatively simple microbe, and the estimate was that the predictions have about 50 percent precision. So here are some examples of how some of the different algorithms did at making these predictions in E. coli. And you can see that by integrating them -- which is the black bar on the right, the groups taking all of the different methods and integrating them -- they do basically a better job than each of the individual groups was able to do on its own. Now, one of the things that this paper showed was that if they gave these teams synthetic data, where they basically simulated reactions and created data for these teams to infer networks from, the teams did about three times better on the synthetic data than they did on the real data. And that basically shows how many things are going on in the cell that we don't have any idea about and that we're missing. And then it gets even worse when you go to even slightly more complex organisms, like baker's yeast, which is about two times more complex. Most of the teams do no better than random at inferring these networks. So in terms of these computational methods, there's still a long way to go. And as I said before, the reason for this is that there's a lot going on in the cell that we do not yet have the technologies to measure at the same rate that we can measure the mRNAs. So we would like to be able to measure translation and protein levels and small molecules, and get quantitative measurements of what the cell is doing at different times. Those shapes there are supposed to represent rates, so we'd like to have rate equations for all of these interactions. And basically, the types of data and the types of analysis that we can do are endless, even for these simple organisms. Fortunately, we have lots of new technologies that are still being ramped up. Thanks again to having the genome sequence, we can now measure protein levels using mass spec. It's not as well developed a technology as microarrays, but we can use very expensive mass spec machines to measure the spectra and identify which proteins and which small molecules are in the cell. We can actually measure physically which proteins are binding to the genome, and this is very important in terms of regulation.
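A minimal sketch of the ensemble idea described above: several inference methods score the same candidate regulator-to-gene edges, each method's scores are converted to ranks, and the ranks are averaged into a consensus ordering. The edge names and scores here are invented, and rank averaging is just one simple way to combine methods.

    # Minimal sketch of a "community" ensemble for network inference:
    # average the per-method ranks of candidate edges.
    import numpy as np
    from scipy.stats import rankdata

    edges = ["regA->geneX", "regA->geneY", "regB->geneX", "regB->geneZ", "regC->geneY"]

    # Confidence scores from three hypothetical inference methods (higher = more confident).
    scores = np.array([
        [0.90, 0.10, 0.40, 0.80, 0.20],   # method 1
        [0.70, 0.30, 0.60, 0.90, 0.10],   # method 2
        [0.95, 0.20, 0.10, 0.60, 0.50],   # method 3
    ])

    # Rank within each method (rank 1 = most confident), then average across methods.
    ranks = np.vstack([rankdata(-s) for s in scores])
    consensus = ranks.mean(axis=0)

    for edge, r in sorted(zip(edges, consensus), key=lambda pair: pair[1]):
        print(f"{edge}: average rank {r:.1f}")

The appeal of the ensemble is exactly what the talk describes: methods fail in different ways, so a consensus ranking tends to beat any single method.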
We can observe proteins interacting with each other, binding to each other in the cell, and create these networks. And one thing I want to point out with this slide is that each of these data types has its own dedicated journals, basically, with teams that develop methods simply for analyzing the data, processing it, and converting it into information that modelers can then use. And visualization as well, network analysis and things like that. The possibilities are endless. So this is only going to get worse. Genomic data is getting cheaper and cheaper; I think the rate is probably surpassing what is going on in astrophysics. When the first human genome was sequenced, it took about ten years and three billion dollars. And now, we'll soon have desktop machines that, for a thousand dollars, can sequence a human genome in a matter of a day. And each of these sequencing runs results in probably 50 gigabytes of data, if you include all the images and things that are generated. And so, if you imagine, this is actually kind of a different paradigm from what happens in astronomy; it's very democratized, very decentralized. Every lab or researcher is going to be generating massive amounts of data, and they're going to need to know how to handle it, how to store it, and how to process it.

>>: What happened in 2007?

>> David Reiss: In 2007, that was essentially when commercialization took off. So there are a number of different companies that have developed their own methods for sequencing, where the technologies are essentially a different paradigm from the original sequencing.

>>: [inaudible].

>> David Reiss: Pardon?

>>: Capitalism, learning how to make money from this.

>> David Reiss: Exactly. Of course, there are huge biomedical implications, you know, going to your doctor and being able to submit a blood sample and get your genome back. And what they do with that is a different issue altogether. But that's the goal. So the thousand dollar genome has kind of been the goal for a long time, and we're almost there. So these new high-throughput sequencing technologies are enabling a whole wide range of additional technologies. This is a plot of a type of data that I actually developed a method for analyzing, a method that was kind of inspired by my astrophysics background. It involved deconvolving measurements that were made at high resolution across the genome to identify, at high precision, where proteins are binding across the genome; the X coordinate there is a genomic coordinate. There are additional technologies for measuring transcript levels at high resolution across the genome. Now, there are these genome-wide association studies, which will be using these very inexpensive human genome sequences to try to identify mutations in the genome that might be associated with disease. One of the main issues with this is that there's a high rate of false positives. And so every day, you probably see a newspaper article that talks about a new discovery of a gene mutation that's associated with Alzheimer's or something. And then a couple of months later, you don't see the retraction when the association turns out not to be statistically significant. So there's still a lot of work going on there.
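The deconvolution approach mentioned above (recovering protein binding positions from a high-resolution signal along the genome) is not spelled out in the talk, so the sketch below substitutes a much simpler stand-in: a matched filter against an assumed Gaussian peak shape, followed by peak calling on simulated data. The positions, peak width, and noise level are all invented for illustration.

    # Simplified stand-in for locating binding sites in a genomic signal:
    # correlate the observed profile with an expected peak shape, then call maxima.
    # This is NOT the deconvolution method described in the talk.
    import numpy as np
    from scipy.signal import find_peaks

    rng = np.random.default_rng(1)
    n = 2000                              # genomic coordinates (arbitrary units)
    true_sites = [300, 800, 1450]         # hypothetical binding positions

    # Expected peak shape: a Gaussian of fixed width.
    width = 30
    kernel = np.exp(-0.5 * (np.arange(-3 * width, 3 * width + 1) / width) ** 2)

    # Simulated observed signal: peaks at the true sites plus noise.
    signal = np.zeros(n)
    for s in true_sites:
        signal[s] = 1.0
    signal = np.convolve(signal, kernel, mode="same") + rng.normal(0, 0.1, n)

    # Matched filter, then peak calling on the filtered track.
    filtered = np.convolve(signal, kernel, mode="same")
    peaks, _ = find_peaks(filtered, height=0.5 * filtered.max(), distance=2 * width)
    print("called binding positions:", peaks.tolist())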
And, of course, now there's new high-throughput imaging, where there's a lot of potential cross-talk to be had between you guys and the biology world: imaging cell cultures at high rates in three dimensions. And I think it's a far more complex task than classifying galaxies or something like that in astronomical data. So there are a lot of things that we can learn from you guys on this front. That being said, one of the tasks that I see for computational biology or bioinformatics is taking all of these different types of data and understanding that there are going to be false positives and false negatives and different types of systematics in each of them. But by integrating them, we can hopefully identify where the false positives are in some of the data by cross-referencing them with what we call orthogonal types of data. And by integrating it all into a complex computational model, we hope to generate -- I guess here, this is an old slide, so a circuit or a picture -- an idea of what's going on in the cell so that we can make predictions. Now, of course, biological systems are not electronic circuits, and this is a recent opinion paper published by a neuroscientist, but it's equally apropos to biology, showing that even for simple cells like E. coli, if we wanted to go out and measure all of the potential interactions that are going on in the cell, given reasonable expectations for the rate of increase of measurement technologies and computational technologies, it would still take more than a million years. And for the human genome, the case is even more daunting. So obviously, we're not going to be able to do it this way, and this is where systems biology comes in, as I see it. Systems biology has been around for about ten years, and the mantra is that it's a multidisciplinary science in which we have biologists -- classically trained, you know, wet-bench biologists -- technologists, who are typically engineers or bioengineers who can develop new technologies for measuring these molecules, and then computationalists or informaticists who can take all of the data and try to make sense of it. And it fills what we call this virtuous cycle, where each of them feeds back onto the others. But one of the issues that I think we're still dealing with, in terms of integrating all of these disciplines into systems biology, is the completion of this loop. Basically, the problem is that, as computational modelers or informaticists, we can make as many predictions as we want, given the data. And typically, we don't have sufficient biological knowledge -- I guess we're not close enough to the biological question -- to be able to prioritize these predictions in a biologically meaningful way. We can present these predictions to the biologists, and the biologists say, well, these are too many predictions. We can't test them all. How would we rank them? How would we prioritize them? And additionally, we don't understand the way these predictions were made, and therefore don't necessarily believe them. So what typically happens -- this is kind of a common story -- is that we make predictions, some of which are novel, and they're either biologically novel predictions or they're just wrong.
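One very simple reading of the cross-referencing of orthogonal data types described above is to keep only those predicted interactions that are supported by more than one independent kind of evidence. The sketch below assumes three hypothetical evidence sets (expression, binding, literature) and an arbitrary threshold of two supporting sources.

    # Minimal sketch of integrating "orthogonal" data types: keep only interaction
    # predictions seen in at least two independent sources of evidence.
    # The interaction names and the three evidence sets are invented for illustration.
    expression_support = {("regA", "geneX"), ("regA", "geneY"), ("regB", "geneZ")}
    binding_support = {("regA", "geneX"), ("regB", "geneZ"), ("regC", "geneQ")}
    literature_support = {("regA", "geneX"), ("regC", "geneQ")}

    sources = [expression_support, binding_support, literature_support]
    candidates = set().union(*sources)

    def support_count(interaction):
        """Number of independent data types in which the interaction is seen."""
        return sum(interaction in source for source in sources)

    confident = sorted(i for i in candidates if support_count(i) >= 2)
    print("predictions supported by at least two data types:", confident)

Real integration schemes weight evidence types rather than simply counting them, but the filtering idea is the same.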
And in many cases, we do make predictions that are right and correlate with what the biologist already knows. So then the biologist looks at the predictions and says, okay, these make sense. And so we can write a paper saying that we've made one round of this cycle: we've made predictions based on your data, and the predictions make sense. But what we'd really like is to be able to use the information gathered from these data to prioritize new experiments and really fill out our knowledge about what's going on in the cell. That being said, I think I'm running low on time, so here are just some final thoughts. Essentially, over the past ten years, computation has really become a central part of biology. Many aspects of bioinformatics used to be seen as a service, in terms of storage of the data that comes directly off of the instruments and so on. But now it's an integral part of the research, and biologists are becoming more comfortable with interacting with computational biologists, and vice versa. And at least in multidisciplinary institutes like the ISB, where I work, and other places as well, it's a slow process, but it's really starting to happen. You even see this happening at academic institutions and universities, where there are these new multidisciplinary systems biology programs in which training happens both in performing biological experiments and in doing some programming and analysis, so that the biologist can actually speak all of the languages that are necessary. So that being said, I wanted to just throw this in. Many of you may have seen the paper, or multiple papers, that came out in Science and Nature last week, based on a consortium of over 450 researchers that essentially mapped at high resolution, using a lot of the technologies that I described earlier in the talk, the functional elements across the human genome. This was a massive undertaking, almost as massive as the original sequencing of the human genome. As you can see, this author list is kind of reminiscent of what we've seen with the Sloan Digital Sky Survey and other things, and the Higgs discovery and so on. So biology is really starting to get to the point where we're going to need to learn from you how to deal with publication issues and authorship and all those aspects as well. But I would say probably one third of these authors were computationally oriented. I think I'll end with that.

>>: Lots of questions. [indiscernible] 450 authors.

>> David Reiss: I don't actually know the logic that went into that. It's interesting -- the culture is different; author lists are generally ordered in a different way in biology than they are in astronomy or physics. These days, you typically have multiple first authors, and then the senior authors or corresponding authors are at the end of the author list. And I think the way that this collaboration worked was that this was the main publication, but there are 160 more detailed publications describing all of the results, and each of those publications has its own sub-team, I guess. But this is, I think, really the first foray into the survey world and into making this huge amount of data available to the rest of the world. They have a wide number of databases, and there's even an iPad app that I downloaded the other day for exploring all of this data.
And I think those are also things that we can learn from you guys.

>>: A comment and two questions. The comment is that it scares me when I hear from a field that's funded several orders of magnitude better than astronomy that we have much to learn from you. Two questions about the data mining. Clearly, biology data are vastly more complex and heterogeneous than astronomical data, let alone something so trivial as [indiscernible] physics. There's a high dimensionality problem. Is there a special effort in developing better algorithms that scale well with high dimensionality? We worry about tens of dimensions; you probably have tens of thousands of dimensions. And also visualization in high-dimensional spaces. And my second question is that it seems to me that this is more like a text mining problem than a numbers mining problem, because AGTC are letters and genes are more like words. So is that a fundamentally different kind of data mining than what we do, say, with large tables of numbers?

>> David Reiss: Yeah, those are all good questions, and I think they're all very valid points. To take on your last question: that issue was one of the things I really struggled with when I made the transition from astronomy to biology -- trying to understand the completely different set of statistical models that go into modeling genomic sequences, sequences of letters with fixed alphabets, aligning genomes, and so on and so forth. That aspect is definitely different, but I think there are other types of informatics -- I hinted at those in one of my slides -- that are going to be just as important, if not more so, and that are really essentially measurements. The domain dimensionality of those measurements is going to be along the genome, and trying to correlate the measurements with signatures that are in the genome is one of the things that we're struggling with right now.

>> Yan Xu: Three more questions. You go first, and then [indiscernible] you still have a question?

>>: Identical question.

>> Yan Xu: Oh, okay. Pepe, go, then.

>>: [indiscernible] real complexities, not so much [indiscernible] but in a community of thousands of people, thousands of different [indiscernible] completely different problems [indiscernible]. My idea is that in bioinformatics, the solution is [indiscernible] finding missing parts and so on. What is the level of complexity of the problems which you encounter in bioinformatics? In other words, [indiscernible] there's a very large variety of tools and [indiscernible] which are required.

>> David Reiss: Yeah, so I tried to give an idea of the different types of data that are involved in bioinformatics. So, for example, there are teams of researchers that are working solely on trying to segment and cluster images of cells in culture. And obviously, you can imagine that being a very complex domain of research, which is significantly different from trying to match genome sequences and identify the evolutionary tree, for example, from the genomic sequences of bacteria. And so there is a huge range of, as I see it, fundamentally different types of data that require different types of expertise and, in many cases -- and, you know, correct me if I'm wrong -- a different kind of background than the way I see astronomy data being.
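Returning to the earlier question about letters versus tables of numbers: one common bridge between the two views is to count fixed-length words (k-mers) over the A, C, G, T alphabet, which turns each sequence into an ordinary numeric feature vector. The sketch below uses made-up sequences and a small k purely for illustration; it is not a method from the talk.

    # Minimal sketch: k-mer counting turns "text-like" sequences into numeric vectors
    # that standard clustering or classification methods can consume.
    from collections import Counter
    from itertools import product

    def kmer_counts(sequence, k=3):
        """Count every overlapping k-mer in the sequence."""
        return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

    def kmer_vector(sequence, k=3):
        """Fixed-length numeric vector over all 4**k possible k-mers."""
        counts = kmer_counts(sequence, k)
        alphabet = "ACGT"
        return [counts.get("".join(p), 0) for p in product(alphabet, repeat=k)]

    seqs = ["ACGTACGTGG", "TTTACGTACG"]
    vectors = [kmer_vector(s, k=2) for s in seqs]
    print(len(vectors[0]), "features per sequence")   # 16 dinucleotide features
    print(vectors[0])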
So I think there does need to be, you know -- one of the issues is that sometimes you run into problems of communication just between the different computational methods. You have people who publish in proteomics journals because they analyze mass spectra of protein data, people who publish in IEEE symposium journals who are more associated with, for example, the imaging question, and then people who publish in journals devoted solely to bioinformatics who are more interested in the analysis of genomic sequencing data. So oftentimes, just the publication venues are different enough that you don't get as much cross-talk between those.

>>: Just this morning, someone forwarded me a message from a biologist arguing that ENCODE was a waste of money. The statement, as far as I could tell, is that biology is so diverse that, in contrast to astronomy, a dataset collected using a particular method is unlikely to meet the needs of many biologists. Do you have a comment on that? Do you think that's a valid statement?

>> David Reiss: I think that perspective depends on what your approach is. I do see some value in the ENCODE mapping, and essentially the way they sell it is that they're trying to identify potential regions of interest across the genome. Up until now, for about 98 percent of the genome we had no idea what it was doing; only about two to three percent of the genome is part of the coding sequences that make up genes. So what they were trying to do was identify the functional regions of the genome that are outside of these coding regions. And I think it goes a long way. It's not quite as helpful as just getting the human genome sequence was, but I think it does take us a long way, and people like myself will be using that data to help constrain our models in significant ways. And I think where it really comes in helpful is by integrating it with the data that's coming out of a particular lab for a particular domain, especially if it's associated with human disease research. It won't be helpful to a large number of biologists who work on bacteria or things like that, but --

>> Yan Xu: So thank you again, David.