>> Lisa Green: Good morning. My name's Lisa Green, and I'm from Science Commons. I'm really excited to be here today. I hope you are too. And if you're not, I'm sure you're going to be once we get started. We have a spectacular lineup of speakers. And the things we're going to be talking about today -- I mean, these are some of the most important ideas in science right now. And their impact goes well beyond even science. So I'm really excited, and I'm sure this is going to be a very stimulating and rewarding day. Before we get started, I'd like to give out some thank yous. And my first thank you goes to my co-organizer, Hope Leman. Many of you know Hope. If you don't know Hope, I encourage you to introduce yourself to her today and thank her for this day, because it's not hyperbole to say we wouldn't be here without Hope's work. So thank you very much, Hope. [applause]. Also, I want to thank our video blogger, Chris Pirillo. Chris has a tremendous following, and he's really helping us get the message out to as many people as possible. It's being livecast now, and it will be packaged so you can download and watch it again after the fact. And the biggest thank you of all goes to Microsoft Research and Lee Dirks. [applause]. So Microsoft Research is a long-time supporter and partner of Creative Commons and a source of innovation for nearly two decades. And speaking of who we wouldn't be here without, Microsoft Research is who really made this happen. Lee is who really made this happen. I would get e-mails from Lee at two in the morning and then again at 5:30 in the morning. So he had a lot of passion about making this happen, and we're really grateful. Later today -- well, pretty soon -- you'll hear Peter Murray-Rust speak a little bit about what's going on at Microsoft Research with his collaboration. But right now I'd like to bring up Lee and give him a very warm thank you from all of us. And he's going to tell us about what's going on at Microsoft Research. Lee. [applause].
>> Lee Dirks: Okay. So, yeah, I do send e-mails at weird hours, there's no doubt about that. But it's definitely Lisa and a lot of people from Microsoft Research that made this happen. It certainly wasn't me by myself by any stretch of the imagination. My name is Lee Dirks. I'm from Microsoft Research, specifically a team called External Research. I'm going to speak a little bit about that in a moment, some very brief comments. But first off, thank you very, very much for coming out bright and early on a Saturday morning. It's a tremendous day, so I'm very sorry to lock you in a room. But we have, in conjunction with Lisa, lined up an amazing, amazing group of speakers. So I drove all the way from Seattle to see them. Sorry. I don't know about you guys, but -- no, but many people have flown in. Obviously some of the speakers are here from the UK. So it's a tremendous occasion. And I think it's going to be very, very exciting. There are some things that were announced yesterday that I think will be delved into today that are going to be of interest to everyone. So it's an auspicious milestone. But I'll leave it to the speakers to chat about that. I did want to cover a few logistics for the day. First off, wireless. Everyone, if you don't have the wireless code, it's written up here. We also have slips of paper at the registration desk. So if you need to get access to the Internet, we should have no problem. If there's any problem, do let me know, track me down any time during the day and let me know.
Chris, as Lisa mentioned, is live streaming this, so that's available right now. In addition, Microsoft Research is going to be taping and capturing the entire day. We have a kind of relationship with the University of Washington and the ResearchChannel that we put a lot of our talks out on. So this will be made available after the fact. And we can send a link out. So this whole day will be captured, and you can definitely share that with everyone. So the speakers have all signed release forms, except the two I haven't been able to track down yet. So if you are a speaker who hasn't signed a form yet, come find me. But that will be available. We also will be having a reception at the end of the day. So from 5 p.m. until about 6:30, out in the atrium of Building 99, we'll have a wine and cheese reception. Everyone is welcome to join. Also at that time, we will be giving everyone free copies of The Fourth Paradigm, which is a book that Microsoft released in October that is related to the future of eScience. And it was, we're very proud to say, Microsoft's first Creative Commons-licensed publication. So all of the content is openly available, thanks obviously to all of the authors for making that available. But -- I do have a copy. I'll show you what it looks like. So everyone will be getting a copy of that. And yes, there's the Creative Commons license right there. So everyone will be getting copies of this. And we've got a commitment from John Wilbanks, who wrote one of the articles, that he'll do book signings. He said if people want my signature, I'm happy to give it to them. So pin him down, okay. The other kind of logistical point is obviously this is a Microsoft building. We're not normally open to the public on Saturday. So I think many of you have dealt with the security guard to get in. We need you to kind of stay in this area. This is a public area. But if you need anything, again, come to me or find the security guard if you get outside of the building or something like that -- the security guard should be there all day. But if you get locked out, knock and wave at him. And restrooms, hopefully you found them down that direction. There are going to be refreshments all day in the back. We have a box lunch for you and then again the reception. If you need beverages other than the ones that are provided here, outside of this hallway and in the back corner -- you can actually go out through either of these doors -- there's a refrigerator with a selection of beverages as well. So if you need a water or a Coke or a little caffeine to keep you up, you'll find that. Perhaps some of you are saying, wait a minute, what does Microsoft Research have to do with, or why are they interested in, Creative Commons or Science Commons? And so I wanted to give a little bit of context. Our legal team has been working very closely with Creative Commons for many years. So Tom Rubin and Lawrence Lessig go back a ways, and so we've had a long-standing relationship with Creative Commons overall. But mainly over the last three or four years, Microsoft Research and External Research has developed a relationship with Science Commons. And what I wanted to do is go ahead and give you some background. So Microsoft Research overall is a group of about 900 researchers; about 450 of them reside in this building, and the remainder sit in these locations around the world. Research, as Lisa made reference to, has been around for almost 20 years.
We'll be celebrating our 20th anniversary shortly. And we look at all areas of computer science, but specifically the group that we reside in is referred to as External Research. And so our team is very focused on applications of computer science, specifically applications of computer science in health and well-being, in core computer science, in Earth, energy, and environment, and then the area that I'm responsible for, which is education and scholarly communication. And so we look at the applications of computer science in this area. And our team is called External Research because our team doesn't do the research here; we do research in collaboration with external parties, most often with academics, and you'll see one or two examples of that over the course of the day, some of the projects that we have going on. But we've done a couple of partnerships with Creative Commons that I just want to reference specifically out of my group, one of which some of you might be familiar with. About two years ago we released an add-in for Microsoft Office which allows you to embed the Creative Commons -- let me go back to this for a second -- the Creative Commons license in either PowerPoint, Excel, or Word. This is actually one of our most popular downloads. We've had something like over half a million downloads of this add-in alone. Another one that we did about a year and a half ago was the ontology add-in. We originally did this work with Phil Bourne at the University of California, San Diego, and then we actually did some work with John Wilbanks and his team that is responsible for NeuroCommons. So this is an ontology add-in that allows you to import any ontology. Whoops. It's on a timer. Shouldn't be. Sorry about that. It allows you to import and embed an ontology into a Word document and have that travel with it, and then mark up specific words in XML and embed those tags in the document itself. So just a couple of examples of the engagements that we've had. What I would like to do now, just to give you a little bit of context -- what I'd like to do now is actually turn it over to my colleagues Stewart Tansley and Kris Tolle. And they're going to go into a little bit more detail about the Fourth Paradigm. And if you're not familiar with it, I think you'll find it an intriguing book. And I think again John's got an article in it, but there are a lot of similar concepts that I think we'll be addressing today that were addressed in this book. And so I'd like to hand it over to Kris and Stewart.
>> Kristin Tolle: Thanks, Lee. Those of us who -- Tony Hey, who is the actual main editor of this book and is also the vice president for External Research, likes to describe himself as somebody who practices management by walking around. Those of us that know Tony a lot better and work with him would actually say that Tony likes to first send out an e-mail -- let's go do crazy idea X -- and then he wanders around trying to find out why nobody responded. [laughter]. And that's pretty much actually how the Fourth Paradigm started out. He really wanted to create a book that would show how computer science was facilitating each of the pillars inside of his External Research team. And those pillars are health and well-being, scholarly communication, Earth, energy, and environment, and then core computer science. And the book itself is actually structured this way. It has these four different themes.
Now, there was one person who actually responded to the mail and said here's a list of authors I think would be really good, because she'd already been thinking about creating a similar book but localizing it to health and well-being, and that person actually happened to be me. And so once that mail hit his inbox you could almost hear the elephant-like footsteps pounding their way down to my office saying you think this is a good idea, we should do this, and of course I agreed. So a book was born. From the beginning, we had three tenets that we wanted to make sure that we hit. And the first tenet was that we wanted this book to honor Jim Gray and his memory and the good work that he did. Secondly, we wanted to illustrate how it was that computer science was transforming science and how it was dealing with the deluge of data that we deal with every day. And lastly, we wanted to make sure that this would be openly and freely available to everyone, because we thought Jim really would have wanted it to be that way. So since I already had my list, the next challenge was for us to go around to each of the other pillars and collect those lists. And then we got down to the real core business of writing a book. And I won't go into detail, because I'm sure many of you know what it's like to edit a book. It's a considerable amount of effort and a considerable amount of challenge. So along the way we picked up Stewart. And believe me, Stewart was a godsend. Not only was he helpful with the day-to-day stuff, but he also brought a very unique perspective to the book. He brought the perspective of: how do we get there from here? That's useful in the day-to-day work where you're having to collect papers, get them back, get them edited, send them back, go for review. But more importantly, he brought the perspective of how you would think of the book as a whole and then move that into the space where people could actually make it actionable. And so he really helped us see how we could take this book forward and make sure that it remained a living document, that it would be something that would be owned by the community. Stewart.
>> Stewart Tansley: Thank you, Kris. That's kind of you. We're Microsoft, too, right, we're Microsoft, too. So I've got my slides. I thought I'd show you some slides too. Thank you for the introduction, Kris. I'll press the right keyboard first. What is this book in terms of the gritty detail? It's 250 pages there. You've seen a copy. You'll get a copy later if you so wish. It's the first Creative Commons publication from Microsoft Research, which we're really proud of. As Lee said, it took us a while to get there. We've been working with Creative Commons for a number of years. But this has really been a breakthrough, and we hope to see other publications come from there, from Microsoft Research. It's an interesting collection of 26 papers, short technical papers, just 2,000 words on average. There was some editing involved.
>> Kristin Tolle: Some editing.
>> Stewart Tansley: There are about 70 leading practitioners from around the world. Many of their names I hope you'll recognize; a few you may not recognize, but we think that you will recognize them in the years to come. It's not all Microsoft -- there's a significant Microsoft contingent, but it's mostly not Microsoft people. It's published under Microsoft Research, but mostly it's about 45 authors, scientists and some computer scientists, from around the world.
The four themes -- I won't go into them again; we've highlighted that. And just to reiterate, our own External Research group is somewhat structured along similar lines, because it maps how we think this field is panning out. Similarly, Jim Gray, who together with Alex Szalay and Tony Hey really formulated this concept of the Fourth Paradigm. There are some hints about the other three paradigms, if you're not familiar with this, at the bottom of this page. It really was a testament to -- Jim inspired all of us who interacted with him. And it is a certain labor of love. Kris was very keen to even use that phrase in the book. We're very proud to represent Jim's legacy going forward, but as a living entity, not as something that is just a milestone. This is something that Jim would like to have seen going forward as an idea, a meme in the community. We launched it at the eScience Workshop -- I hope you're familiar with that workshop -- in October. And let me show you the cover. You've seen that. We've got that -- this one doesn't have the Creative Commons license on it. This was a prepublication version. But I assure you it is on the real one. I won't have time to go through all of the details, but here I do put up some of the papers. You see we have structured it in the four sections as described, and you'll recognize some of the names already there. Next page. I think John is highlighted on this page, bottom right. Yes, no?
>> Kristin Tolle: Yes.
>> Stewart Tansley: And so without further ado, that's what it looks like inside. I hope you enjoy the book. You can get to it from this URL. If you don't want to carry a heavy copy home, if you traveled a long way, it's downloadable, too. But it's printed nicely, so do take a copy for yourself. Okay. Thank you very much. [applause].
>> Lee Dirks: Very good. And with no further ado I -- well, one bit of ado. Sorry. I did want to pass along regrets from Tony Hey. Dr. Tony Hey was unable to be here today. It was something he was very passionate about; he very much intended and wanted to be here. Unfortunately he had this pesky event where he was being made a fellow of the AAAS today in San Diego. So he had a conflict. So we decided we'd let him go to that. So he definitely does wish he could be here and passes on his regards. So now with no further ado, I would like to hand the podium over to Dr. Cameron Neylon. And he will be speaking to us about science in the open: why do we need it, and how do we do it?
>> Cameron Neylon: Okay. So am I amplified yet? Or do I just need to talk loudly? So thank you again to Hope, Lisa, and Lee for the invitation to come; to Hope in particular for e-mailing me I think about once every 12 hours over a period of a couple weeks saying are you coming, have you sorted the logistics yet. And thank you to our hosts, and thank you all for coming. I put this slide up hopefully to somewhat frame the point of today's discussion. I should add you are also free to take notes, to think, and to disagree with me, and indeed to publish that. But while I'm thanking people, I want to thank a whole bunch of people. And this is, for those of you who have seen the slide before, actually updated now, so it's not quite as out of date as it used to be. If I have seen any distance at all, it is by standing on the blog posts, the tweets, and indeed on the formal publications of others. We would do well as scientists to remember that -- whoops. It's coming back.
We would do well to remember that even the biggest paradigm shifts, the biggest breakthroughs in science, are really a very thin veneer over what has come before. And sometimes a little humility perhaps might be effectively applied to the process of how we think about communicating science and how we manage it. And all the people up here are not necessarily people I've met. Some of them are people I've met online. Many of them are people I disagree with profoundly, but they are people who have influenced my thinking, and to a very large extent I can no longer tell which of the ideas I'm presenting are my ideas and which have come out of these. Think of this more as me filtering the things that I've seen. Though you shouldn't hold these people responsible for what I'm saying, obviously. Okay. So who am I? I live in Bath, which is a lovely place to live, and I work at a place called the Rutherford Appleton Lab, which happens to be about 60 miles away. So I have about a two-hour commute each day. And the organization I work for is called STFC. We're a UK infrastructure organization; however, we are also a research funder, so I need to put up this disclaimer saying that I'm not presenting any policy here on behalf of the organization, blah, blah, blah, blah. So I get up relatively early in the morning, unfortunately, and I catch a train, and then I get on a bus, and then eventually I get to work, usually about quarter to 9 in the morning. The things I work on could more or less be described as structural biology. So it's about trying to determine the structures of biological molecules, and in particular I'm interested in trying to solve the structures of assemblies of biological molecules using a range of techniques, but primarily small angle scattering. For those of you who are interested, I can witter on about that for several hours, but I will avoid doing so at this stage. I also get to do a lot of really, you know, cool experiments. We do some stuff with protein labeling and connecting proteins to other stuff, so I get to take cool pictures of fluorescent stuff, which is always a lot of fun. It's an interesting mixture of small lab work. So this is actually a picture of me in the lab. And also I work at a large facility, so we do experiments that involve the big iron of experimental facilities, but also we have some experience of the problems of handling and looking after data. As a scientist you spend an awful lot of your time reading, you spend an awful lot of your time in meetings, and probably too much time travelling. This is my second home: the departure terminal at Heathrow Terminal 5. Often I'm doing that to give a talk, which usually involves me preparing the talk at the last minute, often on a Saturday morning. Which I should say, for those of you following along at home, you can find the slides to a similar talk at slideshare.net/CameronNeylon, and they're not quite the same slides, but most of them are there. So I show a picture of myself in the lab, but often the students are actually quite keen to keep me out of the lab. This was not actually my fault, but these kinds of things do happen, and of course these lead to more meetings about safety and then more reading. But, you know, at the end of the day, I get on my bus and go home, unless of course I'm travelling somewhere. So the question one might ask about this kind of lifestyle, the lifestyle that many scientists choose to lead, is why. And there are a series of levels to this question.
The first perhaps is why does somebody actually pay for me to do this? Why are governments funding this kind of work? And in fact that's really the wrong kind of question, because it's not the governments that fund research, it's the wider community, the public. But of course we should really ban the use of the term "the public," because we are the public. So the question is: as the community of people who pay taxes, whether that's directly or indirectly, why do we think science research -- me getting to do cool stuff in the lab -- is something that's worth paying for? And the answers to that are a number of things. People are very keen on seeing medical advances, cures. Prestige is a big issue. This is a close-up of a Nobel Prize medal. Countries are actually very keen to get Nobel Prizes. The GDP of a small country can be significantly increased by the winning of a Nobel Prize, surprisingly enough. And we shouldn't forget just the idea of -- just the pure excitement, the idea that we can talk about exciting stuff like galaxies, like the origin of the universe, like how our biology actually works, and that is something that appeals to us. It appeals also to the community, and it particularly appeals to children, who may be the next generation of scientists. So there are a lot of reasons why as a community we fund research. Why do I do it? Why do I put myself through the rigmarole of running around a place, doing all of these things? Well, simple answers. I have a mortgage to pay. I need a job the same as anyone else. Many people have said that I'm far too curious for my own good, that I stick my nose in places where it's not really wanted and probably not really advisable to put it. And of course it can be fun. And again, as a researcher often stuck in meetings or stuck dealing with stuff that I don't really want to know about, it's worth thinking back to when this was just sort of an amazing thing to look at and just really cool. And, you know, I do get to do these things. I do get to come out and listen to the rest of today's speakers, which is just going to be really great fun. And that really comes to the core point that as a scientist this is a privilege. I have an immensely privileged life. I get a good salary to do stuff that I find interesting. It's certainly not a right. And so the question that I ask is how do I deliver the best on the public investment in my time? And I'm not going to get into an argument about metrics and how we measure things, because that would be another whole day of talks, and we probably wouldn't agree on the outcome. But that's perhaps not the point. I think the key thing is that this is not a right. As a scientist, I have an obligation to the people who fund me to do useful stuff. And if we can't talk about the details, then we can at least say that when we do this kind of thing we should be maximizing the value, the return on the public investment. I don't mean the economic return; I mean the things that I'm generating -- papers, results, drugs, media coverage that gets kids into science. These are the things we should be maximizing the delivery of for the amount of money that we have available to us, especially in a situation where that amount of money looks like it's going to be taking a bit of a nosedive over the next couple of years. That's easy to say. Less easy to know exactly how to do it. But I think there are some obvious answers. One is simply that we make sure that the science is available for people to build on. I said before this is a thin veneer.
And we need to leverage the ability of as many people as possible to build on it. I mean, I see this basically as a no-brainer. We need to make sure that the widest community possible has access to the results so they can build on them. And we can talk about how best to do that, and several of the speakers later in the day will talk about how best to do that. So I'm not really going to talk about open access to the formal published literature, because sometimes, and I would argue often, formal publication -- this process that we go through of traditional peer review prior to publication, of formatting -- is really overkill. You do not need a sledgehammer to take down a snowman. Though sometimes it's fun; particularly with the amount of snow we've had in the UK recently, it's been really a little bit difficult to cope with. But, again, that's a slightly different story. Let me give you a quick example of that. If you had done a Google search for the solubility of Boc-glycine in THF at 9 a.m. on the 4th of September 2008, you would have got some not-very-useful Google results, none of which really had the answer in them. Which is a little disappointing, because the day before that, Jean-Claude and I had actually been in the lab doing an experiment measuring the solubility of Boc-glycine in THF. It doesn't matter what that is; what matters is we did an experiment, we got a number out. But this was not available to the rest of the world. Except that when I did that search, I knew that Jean-Claude was sitting in his hotel room actually writing up the experiment and putting it online. I'm not going to talk about the details of this because Jean-Claude can do this much better than I can. The point is, when I did that same search the same evening, the answer is up there. We haven't been through peer review, we haven't gone through a process of waiting nine months to put a number in the public domain; it's just there, it's available. Now, I don't know whether the following morning some chemistry student somewhere in the world benefited from the fact that this number was suddenly available and it made it easier for them to do their experiment, but I do know that there was nothing gained by holding on to it for nine months. The point is the web makes publishing, in the sense of making public, extremely easy. And there are a lot of services, systems available for putting your wide variety of data, documents, and media on the web. And again, Tony -- [inaudible] -- Tony will no doubt be talking about one example of that later in the day. We can put this stuff on the web. We can put our lab notebook on the web. Now, inspired again by the work of Jean-Claude, this is my lab notebook; it is on the web, you can go and look at it. It goes up, it's available, the data's there, it's indexed by search engines. You might ask the question whether anyone looking at this can actually understand it, and that really raises an important question. So perhaps it's better to say of the web that broadcasting is easy -- putting material out so that people can look for it is easy -- but actually sharing it effectively is a much harder problem. Both because you have to make the choice to put that sign on your table, and because you have to make the choice and put in the work to put it in a form that other people can actually find and use. So I would argue, and many others have argued, John Wilbanks perhaps key amongst them, that the really important thing to focus on in all of this is interoperability.
It's making sure that I don't have to bring that ruddy adapt-a-plug which sparks every time I put it into an American socket whenever I come to the US. And certainly don't end up in this situation, which you do in various places in Europe, where the poles are fine but the size of the plug is wrong. We need technical interoperability -- and then we can talk about formats and vocabularies, and there are other people who are much better equipped to talk about that than I am, and that requires work. We need legal interoperability. We need the ability to be sure that we're allowed to use data, to use ideas, to use images for the repurposing that we want to do. And again, Creative Commons and Science Commons have done an awful lot of work on this, and the fundamental conclusion we come to in most cases is that we need to use very liberal licenses to make this work properly. It tends to involve putting things in the public domain or putting them under Creative Commons attribution licenses. And as Lee has alluded to, one of the things I'm really proud to be able to talk about today is the idea of trying to come up with principles, approaches, tick lists that make it possible for people to be sure they're sharing data effectively. So, the Panton Principles, which were published yesterday, and which, like all good things that come out of English academia, involved Peter, myself, and Rufus Pollock going to the pub and having an argument. And where this came out of -- so for those of you who don't know, Rufus Pollock is one of the founders of the Open Knowledge Foundation. The Open Knowledge Foundation is an important organization promoting open culture, open science, and open source software. And they have a slightly different perspective on the type of licenses that should be used than Science Commons does. And what was really important about this was that rather than trying to come up with a broad and overarching legal principle about what to do, what we did was focus on what we could agree on and what we thought other people might be able to agree on. So the idea here is that fundamentally, if you want to publish data, to publish science, in a way which is actually useful to other people, in a way in which they can reuse it, you need to be able to go through a series of almost tick lists to be able to do that. So this is not a statement about when you should publish or if you should publish; it only applies when you've decided to publish some data. So if you want to sell data, if you want to do that in a proprietary way, this doesn't apply. And we're not trying to cover this. I would say you're making a commercially silly decision if you do that, but that's for me to say. What we're talking about here is: when you decide to publish data, please do these four things. Be clear. Make a clear statement about what you want to do; make it absolutely explicit what you want people to do and whether there are things you don't want people to do. And the best way to do that is to use the legal instrument that actually applies to the stuff you're doing. So if you're producing data, please do not put a Creative Commons attribution license on it, because it's almost entirely useless. Use something that works. I should say these are the shortened versions, the sort of headline versions of the points. Do not use non-commercial terms.
There's a discussion about why that is, but effectively, using non-commercial terms obviously blocks commercial use of data, but it also blocks the use of this data to make money and return money back to the process of making more data. But this is really the absolutely key point. This is the point that we bring home. If you publish data -- if you decide to publish data -- place it explicitly in the public domain, particularly when it comes from public science. And this is really the key. And there are instruments, legal instruments, for doing this. So I encourage you to go to the website and look at the whole thing. If you agree, then I would also encourage you to sign up. If you disagree, if you think there are issues with what was said, please take part in the conversation. This was really an attempt to find common ground and find the things we can talk about. And there are going to be other issues about when to publish, how to publish, what sort of policy conditions there should be on that. And Peter will talk a little bit more about this later. So I've talked about technical interoperability very briefly, I've talked about legal interoperability, but I would argue that these are actually subsets of the whole thing. We need processes, and we need processes that actually allow these things to interoperate. All of this stuff about licenses and tick boxes and all of that -- that should all be taken away from the scientists. They should just have to make the choice -- do I want to make this open, yes, no -- and have it all taken care of for them, because they're busy people. I can tell you that for nothing. So we need systems that actually work with the existing processes that scientists are actually using, and we need systems that work with the people. I've said in the past, if you're using the user as your API, then something's going horribly wrong. So presenting the scientist with a tool like this is not incredibly helpful. And this is, to be fair, what a lot of the stuff out there actually looks like, when what they're really looking for is something like this: something straightforward, something simple, where it knows who they are and they know what they're doing. I just want to get on and do my experiment. That's the key thing. Tick the box, move on. And I would argue that what we need to do is capture the objects, the things that happen as part of the research process, and then add the structure on later, and provide the tools that help people to add that structure on. And what I mean by that is to map the processes that I use in the laboratory, at the computer, when I am doing things, onto agreed vocabularies, onto these interoperability things. But map them; don't insist that I use them when I'm doing the work, when I'm doing the stuff that I do, about which I hope I'm the expert. Map these processes onto those vocabularies when we tell the story, when we have a narrative that we want to fit this into. There's a real problem with a lot of these systems. They make the assumption that I know why I'm doing the experiment. And I can tell you most of the time I don't. Most of the time I don't know what the data's going to be used for. That's the whole point of making it available, so somebody can do something totally unexpected with it. Machines do structure; computational systems do structure. And they need structure, and we need the machines. The scale of the data that Lee has mentioned, that Stephen will talk about later in the day, is such that you cannot handle this with a pen and paper.
We need the machines to be able to do anything useful with this. But we don't do structure. We tell stories. So we need tools that capture the pieces of the research record -- the samples, the data, the little scraps of text that you wrote down on a piece of paper in the lab -- and tools that help us structure that and pull those down into a story when we write a paper, when we write a report, when we're doing a presentation, and tools that are actually aware of the structure that's already there, capturing and leveraging the structure that's already there as part of the process. So Lee mentioned the work of Microsoft in this space, and Peter will talk later about systems that do this in other places. I want to show an example of some fairly preliminary work that I've been doing. I know I shouldn't possibly be showing Google products in a Microsoft space, but that's what I work with, because I'm too dumb to work in C# basically. And so the point here is that I'm taking a notebook, I'm writing something about an experiment that I am doing, just typing it in. But what I'm doing is bringing in information from other places. These are just RSS feeds. So this robot, this system, is just grabbing information from this feed. And it's using it to populate a drop-down menu. So as I go along and I want to talk about what I'm doing, talk about the inputs, the ideas that led to this experiment, I can just select them, insert them, and the system creates the link. The system puts that information in and then my tweaks come up. You know, again, if I'm referring to the literature, I might have some literature online, and again, I can just create the link. Ideally if I've generated some data, or if I'm going to generate some data, that already is available somewhere online. Maybe it's a cure, maybe it's fully available. But again, it's being dumped somewhere without my intervention, and again, there's an RSS feed. So this is a beta product obviously. And in fact, this is about to crash. But data -- and an image could be data -- can be inserted automatically into the process. Again, this is a fairly crude example. But all I'm doing is typing away and inserting objects that are inputs, inserting objects that are outputs. And so the question is how do I then get the structured data out? I want the system to have captured what's happened. And so this has been automatically generated. This document, this thing which has an identity on the web, has inputs and it has outputs. And I've captured that information. Something's coming in; something's going out. Of course, if we're doing Semantic Web, then we should be generating that in Semantic Webby stuff. And I've generated this RDF automatically. Now, I faked this actually. The namespaces don't exist. I haven't put these up online. But the point is, this is a snippet of RDF generated automatically from my typed record of an experiment and selecting a couple of things from a drop-down menu. It knows who I am. It knows who the authors of the document are, and it put those in automatically. It knows what the inputs were. It knows what the outputs are. And those can be more sophisticated vocabulary terms which I might have selected from the drop-down menu. But we can capture what I was doing as I was doing it and automatically create a structured record which is then available for machines to do things with. There's a sketch below of what that kind of record could look like. So what can we do? We're talking about open source. Well, let's say we're actually technically able to share.
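To make the shape of that workflow concrete, here is a minimal sketch in Python of the pattern Cameron describes: pull candidate inputs from an RSS feed, pick one (standing in for the drop-down menu), and emit RDF describing the notebook entry, its author, its inputs, and its outputs. This is not his actual tool (which was built on Google products); the feed URL, the notebook URI, and the "labrecord" vocabulary are hypothetical placeholders -- as he says, the real namespaces for this kind of record don't exist yet. It assumes the feedparser and rdflib libraries.

import feedparser
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, DCTERMS

LAB = Namespace("http://example.org/labrecord/")  # hypothetical vocabulary

# 1. Harvest candidate inputs (samples, data files) from an RSS feed.
feed = feedparser.parse("http://example.org/lab/samples.rss")  # placeholder URL
candidates = [(e.title, e.link) for e in feed.entries]

# 2. In the real tool this list populates a drop-down menu; here we just
#    take the first entry, with a fallback in case the placeholder feed is empty.
title, link = candidates[0] if candidates else ("sample 1", "http://example.org/sample/1")

# 3. Build the structured record of the experiment as a side effect of writing it up.
g = Graph()
entry = URIRef("http://example.org/notebook/2010-02-20")  # this notebook page
g.add((entry, DC.title, Literal("Solubility measurement")))
g.add((entry, DC.creator, Literal("Cameron Neylon")))      # "it knows who I am"
g.add((entry, LAB.hasInput, URIRef(link)))                 # the selected input
g.add((entry, LAB.hasOutput, URIRef("http://example.org/data/run-42.csv")))
# An explicit public-domain dedication, per the Panton point made earlier.
g.add((entry, DCTERMS.license,
       URIRef("http://creativecommons.org/publicdomain/zero/1.0/")))

# 4. Serialize: a machine-readable trace of what came in and what went out.
print(g.serialize(format="turtle"))

The particular libraries are incidental; the point of the sketch is that the structure is captured while the scientist just types, and the RDF falls out automatically.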
And I would say, going out on a limb, that if we made the choice, we could choose to share the entire research record. We need to do a lot of building. We need to do quite a lot of work. And it would cost a bit of money. But we could do this today, if we chose to do it. The question is whether we choose to do it as a community -- both a community of researchers, specifically communities of scientists, and as a community that pays for this research, either directly through taxation or indirectly through products. Those are the choices we can make, and the answer that you fairly resoundingly get back at the moment is that people do not want to do this. The mainstream response that I usually get giving this talk at a fairly conventional conference looks something like this [laughter]. In fact, slightly more commonly, it looks something like this [laughter]. The pram has definitely -- sorry, the rattle has definitely been thrown out of the pram at this point. Which leads to a lot of these kinds of conversations. And I'll leave it as an exercise to the reader as to which one of these is the scientist and which one of these is the funder, or the member of the public, or indeed the institutional repository operator. So the question becomes how do we actually persuade? If we can make a case -- and I believe we can make a case -- that we can do science more efficiently, do research more effectively, make stuff more available, and that that would be a good thing, how do we persuade the community to do that? And to be honest, I'm actually not at all worried about this. And the reason I'm not worried is because of graphs like this. And this is, yeah, the reflexive, lazy example that everyone gives of the data deluge problem. But it's a fairly apposite one, and it's got some interesting more recent wrinkles on it. So these are submissions of DNA sequences to GenBank over the past 20 or so years. This is essentially exponential. The sharp-eyed amongst you will note that as it gets towards the last couple of years, it's no longer exponential. The reason for that is that 99.9 percent of the DNA sequence data generated in the last two years has not been put into GenBank, because it can't cope. This is a scale problem. And you could draw this graph for protein structures, you could draw it for astronomical data, you could draw it for chemical reactions, you could draw it for just about anything. Everything is scaling exponentially. And we're generating more data from the exponentially greater set of experiments that we're doing. And this creates a problem because of this graph. We're not getting smarter or faster -- the computers may be, but we're not. Which leaves us in a situation like this, where the average scientist, the person actually doing the research on the grant, is running faster and faster and faster in a fairly futile attempt to just catch up. The point is that the human scientist, the person who remains at the center of scientific research -- at least for the moment; the singularity is not upon us quite yet -- just doesn't scale. The only thing that really scales effectively in a technological world is the web. Governments do not scale. Policy generation from the top down does not scale. See under various current UK government acts, not to mention Australian government acts. Research groups don't scale either. You take the average research output of a research group of 50 people, it ain't even 10 times the average research output of a group of five people.
And network theory tells us there should be an exponent in there; it should be more than linear. Research groups do not effectively scale. You do not get more research out by just having bigger research groups. You do not get more research out by simply concentrating effort in a small number of places. The web scales by distribution. The web scales by exploiting network effects. Which means that just to survive, just to be able to keep up, a scientist is going to have to be web-native. Which means connected. Which means wiring yourself into a network that provides network effects, and doing that effectively, and doing it in a way that creates outputs. And that means sharing. This is not a new concept. It goes back, perhaps most eloquently, to Merton in the '60s and '70s and '80s. But it goes back to Bacon; it goes back to the beginning of the Royal Society. They were sending letters to each other as a way of describing the latest findings. It wasn't scaling. So what did they do? They created the journal. The scientific journal, Philosophical Transactions of the Royal Society, was the web of the 18th century. And it's not done a bad job for the last 300 years. It's just not cut out to deal with it anymore. You have your effect in science -- going back to that slide with the names on it -- you have your effect by letting other people build on your work. If you don't do that, you're not having an effect. And we're used to these networks in research; we're used to this concept of the journal, of networks of papers, which may be the only piece of data connected up at some level. But this work isn't just papers, if we're going to be trying to do this effectively; it's the images, the ideas, the thoughts, the presentations, the lab notebooks. All of these things build a network that, if we can build it effectively, will give us the network effects which will let us do science more effectively. And you can choose whether or not to make these things available. But if you choose not to make these things available on the network, then you're not connected. If you're not connected, you don't exist. It doesn't matter how good this idea is if no one knows about it. When was the last time any of you actually cracked open a leather-bound paper version of an encyclopedia? And how many of you haven't done a web search in the last 24 hours? Where are people going for information? So it's open content that builds this network, that will allow us to make it interoperable and make it effective. The network is the only way that scientists are going to be able to keep up and to be able to function effectively in 21st century science, in 21st century research more generally. If we build these tools that help researchers to manage and build these networks, then I think the rest of it just follows from pure competitive interactions. People want to be at the top of their game; they're going to have to do this to be at the top of the game. So I need to thank a number of people for contributions to this talk, specifically in terms of images and inspiration for how I've given it. Thank you for coming. And I'm happy to answer any questions. [applause].
>> Lee Dirks: We have five minutes for questions, if any.
>> Cameron Neylon: Yes?
>>: How do we filter out the garbage? I mean, not all data is created equal.
>> Cameron Neylon: Certainly not all data is created equal.
>>: Could you repeat the question?
>> Cameron Neylon: Yes. Sorry. So the question was how do we filter out the garbage.
And this is a general problem. It's not restricted to research by any means. There's an awful lot of garbage out there. The bottom line, at least at the moment, is that web search tools do a pretty good job at some level of finding stuff that's well connected, finding stuff that other people are referring to. And that is why you need to have the actual objects exposed on the web; if people don't link to them, in the way that we don't at the moment for research, those kinds of PageRank-style mechanisms do not work. That's level one. If we cite the research objects properly, then Google will do some of the job for us. The second level is building better social networks that then start to help us with filtering. So these networks of data, of objects, of thoughts are equally networks of people. And so I don't know whether anyone's actually going to mention this today, but there are tools available that help people filter other people's content. The one that I know a number of us are very fond of is a tool called FriendFeed, where I can bring a bunch of content in -- that's fine, I'm saying this might be of interest to people -- but what matters is whether other people interact with that content, make comments on it, these kinds of things. And that pushes those objects up to the top of the pile. So there are the beginnings of an idea of how we can effectively socially filter content as well. And then there are all the questions of do you just end up with an echo chamber, do you just end up with self-reinforcement? And that's again why I put up on that first slide a number of people I violently disagree with. Because they challenge you to rethink the reflexive, easy stuff. So filtering is not an easy problem to solve. But you can't build a filter without having the stuff there that you want to filter first, to test it against. Yes?
>>: Are there search engines that generate a web of results as opposed to a list?
>> Cameron Neylon: The question was, are there search engines that generate a web of results rather than a list. That's a really interesting question. I've never had that question asked quite that way before. Not at the top level in terms of -- I mean, in a sense also [inaudible] because they provide you with hyperlinks, and those hyperlinks -- so -- and I have seen some quite clever visualizations of search results. The major problem with that being that Google doesn't let you actually have an API onto the search results. I'm not sure whether Bing does or not. But either way, because those are very proprietary outputs, it's difficult -- this is a classic example of the genre -- it's difficult to build the tool over the data, because you're not legally sure what you're allowed to do with the data. I'm sure there are people working on those kinds of things. I mean, I know there are people working on search visualization very, very hard. But I couldn't give you an explicit example of it. Go.
>>: What it does bring to mind is BiomedExperts: if you do a coauthor search, the graph that comes up -- it's JavaScript -- is really very useful when you're trying, for instance, to figure out who should review a paper. If you're an editor you can use that graph to find out if the suggested reviewer has published, and you can see clusters of authors and try to pick one from each cluster, so you've got a representation of the field in your referees. It's really very useful. That's the best one I've seen.
>>: What is it again?
>>: [inaudible].
>> Cameron Neylon: BiomedExperts, which is -- is it Thomson? I never quite remember if it's a separate company.
>>: I have no idea.
>> Cameron Neylon: It uses the underlying data of the co-citation network -- it basically generates the co-authorship network of scientific authors based on the published literature. And one of the main things it displays is, you look for a person and then it displays a network of co-authorships with that person. So essentially it's kind of what you were asking about, but for a person rather than for the research objects. Which is kind of the wrong way around. Because if you take Jeff Jonas's and Jon Udell's sound bite seriously -- data finds data, then people find people -- it's kind of the wrong way around for research, but it's a start in the right direction.
>>: [inaudible].
>> Cameron Neylon: Sometimes you just want people, that's true.
>> Lee Dirks: All right. Thank you very much, Cameron. [applause].
>> Lee Dirks: All right, everyone, what I'd like to do now is hand the floor to Jean-Claude Bradley to talk to us about Open Notebook Science.
>> Jean-Claude Bradley: Thank you. So thanks very much for the invitation. Thanks to Hope and Lisa. You guys did a great job in setting this up. What I'd like to do is to follow up on what Cameron was discussing in terms of why it is that we need openness. And I'd like to take a pretty concrete example of that and show you in a chemistry application what kind of openness currently exists and what's actually possible. Okay? So I'm going to be talking about Open Notebook Science with free hosted tools. And these are the issues on which I'd like to make a case for Open Notebook Science. The concept is very simple. At least in chemistry, if you're doing chemistry experimentally, you have a lab notebook. That lab notebook is typically an extremely private document, something that nobody else will see, something that probably no one will read when you leave the lab. And there's a lot of information in there. And the question is, what if we make that notebook publicly available -- does that help? So I'm going to try to make a case that it does, on these various levels. So first of all, if our current system is working very well, then, you know, what's the motivation for doing this? And I'll show you a few examples of where the system really isn't working very well in chemistry. Is Open Notebook Science difficult to implement? I'll show you that there's at least one way of doing it that's, you know, free and fairly simple to do. Does it prevent peer-reviewed publication? No. I'll show you an example, although it will be qualified, which I'll go through shortly. Can you discover the data? As Cameron was saying, you know, if you put it out there and people don't find it, it's not very useful. So there are ways of putting the data out there so that people can find it, even if they don't already know about your project. And that's really important. Can the information be usefully archived and cited? I'll show you some pretty recent work where I think we have a pretty good system for archiving. And citation -- we've been able to cite our lab notebook pages, and that's worked out. And finally, is ONS compatible with IP protection? Mainly no, but there's a small exception to that that you might find interesting. So how bad is our current system? Well, I'm picking an example here as a chemist that I think most of you can relate to. If you're familiar with the concept of solubility -- how much sugar goes into coffee -- you can only put so much.
So there's a number, there's a certain amount that you can put in. And it's such a simple measurement that you would think it would be very easy to find, right? So EGCG is actually the antioxidant in green tea. Okay. So it's a compound of tremendous interest. There are lots of researchers that are doing things with it. So if you wanted to start to work with this material, you would probably want to find its solubility to see what kind of solutions you could make. So if you use our current scholarly communications -- you go to the peer-reviewed literature, you go on [inaudible] or use CAS to find information -- you'll find this paper that says the solubility is 21.7 grams per litre. That's actually an enormous amount. The number itself really doesn't make sense. And what's really interesting is that this actually went through peer review. So the people who reviewed this paper didn't think that was a problem. But luckily there is a citation, okay. So if you take the first paper and go down to the citation, you'll see that actually it was a misprint. In the original article it was 5 grams per litre, and the solubility of caffeine, 21.7, was accidentally put at the end of the number 5. Now, the issue here is, okay, I have the number 5 -- now where did this come from? Unfortunately, I could not find a reference for this number. Okay? So we keep searching. If we go to Sigma-Aldrich, it's a very popular source of chemicals, and it also has a very good reputation for having good data. So if you want to know the density of a compound or if you want to know some kind of practical property, it's usually pretty good. So for this particular compound it says that you can make a solution at 5 milligrams per millilitre, which is 5 grams per litre. It doesn't actually say that's the solubility of the material. It doesn't say it's the maximum solubility. It just says you can make a solution at that number. Okay? So maybe that's where the number came from and it got misinterpreted. So we keep doing more searching in the peer-reviewed literature, and we find another paper that says that the maximum solubility is 2.3 grams per litre. So this is actually troubling as a chemist, because I have two of what are typically going to be good data sources: I have a peer-reviewed paper, and I have a number from a company catalog, a company that I trust very much. So how do we make sense of this? Well, for the company catalog, you're completely out of luck, because there's no information at all about how the number was obtained. There's no reference. It's just a number. And if it's a typo, you have no idea. Now, we get a little bit further with that last paper, because they do have an experimental section where they describe, you know, how they actually did the experiment. And this is really the best that you can do in chemistry right now in terms of finding out how a researcher actually carried out their experiment. But this is not the lab notebook, okay? This contains summarized information; it contains a level of abstraction -- what actually happened when they tried to do it is a little bit more complicated. For example, here they say they sonicated this, but they don't say the power. And then they diluted and filtered. They don't really say how they filtered it. So I can see some reasons here why the number might not match the company catalog. But I don't have enough information to be able to assess which number is more likely to be correct.
So the reality is, if I want to know the solubility of this thing, it's probably easier for me to just do it, because the literature isn't helping that much. Okay. Here's a second example, where we actually will show some notebook information being produced: the sodium hydride oxidation controversy. How many of you have heard about this? Some of the few people related to chemistry here. This is actually a really interesting story. You don't really need to understand chemistry to understand how important this is, but this is something that most chemists would think is impossible to achieve. And it was published in a highly prestigious journal. And so it did generate a lot of controversy, okay? So this was a paper that claimed to do something that most chemists would say is impossible. Now, the way that I found out about it was through the blogosphere. As Cameron mentioned, FriendFeed is the aggregator that I prefer to use as well; it's very, very efficient. And you see people that basically just start to make comments about this. And then something really interesting happened. People actually started to try to reproduce the experiments, but they also provided the raw data so that people could evaluate whether or not what they were saying, you know, was consistent. So the Totally Synthetic blog tried to repeat one of the experiments and got a 15 percent yield. But again, very importantly, they published the NMR on their blog. So this information is critical for being able to ascertain whether or not the 15 percent is a typo. If you're a chemist, you can go in and you can actually see that 15 percent from the plot. So that's kind of interesting. Now, the 15 percent was much, much lower than what the researchers published originally. So there is still a conversion, but why is it so different? So I was talking to my own students about it, and we were thinking about trying to repeat this. And so again, this is a, you know, experimental section in a chemistry journal, a peer-reviewed chemistry journal. It does have information, but it doesn't have lab notebook information. So these are some pretty general terms. I don't know how they monitored the reaction. I don't know exactly what they did. I have a rough idea, of course, but, you know, I can't try to reproduce what they did and then see as I'm going through if everything is matching up. So the best we can do is just, you know, try to take a shot at this. So my grad student Khalid and my undergrad Marshall Moritz basically just tried to repeat this. And we also posted the raw NMR data directly. And in this case, it wasn't a blog post; it was on a wiki, as I'll show you shortly. And we found actually zero percent conversion, okay? So we're getting really wildly divergent results here. And again, the blogosphere comes to the rescue. Someone found a paper from 1965 where this effect had actually been previously reported. And it turns out that it's due to the particular material: sodium hydride can form a layer at its surface that completely changes its chemistry. And so this is actually really useful information, because this reagent is used by a lot of chemists, and even if you're not trying to repeat this, you want to look out for possible side reactions. So this is great, and this all got sorted out. But the final result, if you go back to the journal where this was published, all you find is that the paper was retracted, and there's no reason given.
So all of this information, all the knowledge that was gained by sharing, is still there. All right? And you'll find it very easily; just do a Google search for sodium hydride oxidation. You'll find our experiment at the top. The second result is the explanation. And the third one is the second open notebook attempt. And so this is really what Cameron was talking about: the information can be found, and this is how people are likely to look for it if they want to learn more about this. So it's interesting to see the publisher's stance on this. You know, there's a lot of useful information, but it's not being shared. So a third example that I think is particularly fascinating is Alexander Graham Bell's notebook. This is a recent book. Seth Shulman was basically interested in the notebook; he wanted to see when the telephone was invented in Bell's notebook. So he actually looked at it and didn't find the invention before the submission of the patent. And he ended up writing a book about the whole sordid affair: it's likely that Bell actually stole the telephone from Elisha Gray by visiting the patent office on that day. And the fact that he did it for love is a very interesting twist to the story. So I would strongly recommend this. And it basically shows what can happen when a lab notebook is simply made available. By the way, the notebook wasn't available to the public until, I think, 1990, which is really why people hadn't looked at it in detail. And I think since 1999 it has been on the web for free. So again, something immense that we all thought we knew about isn't quite what it seemed. Okay. So what I'm talking about is Open Notebook Science. And if you want to see more examples, or articles that have been written about this, Wikipedia is a pretty good place to start. So what I've been talking about so far is Open Notebook Science in the sense that we make all of our information available immediately, okay, and that's how I'll be talking about it here. But it turns out that you might not want to expose your work in quite that way, so with Andy Lang and Shirley Wu we developed these logos to explicitly express what it is that people can expect when they're looking at your notebook. So the top one is what I've been discussing: all content, immediate. And, Peter, we were discussing before the talk that you could have all content but delayed, either because of a publication you're waiting on or possibly an intellectual property issue. So there are different ways of doing it. But I think it's important to be explicit about what you're doing. Because if you're doing the top one, what it means is that people are looking at your notebook, and if they don't find something, they can assume that you haven't done it, and then they can go and possibly do it or make decisions based on that. So it is important to be clear about what you're doing. Okay. So this is really sort of a philosophy. You know, the question was brought up earlier about how you can trust data -- data is not all created equal, and I think a big problem is the way in which it is presented, right? We try to tell our students that there are certain things that are facts, but there really aren't.
There are certain things, like the melting point of water or the boiling point of water, that have been measured so many times by so many people that you can use them as facts. But the reality is that most measurements in science, and certainly in chemistry, may have only been made once or twice. And I just showed you an example of how hard it is to actually evaluate these data points. Okay. So what we want to do with Open Notebook Science is maintain the integrity of the data provenance by making all the assumptions explicit, so each person can evaluate the source of the information as they wish. So we're moving away from an environment of trust to one of proof. Okay. So the point is, if you see two data points, it shouldn't really matter where they came from, whether from the most prestigious journal or from something you found on Google with no idea who the person is. If you can see the evidence they're providing, that's all you need to be able to assess that particular data point. Okay? So I want to go through a very specific example. This talk is about using free and hosted tools to do this kind of stuff, so I'll be showing you exactly which kinds of tools. Here is a table. Again, I don't want to focus too much on the chemistry, but just note that as a chemist you see that there's a trend there and there's one number that's totally out of whack. So if you were to see this in a traditional paper, there's not much that you could really do about it: is the number a typo, or is the number really something that's deviant? You usually have no way of drilling down and actually finding out where it comes from. So in an open notebook, what you do -- and we use a wiki, specifically Wikispaces, which is a very nice free hosted service -- is record the log of what the students actually did. So we can investigate here that at exactly this time point the samples were vortexed, but the exact amount of time was not recorded. Remember, I was talking about assumptions being explicit? Often it's very useful to know what you don't know. And in this case the students just didn't measure it. And it turns out that's actually important. So how are you going to find out? Well, you're going to redo this experiment, but now you're going to record this. And that's really how science can get much better as you keep going through these iterations. Okay. So you can make all kinds of things available. You can also make the rationale, the findings, explicit. So you could make a statement but then link to various pieces of the puzzle, various parts of the raw data that support your statement. That doesn't mean everyone will agree with you, but that's okay as long as they can debate the raw data. So there's all kinds of raw data these days. We've used images, and we've used short videos that we upload to YouTube. This is actually a very convenient way of doing it: take a 15-second YouTube video of an experimental setup, and it saves the student from having to write a long paragraph about how they did things. But more importantly, the video doesn't hide things that the students would forget to write down, like where the thermometer was or exactly what was going on. So this is actually a very efficient way of doing it. Now, we make very extensive use of Google spreadsheets. And so we're reporting solubilities in this case. We have numbers. But in a Google spreadsheet these are not just numbers.
If you click on these cells you'll actually see the formulas that are used to calculate all these numbers. And oftentimes when a number doesn't make sense, if you go back to the Google spreadsheet you'll find that the student made a mistake in the calculation. And that's why it's important not just to have the numbers but to be able to track how the numbers were converted. Google spreadsheets are sort of like a wiki in the sense that they have different versions. So if you want to see whether an error was corrected, or whether a student made a change of any kind, you can pretty easily just go back to a previous version. The wiki where we actually write up our experiments works exactly the same way. You can hit the revision history and see who made the change and exactly what the change was by comparing two versions. As a specific implementation I really like Wikispaces, because when you compare two versions the new text is in green and the deleted text is in red. So I interact with my students a lot this way. As soon as I see them recording something about an experiment they did, I'll go in and make comments, usually in bold, and then they can respond. So it's a way of interacting very, very quickly with students who are in the lab. Because experiments are typically pretty complicated, you want to make sure you have all the information and that it's analyzed correctly. So I really like a wiki for that. Now, another pet peeve of mine in traditional chemistry publication is these NMR spectra that I've been talking about. These are normally stored as PDFs. So even in the supplementary material, when you download them, it's really just an image. And you can't blow this up, okay, because the information's not there. But it turns out that when you expand these peaks there's a lot of useful information in the impurities -- a lot of useful information that is simply, totally absent once you convert it into a PDF. So we use an open format called JCAMP-DX, and we use the open-source viewer JSpecView, which was developed by Robert Lancashire. This gives us a web interface, so people come in with a browser; they don't have to know what's running underneath. They just take their mouse, expand peaks, and interact with the data in a way that's much more useful. So we try to leverage as much as possible what's out there. Tony is here; he'll be talking about ChemSpider. ChemSpider is just a way of keeping track of molecules, manipulating them, searching them. And we can also upload our data directly to ChemSpider. Okay? So like I said, we can upload spectra. Now, the interesting thing about this is that when you try to upload a spectrum, ChemSpider asks you whether or not you want it to be open. And if you make it open, there are a lot of interesting consequences. One consequence that we didn't foresee at the time is that if it's open, we can use all the spectra to build a game, for example. So Andy Lang, Tony Williams, Robert Lancashire -- we all came together and collaborated on this project. We now have a game into which every NMR spectrum uploaded to ChemSpider that's marked as open data automatically goes.
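Going back to the JCAMP-DX point for a moment, here is a minimal sketch, in Python, of what reading that format might look like. JCAMP-DX files are line-oriented, with labelled data records of the form ##LABEL= value; the file name here is hypothetical, and a real tool like JSpecView also decodes the compressed data forms that the standard allows, which this sketch does not attempt.

```python
# A minimal sketch of pulling the labelled data records (LDRs) out of a
# JCAMP-DX file -- exactly the metadata that is lost when a spectrum is
# flattened into a PDF image. The file name is hypothetical.

def read_jcamp_headers(path):
    """Return a dict mapping each ##LABEL to its value for a JCAMP-DX file."""
    headers = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("##"):
                label, _, value = line[2:].partition("=")
                headers[label.strip().upper()] = value.strip()
    return headers

h = read_jcamp_headers("exp203_hnmr.jdx")  # hypothetical file name
# Records like TITLE, DATA TYPE, XUNITS, FIRSTX, LASTX and NPOINTS
# describe exactly how the trace was recorded.
for key in ("TITLE", "DATA TYPE", "XUNITS", "FIRSTX", "LASTX", "NPOINTS"):
    print(key, "=", h.get(key))
```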
Okay. So I want to go through more of these examples. This is an example of a chemical reaction that we do where we mix components together, and sometimes we get a precipitate. So what we want to do is to try to understand whether we can predict when this is going to happen, okay? So that's the connection to the solubility data that I was talking about. And what we decided to do was to try an interesting crowdsourcing approach, where anyone in the world can come in and contribute a solubility measurement in a non-aqueous solvent -- so, any solvent that's not water. We got some funding from Submeta; they funded ten $500 awards. We got some chemicals donated by Sigma-Aldrich. Nature contributed some magazine subscriptions for the winners. And the concept was very simple: submit your measurements, put them in an open notebook, and we will basically judge them. We've just completed the first round, where all the awards were made. These are all students, either graduate or undergrad, and that, I think, was a very good experience for them. We had six judges, many of whom are actually in the room here -- Bill is here; Tony and Cameron are here. And basically these judges would interact with the students on the wiki just as I was describing: they would make a comment and then the students would respond, or not. We didn't award the prizes to the students who made the most measurements. We awarded the prizes to the students who were the most responsible scientists, who actually interacted and responded. And so I'm very happy with the way that was designed, because it wasn't just number crunching. Okay. And we had other teachers actually use this in their own labs, which is kind of an interesting approach in a teaching lab, to try to get students to contribute to science. Now, I talked to you about searching. So, all right, this is how we put the stuff up there. That's all great. But how are you going to find it? Well, if you're part of the project, this is a common way of finding it: a sort of table of contents that has all the experiment numbers, a brief description, and who did it. So if you already know about the project, this is probably not a bad way to do it. But that's obviously not how most people are going to find the information, right, because they don't know about it. So what we do is we have another Google spreadsheet that aggregates all of the results from all the experiments. So over that year we have, I think, around 700 different measurements, and they're all in one Google spreadsheet. And the nice thing about this is that there's a nice API that enables you to query the Google spreadsheet like a real database. And we can do things like this. So people like Rajarshi Guha can come in and collaborate -- because the project's open, right, they can find it, they can collaborate with us and make their work open -- and then we can have drop-downs, for example, where you search for vanillin in any solvent. And then it gives you a little table showing you all the different measurements. And you can click on these and actually end up looking at the actual lab notebook pages. Okay?
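As a rough sketch of what querying a sheet like a database can look like -- the spreadsheet key and column letters here are hypothetical, and this uses the current form of the Google Visualization API query URL rather than whatever our actual code calls -- you can send an SQL-like query to a public spreadsheet and get CSV back:

```python
# A sketch of querying a public Google spreadsheet like a database via
# the Google Visualization API. The sheet key and the column layout
# (A = solute, B = solvent, C = concentration) are hypothetical.
import csv
import io
import urllib.parse
import urllib.request

SHEET_KEY = "HYPOTHETICAL_SHEET_KEY"

def query_measurements(solute):
    tq = f"select A, B, C where A = '{solute}'"
    url = (f"https://docs.google.com/spreadsheets/d/{SHEET_KEY}"
           "/gviz/tq?tqx=out:csv&tq=" + urllib.parse.quote(tq))
    with urllib.request.urlopen(url) as resp:
        return list(csv.reader(io.StringIO(resp.read().decode("utf-8"))))

# First row is the column headers; each later row is one measurement,
# which in the real sheet links back to a lab notebook page.
for row in query_measurements("vanillin"):
    print(row)
```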
So how can the scientific process become more automated? One of the longer-term benefits that I can see -- and Cameron sort of introduced this concept earlier -- is that it's very important for machines to be able to understand the information just as much as humans. And we're at a very interesting time right now where, if we make the information available in a format readable by both humans and machines, they can start to collaborate with each other. And so one of the things we're trying to do is basically have bots interact with our data and make a useful contribution. So a quick example of this involves the NMR spectra we've been talking about. Normally a chemist would manually read them and make the calculations. But we actually have code that automatically fetches the spectrum from the Google spreadsheet as a web service, integrates it, and then does the calculation and returns the final value right here -- the final solubility value. This has been very helpful, because the students will make mistakes. That's going to happen; it's just a question of how easy it is to find those mistakes. And if you have bots that can go in and take up a lot of the drudgery, a lot of the steps where it's very easy to make mistakes, then they can contribute that, and the humans can contribute what they do very well. So if you look at how the information is represented and the APIs that we have, we really want to move towards more and more automation.
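To make the bot idea concrete, here is a small sketch of the kind of checking a bot can do. This is not our actual code, and the column names and tolerance are hypothetical: it recomputes a solubility from raw mass, volume and molecular weight columns and flags rows where the recorded value disagrees.

```python
# A sketch of a checking bot (hypothetical column names, not the actual
# project code): recompute molar solubility from raw columns in an
# exported summary sheet and flag rows that disagree beyond 1 percent.
import csv

def check_rows(path):
    with open(path, newline="") as f:
        # start=2 because row 1 of the sheet holds the column headers
        for i, row in enumerate(csv.DictReader(f), start=2):
            mass_g = float(row["mass_g"])          # hypothetical column
            volume_l = float(row["volume_l"])      # hypothetical column
            mol_weight = float(row["mol_weight"])  # hypothetical column
            recorded = float(row["solubility_m"])  # hypothetical column
            recomputed = mass_g / mol_weight / volume_l  # mol per litre
            if abs(recomputed - recorded) > 0.01 * recorded:
                print(f"row {i}: recorded {recorded}, "
                      f"recomputed {recomputed:.3f}")

check_rows("solubility_summary.csv")  # e.g. a downloaded sheet export
```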
Okay. So the last part of this, in terms of what we're going to do with the data, is that we've started to build models. So if you're looking for a solubility measurement, it might have been measured; but if it hasn't been, we're trying to come up with ways of predicting the number for you. So if you want to do a reaction that hasn't been done, maybe you can take a guess at what a good solvent might be. That's something we're currently working on. Now, one criticism of this project is often that we're just empirically collecting numbers and putting them together. But actually there's some good science that we've discovered from this as well. Because we're using this NMR method, we can actually see all of the chemicals that are being produced, if any, during the making of the solution. And what we found by accident is that some compounds like this one, in alcoholic solvents like methanol, actually form what's called a hemiacetal very quickly. And in this case, the solubility of this compound was actually reported in 1982, and they did not find that there was a chemical reaction between the solvent and the solute. Maybe the information was in there, but not enough detail was provided for us to go back and see whether they missed it or whether the method they used wasn't going to detect it anyway. So this is kind of interesting. And it turns out this is pretty general for a whole class of compounds. So solubilities that have been reported in the past are not really solubilities; they're actually reactions. Okay. So, finding the data. I showed you the Google spreadsheets where we store everything. But again, if you don't know that the spreadsheet exists, how are you likely to find it? We get about 100 queries a day on specific solubility requests, and most of them come from a Google search or a Wikipedia search. On Wikipedia, if you look up the molecules that we've worked on, you'll see there's this chem info box. And we have the reported solubility, and then there's a link. You'll notice that none of the other properties in Wikipedia typically have links. So what we're trying to do, again, is to bring in that concept of proof as opposed to trust, okay? So if you want to find out where these numbers came from, click on that link. It takes you to a table of all the different solvents. Clicking on one of these shows you the individual measurements. So different experiments provided different numbers for that same solute and solvent. You can then drill down again: click on one of these, it takes you to the lab notebook page, and then you click through to the associated Google spreadsheet that has all the calculations and everything. So if you want to use the number quickly, that's great. If you find there's something wrong with the number, you can go in and try to see what it's based on. Okay. So a couple more issues. How does this affect publication? Well, because you're making all of this available in real time, some publishers will consider that a preprint, and if their preprint policy is restrictive or nonexistent, then you won't be able to go to that journal. But it turns out that there are enough journals out there that are peer reviewed and that will accept preprints. So we've gone through and not only used the lab notebook for a particular project, but actually written the paper on a public wiki. So all of the drafts are available as well. And the idea here is that at all times the world can know exactly what our state of knowledge on that particular project is. Now, this, of course, is a classic preprint; it just happens to be on a wiki. It's just the paper that's already been made public. And again, I was talking about citation in my very first slide. This is a very convenient thing: if your lab notebook is public, you can actually use a specific page as a citation. Okay? So you'll notice here that the melting point, for example, was taken from experiment 99, whereas the NMR was taken from experiment 203. Normally in a paper you don't see that distinction. But it turns out that these are different batches that were made, and maybe they're not identical. So if you're trying to find out why your melting point doesn't correspond, you would have to look at the specific batch where that compound was made to see if there might be an issue. So we published this in the Journal of Visualized Experiments, okay? And this one actually has a video component as well. But that's sort of beside the point. There's also the actual text that we wrote on the wiki. And another tool that we made use of was Nature Precedings. So while the paper was under peer review, we submitted it to Nature Precedings, and this has the advantage that it has a DOI, it has a standard author list, and it's archived by Nature, so it is something that I've found does tend to get cited. And you have nothing to lose if the journal accepts preprints. Now, the Journal of Visualized Experiments is open access. And we've really grown to value this tremendously, because you retain the copyright, and that lets you repurpose that exact same content for whatever purpose you want. Tony's going to show this a little bit later, but you can take that same paper and, through the ChemSpider journal, it will automatically find the molecules, and when you hover over them, it will show you the image. Okay? So if we had published in a non-open-access journal, we'd have had some issues getting permission to actually repurpose it. And there are opportunities that just keep coming and coming when you retain that copyright. So this same paper was turned into an application note with [inaudible], because we had borrowed their robot to actually do the experiment. So again, you have this way of redundantly distributing your message. Okay.
And here's where a little issue comes up, because if you're repurposing the same content to try to reach a wider audience, it's going to affect the number of times each one of those sources is actually viewed. So while I'm a huge fan of article-level metrics, I think you have to be careful how you interpret your success or failure based on them. Okay? The number gives you some very useful information, but don't stake your self-esteem on any specific article-level metric. Okay. So there are other approaches to these open notebooks. Cameron showed briefly that he uses a modified blog engine. Steve Koch is here; his students use OpenWetware, which is another wiki system. Okay? So there are other approaches, and if you're interested in learning more about them, we can discuss them with you. So the final piece of this is what we do in terms of archiving the material, having a place to cite it, taking a snapshot. I think there's a huge opportunity right now for libraries in how they manage this kind of information. What we've been doing recently is coming up with a way of archiving these notebooks so that people can actually see what the state of knowledge was. People often ask why we don't use the Wayback Machine. Basically, it doesn't work very well for our project. I don't know if you've ever tried to look at some of your own pages on there, but there are actually entries, okay. They're not taken every month; they're not taken very often. And this is what they look like: even though Wikispaces is not protected by a password, for some reason when they try to archive it, every page looks like this. So you can't really rely on these default mechanisms. You have to take a proactive approach, I think. So with Andrew Lang we've basically gone through and written some code to address specific kinds of archiving, specific kinds of backups. So we have this ONSPreserver, for example, that will go through a Google spreadsheet that has a list of all the items that are high priority, that we want to keep backed up, and it will make a copy whenever it runs. We use Windows Scheduler; it just goes in once a day and executes it. And luckily Google spreadsheets have a very nice option: they can be downloaded as Excel. The reason that's important is that I was telling you about the calculations, and I also told you about the web services they were calling. When you store them as Excel, it retains all the calculations, and even though it captures the number from a web service call, it will give you the link of the web service that it actually used. So again, if you want to track back to see exactly where the information comes from, this is extremely useful. So where we are right now on this whole archiving issue: we have a service that, again, runs once a day and simply backs up the actual Google spreadsheet that summarizes everything. Then we take periodic snapshots, where we actually make copies of all the relevant files and the lab notebook. And then we can actually put them in citable storage.
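As a rough sketch of what that daily backup might look like -- this is not the actual ONSPreserver code, and the sheet keys and folder name are hypothetical -- downloading a public Google spreadsheet in Excel format, so that the formulas and web-service links survive, is a single HTTP request per sheet:

```python
# A sketch of a daily backup in the spirit of ONSPreserver (not the
# actual code). Downloads each high-priority public Google spreadsheet
# as an Excel file, which preserves formulas and web-service links.
# Sheet keys and the backup folder are hypothetical; a scheduler such
# as Windows Scheduler would run this once a day.
import datetime
import pathlib
import urllib.request

SHEET_KEYS = ["HYPOTHETICAL_KEY_1", "HYPOTHETICAL_KEY_2"]
BACKUP_DIR = pathlib.Path("ons_backups")

def backup_sheets():
    stamp = datetime.date.today().isoformat()
    BACKUP_DIR.mkdir(exist_ok=True)
    for key in SHEET_KEYS:
        url = (f"https://docs.google.com/spreadsheets/d/{key}"
               "/export?format=xlsx")
        dest = BACKUP_DIR / f"{key}_{stamp}.xlsx"
        with urllib.request.urlopen(url) as resp:
            dest.write_bytes(resp.read())

if __name__ == "__main__":
    backup_sheets()
```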
So we have the archives available from lulu.com, so you can actually buy the CD at cost -- it's like five dollars -- and that will have the archive for a particular day. We also published it as a book. And if you're interested in seeing this, it basically takes the Google spreadsheets and puts them in a human-readable format so you can browse through the book. And this book corresponds to this data archive. Okay? So that's the concept here. So basically, from a slightly more technical standpoint, the way this works is that Wikispaces has a way of exporting the entire wiki as HTML. So we start with that. And that export makes local references to any images or files that were uploaded onto the wiki. Then we have all the spectral files, the NMRs; collecting those is actually a very short manual step. And then Andy has written code that goes through each page, identifies which Google spreadsheets are cited from it, downloads them, and stores them so there's a local reference when you look at the archive. So if you go into this archive, for example, from February 11th, you'll notice that this is all local. And when you click on any of these links, it will just redirect you locally within that archive. So again, this is the concept of the snapshot: on this particular day, what did all the data sources look like? Okay? So this is the way it looks on Lulu when you publish these. We've also used DSpace at Drexel, as a zip archive, okay? The difference is that when you download this, there are certain functions that you can't use, like viewing the interactive spectra. But otherwise, everything else is basically the same. So we have made use of these. And we have this data book, which is available. So this is kind of interesting: they charge a fairly minimal amount to simply print and publish the book, and then you pay for shipping. And in the end, this is what it looks like. Okay? So that's basically it for the archive. Let me just step back for a minute to the initial questions that I asked. Is our current system working? I showed you three examples where it isn't working the way that chemists would like it to. Is it difficult or expensive to implement ONS? Not necessarily. I showed you that Wikispaces is free and hosted, and Google spreadsheets are completely free and hosted. And the code that we've written for all this archiving, of course, is available to anyone. So it's certainly possible. Does it prevent peer-reviewed publication? No, but you do have to look at the publisher to make sure that they allow it. Is it discoverable? Yes -- like I said, Google and Wikipedia are the main routes. Can it be easily archived? This is a very recent thing, but I think we're pretty good on archiving right now. In terms of intellectual property protection, though, I think this is very problematic. If you intend to patent anything, you have to understand that this is a disclosure, every day and every hour that you make it available. Although Steve Koch actually has a very interesting strategy, because you do retain the ability to file for a US patent for up to a year after you make something public. So even though you don't have international rights, there are still some ways to get some intellectual property coverage. But for the work that we're doing, that's not an issue; we're just not going to pursue that. So I'd like to thank you very much for your time. [applause]. >> Lee Dirks: Are there any questions for Jean-Claude?
>>: On the Google spreadsheets, is there any plan to make the SMILES into a chemical structure -- you know, an actual chemical structure -- so that you could search that way, or ultimately do you have to rely on exporting to Excel and using some plugin to do chemically aware searching? >> Jean-Claude Bradley: Yeah. I mean, the idea is you have the API -- yeah, you can query it in sophisticated ways, in terms of "only give me the compounds within this solubility range," or do a substructure search. So those do exist, but they don't exist as an inherent property of the Google spreadsheet. That comes through the API, and we have those things written. So that's not an issue. But yeah, we do use SMILES; it's probably the most convenient way to store the information in a spreadsheet like this. >>: Well, one thing that happened -- I was doing a search on John Wilbanks -- is that he had a column saying the Journal of Visualized Experiments, JoVE, started as open access and last year decided, arbitrarily on their part, I guess, not to be open access. So is that a problem? If you're predicating everything on the open notebook and then suddenly one of your sources says, oh, by the way, it's not going to be open, can you get access to your own data and extract it and then post it yourself somewhere else? >> Jean-Claude Bradley: Well, you can do that. But I mean, the other thing is that what was open access at JoVE is still open access. So our paper is still open access. What they were saying applies going forward. And you still can make a paper open access, you know, the gold approach: you can pay for it. So that's still available. They haven't done a bait and switch where it was open access, or we thought it was, and then they removed it. It still is. >>: So the whole system you have is really, really impressive and shows how well open science is working. I'm curious, if a chemist just like you who isn't doing any of this wanted to start, how difficult do you think it would be? >> Jean-Claude Bradley: So the question is how you can get involved or get started. I went through very quickly and showed you the best of what we've been doing. You certainly don't have to implement every single thing that we have. I have actually taken a chemist through this, and it's just a question of creating the wiki, showing a little bit how to create the Google spreadsheets, and how to connect them. There's also a blog component to this, where we talk about our progress and then link to the actual notebook pages, because we certainly don't expect people to read the notebook like a magazine. But in terms of some of the archiving and some of the more sophisticated stuff like that, I don't think you need that initially to get started. I think you simply need to have the wiki created and learn a little bit about how it works. I like Wikispaces too because they have a pretty nice visual editor, so you don't really need to know any wiki markup to get going, which I like for my students. But, yeah, I think the whole point is that there are a lot of people in the open science community who are very happy to collaborate and help. So what I would suggest is: talk to someone who is doing it and have them help you set something up to get started. That does seem to work well. >>: [inaudible] Lulu, do I get to specify which day's snapshot it is? >> Jean-Claude Bradley: Yes. Yeah.
This is the third edition, and each edition has its own entry on Lulu. >>: Okay. So editions aren't every day? There could be a new edition, depending on [inaudible] spreadsheet? >> Jean-Claude Bradley: Probably every couple of months we want to put out a new edition, as we implement new things -- the archive, for example, wasn't part of edition two. So now that we have it, there's a preface where we explain the new features, and then we just take a new snapshot. But, yeah, that's actually a good question: how often should we do the archiving? I think every couple of weeks or months makes sense at this point. [applause]. >> Lee Dirks: Would that all scientists were this thoughtful and diligent in their approach -- some amazing best practices there. All right. Well, let's take a break. Let's start again at 11:30 a.m. sharp. So that gives us about 18 minutes. Grab some coffee and chat with your colleagues. Thank you.