>> Lisa Green: Good morning. My name's Lisa Green, and I'm from Science
Commons. I'm really excited to be here today. I hope you are too. And if you're
not, I'm sure you're going to be once we get started. We have a spectacular
lineup of speakers.
And the things we're going to be talking about today, I mean, these are some of
the most important ideas in science right now. And their impact goes well
beyond even science. So I'm really excited, and I'm sure this is going to be a
very stimulating and rewarding day.
Before we get started, I'd like to give out some thank yous. And my first thank
you goes to my co-organizer, Hope Leman. Many of you know Hope. If you don't
know Hope, I encourage you to introduce yourself to her today and thank her for
this day because it's not hyperbole to say we wouldn't be here without Hope's
work. So thank you very much, Hope.
[applause].
Also, I want to thank our video blogger, Chris Pirillo. Chris has a tremendous
following, and he's really helping us get the message out to as many people as
possible. It's being live cast now, and it will be packaged so you can download
and watch it again after the fact.
And the biggest thank you of all goes to Microsoft Research and Lee Dirks.
[applause].
So Microsoft Research is a long-time supporter and partner of Creative
Commons and has been a source of innovation for over two decades. And speaking of
who we wouldn't be here without, Microsoft Research is who really made this
happen. Lee is who really made this happen. I would get e-mails from Lee at
two in the morning and then again at 5:30 in the morning. So he had a lot of
passion about making this happen, and we're really grateful.
Later today, you'll -- well, pretty soon, you'll hear Peter Murray-Rust speak a little
bit about what's going on at Microsoft Research with his collaboration. But right
now I'd like to bring up Lee and give him a very warm thank you from all of us.
And he's going to tell us about what's going on at Microsoft Research. Lee.
[applause].
>> Lee Dirks: Okay. So, yeah, I do send e-mails at weird hours, there's no
doubt about that. But it's definitely Lisa and a lot of people from Microsoft
Research that made this happen. It certainly wasn't me by myself by any stretch
of the imagination.
My name is Lee Dirks. I'm from Microsoft Research, specifically a team called
External Research. I'm going to speak a little bit about that in a moment, some
very brief comments. But first off, thank you very, very much for coming out
bright and early on a Saturday morning. It's a tremendous day, so I'm very
sorry to lock you in a room. But we have in conjunction with Lisa lined up an
amazing, amazing group of speakers.
So I drove all the way from Seattle to see them. Sorry. I don't know about you
guys, but -- no, but many people have flown in. Obviously some of the speakers
are here from the UK. So it's a tremendous occasion. And I think it's going to be
very, very exciting. There's some things that were announced yesterday that I
think will be delved into today that are going to be of interest to everyone. So,
an auspicious milestone. But I'll leave it to the speakers to chat about that.
I did want to cover a few logistics for the day. First off, wireless. Everyone, if you
don't have the wireless code, it's written up here. We also have slips of paper at
the registration desk. So if you need to get access to the Internet, we should
have no problem. If there's any problem, do let me know, track me down any
time during the day and let me know.
As Lisa mentioned, Chris is live streaming this, so that's available right
now. In addition, Microsoft Research is going to be taping and capturing the
entire day. We have a relationship with the University of Washington
and the ResearchChannel that we put a lot of our talks out on. So this will be made
available after the fact. And we can send a link out. So this whole day will be
captured and you can definitely share that with everyone. So the speakers have
all signed release forms except the two I haven't been able to track down yet. So
if you are a speaker who hasn't signed a form yet, come find me. But that will be
available.
We also will be having a reception at the end of the day. So at five p.m. until
about 6:30 out in the atrium of building 99 we'll have a wine and cheese
reception. Everyone is welcome to join. Also at that time, we will be giving
everyone free copies of the Fourth Paradigm, which is a book that Microsoft
released in October related to the future of eScience. And it was,
we're very proud to say, Microsoft's first publication released under a Creative
Commons license. So all of the content is openly available, thanks obviously to all of the authors
for making that available. But -- I do have a copy. I'll show you what it looks like.
So everyone will be getting a copy of that. And yes, there's the Creative
Commons license right there. So everyone will be getting copies of this.
And we've got a commitment from John Wilbanks, who is the author of
one of the articles, that he'll do book signings. He said if people want my
signature, I'm happy to give it to them. So pin him down, okay.
The other kind of logistical point is obviously this is a Microsoft building. We're
not normally open to the public on Saturday. So I think many of you have dealt
with the security guard to get in. We need you to kind of stay in this area. This is
a public area. But if you need anything, again, come to me or find the security
guard if you get outside of the building or something like that. The security
guard should be there all day. But if you get locked out, knock and
wave at him.
And restrooms, hopefully you found them down that direction. There's going to
be refreshments all day in the back. We have a box lunch for you and then
again the reception. If you need other beverages other than the ones that are
provided here, outside of this hallway and in the back corner, you can actually go
out through either of these doors as well, there's a refrigerator with a selection of
beverages there as well. So if you need a water or Coke or a little caffeine to
keep you up, you'll find that.
Perhaps some of you are saying, wait a minute, what does Microsoft
Research have to do with, or why are they interested in, Creative Commons or Science
Commons? And so I wanted to give a little bit of context.
Our legal team has been working very closely with Creative Commons for many
years. So Tom Rubin and Lawrence Lessig go back a ways, and so we've had a
long-standing relationship with Creative Commons overall. But mainly over the
last three or four years Microsoft Research and External Research has
developed a relationship with Science Commons.
And what I wanted to do is go ahead and give you some background. So
Microsoft Research overall is a group of about 900 researchers, about 450 of
them reside in this building. And the remainder sit in these locations around the
world. Research, as Lisa made reference to, has been around for almost 20
years. We'll be celebrating our 20th anniversary shortly. And we, you know,
work in all areas of computer science, but specifically the group that
we reside in is referred to as External Research.
And so our team is very focused on applications of computer science and
specifically applications of computer science in health and well-being, in core
computer science, in Earth energy and environment, and then the area that I'm
responsible for which is education and scholarly communication.
And so we look at the applications of computer science in this area, and our team is
called External Research because our team doesn't actually do the
research here; we do research in collaboration with external parties, typically
most often with academics. And you'll see one or two examples
of that over the course of the day, some of the projects that we have going on.
But we've done a couple of partnerships with Creative Commons that I just want
to reference specifically out of my group, one of which some of you might be
familiar with. We actually, about two years ago, released an add-in for Microsoft
Office which allows you to embed the Creative Commons -- let me go back to this
for a second -- the Creative Commons license in either PowerPoint, Excel or
Word. This is actually one of our most popular downloads. We've had
something like over half a million downloads of this add-in alone.
Another one that we did about a year and a half ago was the ontology add-in.
We originally did this work with Phil Bourne at the University of California, San
Diego, and then we actually did some work with John Wilbanks and his team that
is responsible for NeuroCommons. So this is an ontology add-in that allows you
to import any ontology. Whoops. It's on a timer. Shouldn't be. Sorry about that.
That allows you to import and embed an ontology into the Word document and
have that travel with it, and then mark up specific words in XML and embed those
tags in the document itself. So just a couple of examples of the
engagements that we've had.
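To give a flavor of what that kind of markup involves, here is a minimal sketch of the general idea, not the add-in's actual implementation: recognized terms get wrapped in XML tags that carry ontology URIs, so the annotation travels with the document text. The term-to-URI mapping and the tag format are invented for illustration.

```python
import re

# Hypothetical term-to-URI mapping; illustrative only, not the actual
# vocabulary or tag schema used by the Word ontology add-in.
ONTOLOGY_TERMS = {
    "neuron": "http://example.org/neurocommons#Neuron",
    "axon": "http://example.org/neurocommons#Axon",
}

def tag_terms(text):
    """Wrap each recognized term in an XML tag carrying its ontology
    URI, so the semantic annotation travels with the document text."""
    for term, uri in ONTOLOGY_TERMS.items():
        pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
        text = pattern.sub(
            lambda m, u=uri: '<term uri="%s">%s</term>' % (u, m.group(0)),
            text,
        )
    return text

print(tag_terms("The axon extends from the neuron."))
```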
That was just to give you a little bit of context. What I'd like
to do now is actually turn it over to my colleagues Stewart Tansley and Kris Tolle,
and they're going to go into a little bit more detail about the Fourth Paradigm.
And I think this is a -- if you're not familiar with it, I think you'll find it an intriguing
book. And I think again John's got an article in it, but there's a lot of -- a lot of
similar concepts that I think we'll be addressing today that were addressed in this
book. And so I'd like to hand it over to Kris and Stewart.
>> Kristin Tolle: Thanks, Lee. Tony Hey, who is the
main editor of this book and is also the vice president for External Research,
likes to describe himself as somebody who practices management by walking
around.
Those of us that know Tony a lot better and work with him actually would say that
Tony likes to first send out an e-mail let's go do crazy idea X and then he
wanders around trying to find out why nobody responded. [laughter].
And that's pretty much actually how the Fourth Paradigm started out. He really
wanted to create a book that would show how computer science was facilitating
each of the pillars inside of his External Research team. And those pillars are
health and well-being, scholarly communications, Earth, energy and
environment, and then core computer science.
And the book itself is actually structured this way. It has these four different
themes.
Now, there was one person who actually responded to the mail and said here's a
list of authors I think would be really good, because she'd already been thinking
about creating a similar book but just localizing it to health and well-being, and
that person actually happened to be me.
And so once that mail hit his door you could almost hear the elephant-like
footsteps pounding their way down to my office saying you think this is a good
idea, we should do this, and of course I agreed. So a book was born.
From the beginning, we had three tenets that we wanted to make
sure that we hit. And the first tenet was that we wanted this book to honor Jim
Gray and his memory and the good work that he did.
Secondly, we wanted to illustrate how computer science was really transforming
science and how it was dealing with the
deluge of data that we deal with every day.
And lastly, we wanted to make sure that this would be openly and freely available
to everyone because we thought Jim really would have wanted it to be that way.
So since I had already had my list, then the next challenge was for us to go
around to each of the other pillars and collect those lists. And then we got down
to the real core business of writing a book. And I won't go into detail, because
I'm sure many of you know what it's like to edit a book. It's a considerable
amount of effort and a considerable amount of challenge.
So along the way we picked up Stewart. And believe me, Stewart was a
godsend. Not only was he helpful with the day-to-day stuff, but he also brought a
very unique perspective to the book. He brought the perspective of how do we
get there from here? So that's useful in the day-to-day work where you're having
to collect papers, get them back, get them edited, send them back, go for review.
But more importantly, he brought the perspective of how you would think
of the book as a whole and then move that into the space where people
could actually make it actionable. And so he really helped us see how we could
take this book forward and make sure that it remained a living document, that it
would be something that would be owned by the community. Stewart.
>> Stewart Tansley: Thank you, Kris. That's kind of you. We're Microsoft, too,
right, we're Microsoft, too. So I've got my slides. I thought I'd show you some
slides too. Thank you for the introduction, Kris.
The -- I'll press the right keyboard first. What is this book in terms of the gritty
detail? It's 250 pages. You've seen a copy. You'll get a copy later if you
so wish.
It's the first Creative Commons publication from Microsoft Research which we're
really proud of. As Lee said, it took us a while to get there. We've been working
with Creative Commons for a number of years. But this has really been a
breakthrough, and we hope to see other publications come from there, from Microsoft
Research.
It's an interesting collection of 26 papers, short technical papers, just 2,000 words
on average. There was some editing involved.
>> Kristin Tolle: Some editing.
>> Stewart Tansley: There are about 70 leading practitioners from around the
world. Many of their names I hope you'll recognize, a few you may not
recognize, but we think that you will recognize them in the years to come.
It's not all Microsoft; there's a significant Microsoft contribution, but it's mostly not Microsoft
people. It's published by Microsoft Research, but mostly it's about 45 authors,
scientists and some computer scientists, from around the world.
The four themes I won't go into again. We've highlighted those. And just to
reiterate, our own External Research group is somewhat structured along similar
lines because it maps how we think this field is panning out.
Similarly, Jim Gray, together with Alex Szalay and Tony Hey, really formulated
this concept of the Fourth Paradigm. There are some hints about the other three
paradigms, if you're not familiar with this, at the bottom of this page. It really was a
testament to Jim, who inspired all of us who interacted with him. And it is a certain
labor of love. Kris was very keen to even use that phrase in the book. We're
very proud to represent Jim's legacy going forward, but as a living entity, not as
something that just is a milestone. This is something that Jim would like to have
seen going forward as an idea, a meme in the community.
We launched it at the eScience Workshop, which I hope you're familiar with, in
October.
And let me show you the cover. You've seen that. This copy doesn't have
the Creative Commons license on it; this was a prepublication version. But I
assure you it is on the real one.
I won't have time to go through all of the details, but here I do put up some of the
papers. You see we have structured it in the four sections as described,
and you'll recognize some of the names already there.
On the next page, I think John is highlighted, bottom right. Yes, no?
>> Kristin Tolle: Yes.
>> Stewart Tansley: And so without further ado, that's what it looks like inside. I
hope you enjoy the book. You can get to it from this URL. If you don't want to
carry a heavy copy home, if you traveled a long way, it's downloadable, too. But
it's printed nicely, so do take a copy for yourself. Okay. Thank you very much.
[applause].
>> Lee Dirks: Very good. And with no further ado I -- well, one bit of ado. Sorry.
I did want to pass along regrets from Tony Hey. Dr. Tony Hey was unable to be
here today. It was something he was very passionate about; he very much intended and
wanted to be here. Unfortunately, he had this pesky event of being
made a fellow of the AAAS today in San Diego. So he had a conflict. So we
decided we'd let him go on that. So he definitely does wish he could be here and
passes on his regards.
So now with no further ado, I would like to hand the podium over to Dr. Cameron
Neylon. And he will be speaking to us about science in the open: why do we
need it and how do we do it?
>> Cameron Neylon: Okay. So am I amplified yet? Or do I just need to talk
loudly. So thank you again to Hope, Lisa, and Lee for the invitation to come; to
Hope in particular for e-mailing me, I think about once every 12 hours over a
period of a couple weeks, saying are you coming, have you sorted the logistics
yet. And thank you to our hosts, and thank you all for coming.
I put this slide up hopefully to somewhat frame the point of today's discussion. I
should add you are also free to take notes, to think, and to disagree with me and
indeed to publish that.
But while I'm thanking people, I want to thank a whole bunch of people. And for
those of you who have seen this slide before, I've actually updated it
now, so it's not quite as out of date as it used to be. If I have seen any distance
at all, it is by standing on the blog posts, the tweets, and indeed on the formal
publications of others.
We would do well as scientists to remember that -- whoops. It's coming back.
We would do well to remember that even the biggest paradigm shifts, the biggest
breakthroughs in science are really a very thin veneer over what has come
before. And sometimes a little humility perhaps might be effectively applied to
the process of how we think about communicating science and how we manage
it.
And all the people up here are not necessarily people I've met. Some of them
are people I've met online. Many of them are people I disagree with profoundly,
but they are people who have influenced my thinking and to a very large extent I
can no longer tell which of the ideas I'm presenting are my ideas and which have
come out of these.
Think of this more as me filtering the things that I've seen. Though you shouldn't
hold these people responsible for what I'm saying, obviously.
Okay. So who am I? I live in Bath, which is a lovely place to live, and I work at a
place called the Rutherford Appleton Lab, which happens to be about 60 miles
away. So I have about a two hour commute each day.
And the organization I work for is called STFC. We're a UK infrastructure
organization, however, we are also a research funder, so I need to put up this
disclaimer saying that I'm not presenting any policy here at the organization blah,
blah, blah, blah.
So I get up relatively early in the morning, unfortunately and I catch a train, and
then I get on a bus and then eventually I get to work usually about quarter to 9 in
the morning.
The things I work on could more or less be described as structural biology.
We're trying to determine the structures of biological molecules,
and in particular I'm interested in trying to solve the structures of assemblies of
biological molecules using a range of techniques, but primarily small angle
scattering. For those of you who are interested, I can witter on about that for
several hours but I will avoid doing so at this stage.
I also get to do a lot of really, you know, cool experiments. We do some stuff
with protein labeling and connecting proteins to other stuff, so I get to take cool
pictures of fluorescent stuff, which is always a lot of fun.
It's an interesting mixture of small lab work -- this is actually a picture of me in
the lab -- and work at a large facility, so we do experiments that
involve the big iron of experimental facilities, but also we have some experience
of the problems of handling and looking after data.
As a scientist you spend an awful lot of your time reading, you spend an awful lot
of your time in meetings, and probably too much time travelling. This is my
second home. The departure terminal at Heathrow terminal 5. Often I'm doing
that to give a talk which usually involves me preparing the talk at the last minute
often on a Saturday morning. Which I should say for those of you following along
at home you can find the slides to a similar talk at
slideshare.net/CameronNeylon, and they're not quite the same slides, but most of
them are there.
So I showed a picture of myself in the lab, but often students are actually quite keen
to keep me out of the lab. This was not actually my fault, but these kinds of things
do happen, and of course they lead to more meetings about safety and then
more reading.
But, you know, at the end of the day, I get on my bus and go home, unless of
course I'm travelling somewhere.
So the question one might ask about this kind of lifestyle, the lifestyle that many
scientists choose to lead is why. And there are a series of levels to this question.
The first perhaps is why does somebody actually pay for me to do this? Why
are governments funding this kind of work? And in fact that's really the wrong
question, because it's not the governments that fund research, it's the
wider community, the public.
But of course we should really ban the use of the term "the public" because we are
the public. So the question is, as the community of people who pay taxes,
whether that's directly or indirectly, why do we think science research -- me getting
to do cool stuff in the lab -- is something that's worth paying for?
And there are a number of answers to that. People are very keen on seeing
medical advances, cures. Prestige is a big issue. This is a close-up of a Nobel
Prize medal. Countries are actually very keen to get Nobel Prizes. The GDP of
a small country can be significantly increased by the winning of a Nobel Prize,
surprisingly enough.
And we shouldn't forget just the idea of -- just the pure excitement, the idea that
we can talk about exciting stuff like galaxies, like the origin of the universe, like
how our biology actually works, and that is something that appeals to us. It
appeals also to the community, and it particularly appeals to children, who may be
the next generation of scientists. So there are a lot of reasons why as a
community we fund research.
Why do I do it? Why do I put myself through the rigmarole of running around a
place, doing all of these things? Well, simple answers. I have a mortgage to
pay. I need a job the same as anyone else. Many people have said that I'm far
too curious for my own good, that I stick my nose in places where it's not really
wanted and probably not really advisable to put it.
And of course it can be fun. And again, as a researcher often stuck in meetings
or stuck in dealing with stuff that I don't really want to know about, it's worth
thinking back to when this was just sort of an amazing thing to look at and just
really cool. And, you know, I do get to do these things. I do get to come out and
listen to the rest of today's speakers, which is just going to be really great fun.
And that really comes to the core point that as a scientist this is a privilege. I
have an immensely privileged life. I get a good salary to do stuff that I find
interesting. It's certainly not a right. And so the question that I ask is how do I
deliver the best on the public investment in my time? And I'm not going to get
into an argument about metrics and how we measure things because that would
be another whole day of talks, and we probably wouldn't agree on the outcome.
But that's perhaps not the point.
I think the key thing is that this is not a right. As a scientist, I have an obligation
to the people who fund me to do useful stuff. And if we can't talk about the
details, then we can at least say that when we do this kind of thing we should be
maximizing the value, the return on the public investment. I don't mean the
economic return; I mean the things that I'm generating: papers, results, drugs,
media coverage that gets kids into science. These are the things we should be
maximizing the delivery of for the amount of money that we have available to us,
especially in a situation where that amount of money looks like it's going to be
taking a bit of a nose dive over the next couple of years.
That's easy to say. Less easy to know exactly how to do it. But I think there are
some obvious answers. One is simply that we make sure that the science is
available for people to build on. I said before this is a thin veneer. And we need
to leverage the ability of as many people as possible to build on it. I mean I see
this basically as a no-brainer. We need to make sure that the widest community
possible has access to the results so they can build on them. And we can talk
about how best to do that, and several of the speakers later in the day will talk about
how best to do that.
So I'm not -- I'm not really going to talk about open access to the formal
published literature, because sometimes, and I would argue often, formal
publication, this process that we go through of traditional peer review and
formatting prior to publication, is really overkill. You do not need a sledgehammer to
take down a snowman. Though sometimes it's fun, particularly with the amount of
snow we've had in the UK recently; it's been really a little bit difficult to cope with.
But, again, that's a slightly different story.
Let me give you a quick example of that. If you had done a Google search for
solubility of Boc-glycine in THF at 9 a.m. on the 4th of September 2008, you
would have got some not-very-useful Google results, none of which really had
the answer in them. Which is a little disappointing because the day before that,
Jean-Claude and I had actually been in the lab doing an experiment measuring
the solubility of Boc-glycine in THF. It doesn't matter what that is; what matters is
we did an experiment, we got a number out.
But this was not available to the rest of the world. Except when I did that search,
I know that Jean-Claude was sitting in his hotel room actually writing up the
experiment and putting it online. I'm not going to talk about the details of this
because Jean-Claude can do this much better than I can.
The point is when I did that same search the same evening, the answer is up
there. We haven't been through peer review, we haven't gone through a process
of waiting nine months to put a number in the public domain, it's just there, it's
available.
Now, I don't know whether the following morning some chemistry student
somewhere in the world benefited from the fact that this number was suddenly
available and it made it easier for them to do their experiment, but I do know that
there was nothing gained by holding on to it for nine months.
The point is the web makes publishing, in the sense of making public, extremely
easy. And there are a lot of services and systems available for putting a wide
variety of data, documents, and media on the web. And again, Tony -- [inaudible] -- Tony will no doubt be talking about one example of that later in the
day.
We can put this stuff on the web. We can put our lab notebook on the web.
Now, inspired again by the work of Jean-Claude, this is my lab notebook, it is on
the web, you can go and look at it. It goes up, it's available, the data's there, it's
indexed by search engines.
You might ask the question whether anyone looking at this can actually
understand it, and that really raises an important question. So perhaps it's
better to say of the web that broadcasting is easy, putting material out so that
people can look for it is easy; actually sharing it effectively is a much harder
problem. Both because you have to make the choice to put that sign on your
table and because you have to make the choice and put in the work to put it in a
form that other people can actually find and use.
So I would argue, and many others have argued, John Wilbanks perhaps key
amongst them, that the really important thing to focus on in all of this is
interoperability. It's making sure that I don't have to bring that ruddy adapt-a-plug
which sparks every time I put it into an American socket whenever I come to the
US. And that I certainly don't end up in the situation you find in various places in
Europe, where the pins are fine but the size of the plug is wrong.
We need technical interoperability, and then we can talk about formats and
vocabularies, and there are other people who are much better
equipped to talk about that than I am, and that requires work. We need legal
interoperability. We need to be able to be sure that we're allowed to
use data, to use ideas, to use images, for the repurposing that we
want to do.
And again, Creative Commons and Science Commons have done an awful lot of
work on this, and the fundamental conclusion we come to in most cases is that we
need to use very liberal licenses to make this work properly. That tends to
involve putting things in the public domain or putting them under Creative
Commons attribution licenses.
And as Lee has alluded to, one of the things I'm really proud to be able to talk
about today is the idea of trying to come up with principles, approaches, tick lists
that make it possible for people to be sure they're sharing data effectively.
So the Panton Principles, which were published yesterday, like all good
things that come out of English academia involved Peter, myself and Rufus
Pollock going to the pub and having an argument. And where this came out of --
so for those of you who don't know, Rufus Pollock is one of the founders of the
Open Knowledge Foundation. The Open Knowledge Foundation is an
important organization promoting open culture, open science, and open source
software.
And they have a slightly different perspective on the type of licenses that should
be used than Science Commons do. And what was really important about this
was that rather than trying to come up with a broad and overarching legal
principle about what to do, what we did was focus on what we could agree on
and what we thought other people might be able to agree on.
So the idea here is that fundamentally, if you want to publish data, to publish
science in a way which is actually useful to other people, in a way in which they
can reuse it, you need to be able to go through a series of almost tick lists to be
able to do that.
So this is not a statement about when you should publish or if you should
publish; it only applies when you have decided to publish some data. So if you want
to sell data, if you want to do that in a proprietary way, this doesn't -- this doesn't
apply, and we're not trying to cover it. I would say you're making a
commercially silly decision if you do that, but that's just me.
What we're talking about here is, when you decide to publish data, please do
these four things. Be clear: make a clear statement about what you want to do,
make it absolutely explicit what you want people to do and whether there are
things you don't want people to do. And the best way to do that is to use the
legal instrument that actually applies to the stuff you're doing. So if you're
producing data, please do not put a Creative Commons attribution license on it,
because it's almost entirely useless for data. Use something that works.
I should say these are the shortened versions, the sort of headline versions, of the
three points. Do not use non-commercial terms. There's a discussion about why
that is, but effectively, using non-commercial terms obviously blocks commercial
use of the data, and it blocks the use of this data to make money and return money
back to the process of making more data. But this is really the absolutely key
point. This is the point that we bring.
If you publish data, if you decide to publish data, place it explicitly in the public
domain, particularly when it comes from public science. And this is really the
key. And there are instruments, legal instruments, for doing this. So I encourage
you to go to the website and look at the whole thing in full. If you agree, then I
would also encourage you to sign up. If you disagree, if you think there are
issues with what was said, please take part in the conversation. This was
really an attempt to find common ground and find the things we can talk about. And
there are going to be other issues about when to publish, how to publish, what
sort of policy conditions there should be on that. And Peter will talk a little bit
more about this later.
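To see what those legal instruments look like in practice, here is a minimal sketch of machine-readable licensing metadata for a published dataset. The field names are invented for illustration, but the license URLs point at real public-domain instruments: CC0 from Creative Commons, and the Open Data Commons PDDL favored on the Open Knowledge Foundation side of the debate mentioned above.

```python
import json

# Minimal sketch of dataset metadata with an explicit public-domain
# dedication; field names are invented for illustration.
record = {
    "title": "Solubility of Boc-glycine in THF",
    "creator": "Example Lab",
    # Creative Commons CC0 public-domain dedication:
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    # Alternatively, the Open Data Commons PDDL:
    # "license": "https://opendatacommons.org/licenses/pddl/",
}
print(json.dumps(record, indent=2))
```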
So I've talked about technical interoperability very briefly, and I've talked about legal
interoperability, but I would argue that these are actually subsets of the whole
thing. We need processes, and we need process that actually allows these
things to interoperate. All of this governance about four things and tick boxes and all
of that should be taken away from the scientists. They should just have
to make the choice -- do I want to make this open, yes or no -- and have it all taken
care of for them, because they're busy people. I'll tell you that for nothing.
So we need systems that actually work with the existing processes that scientists
actually are using, and we need systems that work with the people. I've said in
the past, if you're using the user as your API, then something's going horribly
wrong. So presenting the scientist with a tool like this is not incredibly helpful.
And this is, to be fair, what a lot of the stuff out there actually looks like, when
what they're really looking for is something like this: Something straightforward,
something simple where it knows who they are and they know what they're
doing. I just want to get on and do my experiment. That's the key thing. Tick the
box, move on.
And I would argue that what we need to do is we need to capture the objects, the
things that happen as part of the research process and then add the structure on
later and provide the tools that help people to add that structure on.
And what I mean by that is to map the processes that I use in the laboratory at
the computer when I am doing things on to agreed vocabularies, on to these
interoperability things. But map them, don't insist that I use them when I'm doing
the work, when I'm doing the stuff that I do about which I hope I'm the expert.
Map these processes on to those vocabularies when we tell the story, when we
have a narrative that we want to fit this into. There's a real problem with a lot of
these systems. They make the assumption that I know why I'm doing the
experiment. And I can tell you most of the time I don't. Most of the time I don't
know what the data's going to be used for. That's the whole point of making it
available so somebody can do something totally unexpected with it.
Machines do structure, computational systems do structure. And they need
structure, and we need the machines. The scale of the data that Lee has
mentioned, that Steven will talk about later in the day, is such that you cannot
handle it with pen and paper. We need the machines to be able to do
anything useful with this.
But we don't do structure. We tell stories. So we need tools that capture the pieces of
the research record, the samples, the data, the little scraps of text that you wrote
down on a piece of paper in the lab; tools that help us structure that and pull
it into a story when we write a paper, when we write a report, when
we're doing a presentation; and tools that are actually aware of the structure
that's already there, capturing and leveraging the structure that's already
there as part of the process.
So Lee mentioned the work of Microsoft in this space, and Peter will talk later
about -- about systems that do this in other places. I want to show an example of
some fairly preliminary work that I've been doing. I know I shouldn't possibly be
showing Google products in a Microsoft space but that's what I work with
because I'm too dumb to work in C# basically.
And so the point here is that I'm taking a notebook, I'm writing something about
an experiment that I am doing, just typing it in. But what I'm doing is bringing in
information from other places. These are just RSS feeds. So this robot, this
system is just grabbing information from this feed. And it's using it to populate a
drop-down menu.
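As a minimal sketch of that pattern, assuming a hypothetical feed URL and using the third-party feedparser library, the entries of an RSS feed can be turned into the label/link pairs a drop-down menu would offer:

```python
import feedparser  # third-party: pip install feedparser

# Hypothetical feed URL standing in for the lab's sample/data feed.
FEED_URL = "https://example.org/lab/samples.rss"

def dropdown_choices(url):
    """Fetch an RSS feed and turn its entries into (label, link)
    pairs that a notebook interface could offer in a drop-down."""
    feed = feedparser.parse(url)
    return [(entry.title, entry.link) for entry in feed.entries]

for title, link in dropdown_choices(FEED_URL):
    print(title, "->", link)
```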
So as I go along and I want to talk about what I'm doing, talk about the inputs, the
ideas that led to this experiment, I can just select them, insert them, and the
system creates the link. The system puts that information in and then my tweaks
come up.
You know, again, if I'm referring to the literature, I might have some literature
online, and again, I can just create the link. Ideally if I've generated some data or
if I'm going to generate some data, that already is available somewhere online.
Maybe it's secure, maybe it's fully available. But again, it's being dumped
somewhere without my intervention and again, there's an RSS feed. So this is a
beta product obviously. And in fact, this is about to crash.
But data, and an image could be data, can be inserted automatically into the
process. Again, this is a fairly crude example. But all I'm doing is typing away
and inserting objects that are inputs, inserting objects that are outputs.
And so the question is how do I then get the structured data out? I want the
system to have captured what's happened. And so this has been automatically
generated. This document, this thing which has an identity on the web, has
inputs and it has outputs. And I've captured that information. Something's
coming in; something's going out.
Of course, if we're doing Semantic Web, then we should be generating that in
Semantic Webby stuff. And I've generated this RDF automatically. Now, I faked
this actually; the namespaces don't exist. I haven't put these up online.
But the point is, this is a snippet of RDF generated automatically from my typed
record of an experiment and selecting a couple of things from a drop-down
menu. It knows who I am. It knows who the authors of the document are, and it
put those in automatically. It knows what the inputs were. It knows what the
outputs are. And those can be more sophisticated vocabulary terms which I
might have selected from the drop-down menu.
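For a flavor of what such a generated record might look like, here is a minimal sketch using the rdflib library. The namespace and URIs are invented, echoing the admission above that the real namespaces hadn't been put online.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

# Invented namespace; the talk notes the real ones didn't exist yet.
LAB = Namespace("http://example.org/labns#")

g = Graph()
experiment = URIRef("http://example.org/notebook/solubility-expt-1")

# The experiment record, its author, and the inputs/outputs selected
# from the drop-down menus.
g.add((experiment, RDF.type, LAB.Experiment))
g.add((experiment, DC.creator, Literal("Cameron Neylon")))
g.add((experiment, LAB.hasInput,
       URIRef("http://example.org/samples/boc-glycine")))
g.add((experiment, LAB.hasOutput,
       URIRef("http://example.org/data/solubility-measurement-1")))

print(g.serialize(format="turtle"))
```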
But we can capture what I was doing as I was doing it and automatically create a
structured record which is then available for machines to do things with. So what
can we do? We're talking about open source. Well, let's say we're actually
technically able to share. And I would say, going out on a limb, that if
we made the choice, we could choose to share the entire research record.
We need to do a lot of building. We need to do quite a lot of work. And it would
cost a bit of money. But we could do this today, if we chose to do it. The
question is whether we choose to do it as a community, both a community of
researchers, specifically communities of scientists, and as a community that pays
for this research either directly through taxation or indirectly through products.
Those are the choices we can make, and the answer that you fairly resoundingly
get back at the moment is that people do not want to do this.
The mainstream response that I usually get giving this talk in a fairly conventional
conference looks something like this [laughter]. In fact, slightly more commonly,
it looks something like this [laughter]. The pram has definitely -- sorry, the rattle
has definitely been thrown out of the pram at this point. Which leads to a lot of
these kind of conversations. And I'll leave it as an exercise to the reader as to
which one of these is the scientist, which one of these is the funder or the
member of the public or indeed the institutional repository operator.
So the question becomes how do we actually persuade? If we
can make a case, and I believe we can make a case, that we can do science
more efficiently, that we can do research more effectively, that we can make stuff more
available, and that that would be a good thing, how do we persuade the community to
do that?
And to be honest, I'm actually not at all worried about this. And the reason I'm
not worried is because of graphs like this. And this is, yeah, the reflexive lazy
example that everyone gives of the data deluge problem. But it's a fairly apposite
one, and it's got some interesting more recent wrinkles on it.
So these are submissions of DNA sequences to GenBank over the past 20 or so
years. This is essentially exponential. The sharp-eyed amongst you will note that
as it gets towards the last couple of years, it's no longer exponential. The reason
for that is that 99.9 percent of the DNA sequence data generated in the last two
years has not been put into GenBank, because it can't cope.
This is a scale problem. And you could draw this graph for protein structures,
you could draw it for astronomical data, you could draw it for chemical reactions,
you could draw it for just about anything. Everything is scaling exponentially.
And we're generating more data from the exponentially greater set of experiments
that we're doing.
And this creates a problem because of this graph. We're not getting smarter or
faster; the computers may be, but we're not. Which leaves us in a situation like
this, where the average scientist, the person actually doing the research on the
grant, is running faster and faster and faster in a fairly futile attempt to just catch
up.
The point is that the human scientist, the person who remains at the center of
scientific research -- at least for the moment, the singularity is not upon us quite yet --
just doesn't scale. The only thing that really scales effectively in a technological
world is the web.
Governments do not scale. Policy generation from the top down does not scale.
See under various current UK government acts, not to mention Australian
government acts. Research groups don't scale either. You take the average
research output of a research group of 50 people, it ain't even 10 times the
average research output of a group of five people. And network theory tells us
there should be an exponent in there, it should be more than linear. Research
groups do not effectively scale. You do not get more research out by just having
bigger research groups. You do not get more research out by simply
concentrating effort in a small number of places.
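To put that claim in rough symbols (a sketch of the network-theory expectation, not a slide from the talk): if output grew superlinearly with group size, a tenfold larger group should produce more than ten times the output, and the observation is that it doesn't.

```latex
P(N) \propto N^{\beta}, \quad \beta > 1
\;\Longrightarrow\;
\frac{P(50)}{P(5)} = 10^{\beta} > 10,
\qquad \text{yet in practice } \frac{P(50)}{P(5)} \lesssim 10.
```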
The web scales by distribution. The web scales by exploiting network effects.
Which means that just to survive, just to be able to keep up, a scientist is going to
have to be web-native. Which means connected. Which means wiring yourself
into a network that provides network effects and doing that effectively and doing
it in a way that creates outputs.
And that means sharing. This is not a new concept. It goes back, perhaps most
eloquently, to Merton in the '60s and '70s and '80s. But it goes back to Bacon; it
goes back to the beginning of the Royal Society, when people were sending letters to each
other as a way of describing the latest findings. It wasn't scaling.
So what did they do? They created the journal. The scientific journal, the
Philosophical Transactions of the Royal Society, was the web of the 18th Century.
And it's not done a bad job for the last 300 years. It's just not cut out to deal with
it anymore. You have your effect in science, going back to that slide with the
names on it, you have your effect by letting other people build on your work. If
you don't do that, you're not having an effect.
And we're used to these networks in research; we're used to this concept of the
journal, of networks of papers, which may be the only pieces of data connected up
at some level. But this network isn't just papers if we're going to be trying to do this
effectively; it's the images, the ideas, the thoughts, the presentations, the lab
notebooks. All of these things build a network that, if we can build it effectively,
will give us the network effects which will let us do science more effectively.
And you can choose whether or not to make these things available. But if you
choose not to make these things available on the network, then you're not
connected. If you're not connected, you don't exist. It doesn't matter how good
this idea is if no one knows about it. When was the last time any of you actually
cracked open a leather-bound paper version of an encyclopedia? And how many
of you haven't done a web search in the last 24 hours? Where are people going
for information?
So it's open content that builds this network, that will allow us to make it
interoperable and make it effective. The network is the only
way that scientists are going to be able to keep up and to be able to function
effectively in 21st Century science, in 21st Century research more generally.
If we build these tools, that help researchers to manage and build these
networks, then I think the rest of it just follows from pure competitive interactions.
If people want to be at the top of the game, they're going to have to do this to be at
the top of the game.
So I need to thank a number of people for contributions to this talk specifically in
terms of images and inspiration for how I've given it. Thank you for coming. And
I'm happy to answer any questions.
[applause].
>> Lee Dirks: We have five minutes for questions, if any.
>> Cameron Neylon: Yes?
>>: How do we filter out the garbage? I mean, not all data is created equal.
>> Cameron Neylon: Certainly not all data is created equal.
>>: Could you repeat the question?
>> Cameron Neylon: Yes. Sorry. So the question was how do we filter out the
garbage. And this is -- this is a general problem. It's not restricted to research by
any means. There's an awful lot of garbage out there.
The bottom line, at least at the moment, is that web search tools do a pretty
good job at some level of finding stuff that's well connected, finding stuff that
other people are referring to. And that is why you need to have the actual
objects exposed on the web: if people don't link to them, in the way that we don't
at the moment for research, those kinds of PageRank-style mechanisms do not work.
That's level one.
If we cite the research objects properly, then Google will do some of the job for
us.
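As a toy illustration of that "level one" filter, with a made-up link graph among research objects, the networkx library's PageRank implementation scores objects by how well linked-to they are:

```python
import networkx as nx  # third-party: pip install networkx

# Made-up link graph among research objects: an edge A -> B means
# A links to (or cites) B.
g = nx.DiGraph()
g.add_edges_from([
    ("blog_post", "dataset"),
    ("paper_1", "dataset"),
    ("paper_2", "paper_1"),
    ("paper_2", "dataset"),
])

# PageRank rewards objects that well-connected objects point at.
scores = nx.pagerank(g, alpha=0.85)
for obj, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("%s: %.3f" % (obj, score))
```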
The second level is building better social networks that then start to help us
with filtering. So these networks of data, of objects and thoughts, are equally networks
of people. And so I don't know whether anyone's actually going to mention this
today, but there are -- there are tools available that help people filter other
people's content. The one that I know a number of us are very fond of is a tool
called FriendFeed, where I can bring a bunch of content in -- that's fine, I'm saying
this might be of interest to people -- but what matters is whether other people
interact with that content, make comments on it, these kinds of things. And that
pushes those objects up to the top of the pile.
So there are the beginnings of an idea of how we can effectively socially filter
content as well. And then there are all the questions of do you just end up with
an echo chamber, do you just end up with a self-reinforcement? And that's again
why I put up a number of people on that first slide of people I violently disagree
with. Because they challenge you to rethink the reflexive easy stuff.
So filtering is not an easy problem to solve. But you can't build a filter without
having the stuff there that you want to filter first to test it against. Yes?
>>: Are there search engines that generate a web of results as opposed to a
list?
>> Cameron Neylon: The question was are there search engines that generate a
web of results rather than a list. That's a really interesting question. I've never
had that question asked quite that way before. Not at the top -- not at the top
level. In a sense they do [inaudible] because they provide you with
hyperlinks, and those hyperlinks -- so -- and I have seen some quite clever
visualizations of search results. The major problem with that being that Google
doesn't let you actually have an API onto the search results. I'm not sure
whether Bing does or not. But either way, because those are very proprietary
outputs, it's difficult -- this is a classic example of the genre, it's difficult to build
the tool over the data because you're not legally sure what you're allowed to do
with the data.
I'm sure there are people working on those kind of things. I mean, I know there
are people working on search visualization very, very hard. But I couldn't give
you an explicit example of it. Go.
>>: What it does bring to mind is BioMed Experts; if you do a coauthor
search, the graph that comes up -- it's JavaScript -- is really very useful when you're
trying, for instance, to figure out who should review a paper. If you're an editor,
you can use that graph to find out if the suggested reviewer is published,
and you can see clusters of authors and try to pick one from each cluster, so
you've got a representation of the field in your referees. It's really very useful.
That's the best one I've seen.
>>: What is it again?
>>: [inaudible].
>> Cameron Neylon: BioMed Experts, which is -- is it Thomson? I never quite
remember if it's a separate company.
>>: I have no idea.
>> Cameron Neylon: It uses the underlying data of the co-citation network to -- it
basically generates the co-authorship network of scientific authors based on
published literature. And one of the main things it displays is you look for a
person and then it displays a network of co-authorships with that person. So
essentially it's kind of what you were asking about but for a person rather than for
the research objects. Which is kind of the wrong way around. Because if you
take Jeff Jonas's and Jon Udell's sound bite seriously, Data Finds Data, then
people find people. It's kind of the wrong way around for research, but it's a start
in the right direction.
>>: [inaudible].
>> Cameron Neylon: Sometimes you just want people, that's true.
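A minimal sketch of how a service like BioMed Experts might assemble such a display, with made-up author lists standing in for the mined literature:

```python
from itertools import combinations
import networkx as nx  # third-party: pip install networkx

# Made-up author lists standing in for the published literature.
papers = [
    ["Smith", "Jones", "Garcia"],
    ["Smith", "Chen"],
    ["Jones", "Garcia"],
]

# Each pair of co-authors on a paper gets an edge; repeat
# collaborations increase the edge weight.
g = nx.Graph()
for authors in papers:
    for a, b in combinations(authors, 2):
        if g.has_edge(a, b):
            g[a][b]["weight"] += 1
        else:
            g.add_edge(a, b, weight=1)

# The neighbors of a queried person form the displayed network.
print(sorted(g["Jones"]))              # ['Garcia', 'Smith']
print(g["Jones"]["Garcia"]["weight"])  # 2
```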
>> Lee Dirks: All right. Thank you very much, Cameron.
[applause].
>> Lee Dirks: All right, everyone, what I'd like to do now is hand the floor to
Jean-Claude Bradley to talk to us about Open Notebook Science.
>> Jean-Claude Bradley: Thank you. So thanks very much for the invitation.
Thanks to Hope and Lisa. You guys did a great job in setting this up. What I'd
like to do is to follow up on what Cameron was discussing in terms of why it is
that we need openness. And I'd like to take a pretty concrete example of that
and show you, in a chemistry application, what kind of openness currently exists and
what's actually possible. Okay? So I'm going to be talking about Open Notebook
Science with free-hosted tools.
And these are the issues on which I'd like to make a case for Open Notebook Science.
The concept is very simple. At least in chemistry if you're doing chemistry
experimentally, you have a lab notebook. That lab notebook is typically an
extremely private document, something that nobody else will see, something that
no one will read probably when you leave the lab. And there's a lot of information
in there.
And the question is, what if we make that notebook publicly available? Does that
help? So I'm going to try to make a case that it does on these various levels.
So first of all, if our current system is working very well, then, you know, what's
the motivation for doing this? And I'll show you a few examples of where the
system really isn't working very well in chemistry. Is Open Notebook Science
difficult to implement? I'll show you that there's at least one way of doing it that's,
you know, free, and fairly simple to do.
Does it prevent peer-review publication? No. I'll show you an example, although
it will be qualified, which I'll go through shortly. Can you discover the data? As
Cameron was saying, you know, if you put it out there and people don't find it, it's
not very useful. So there are ways of putting the data out there that people can
find, and even if they don't already know about your project. And that's really
important.
Can the information be usefully archived and cited? I'll show you some pretty recent
work where I think we have a pretty good system for archiving. And citation:
we've been able to cite our lab notebook pages, and that's worked out. And
finally, is ONS compatible with IP protection? Mainly no, but there's a
small exception to that that you might find interesting.
So how bad is our current system? Well, I'm picking an example here as a
chemist that I think most of you can relate to. If you're familiar with the concept
of solubility, how much sugar goes into coffee, you can only put so much. So
there's a number, there's a certain amount that you can put in. And it's such a
simple measurement that you think would be very easy to find, right?
So EGCG is actually the antioxidant in green tea. Okay. So it's a compound of
tremendous interest. There are lots of researchers that are doing things with it.
So if you wanted to start to work with this material, you'd probably want to find its
solubility to see what kind of solutions you could make.
So if you use our current scholarly communication system, you go to the
peer-reviewed literature, you go on [inaudible] or use cache to find information, and
you'll find this paper that says the solubility is 21.7 grams per litre. That's actually an
enormous amount. The number itself really doesn't make sense. And what's
really interesting is that this actually went through peer review. So the people
who reviewed this paper didn't think that was a problem.
But luckily there is a citation, okay. So if you take the first paper and go down to
the citation, you'll see that actually it was a misprint. In the original article it was 5
grams per litre, and the solubility of caffeine, 21.7, was accidentally put at the end
of the number 5.
Now, the issue here is, okay, I have the number 5; now where did this come
from? Unfortunately, I could not find a reference for this number. Okay? So we
keep searching. If we go to Sigma-Aldrich, it's a very popular source of
chemicals, and it also has a very good reputation for having good data. So if you
want to know the density of a compound or if you want to know some kind of
practical property, it's usually pretty good.
So for this particular compound it says that you can make a solution at 5
milligrams per mil, which is 5 grams per litre. It doesn't actually say that's the
solubility of the materials. It doesn't say it's the maximum solubility. It just says
you can make the solution of that number. Okay? So maybe that's where the
number came from and it got misinterpreted.
So we keep doing more searching in the peer-reviewed literature, and we find
another paper that says that the maximum solubility is 2.3 grams per litre. So this
is actually troubling as a chemist, because I have two of what are typically
going to be good data sources: I have a peer-reviewed paper, and I
have a number from a company catalog, a company that I trust very much. So how
do we make sense of this?
Well, for the company catalog, you're completely out of luck, because there's no
information about how the number was obtained at all. There's no reference. It's
just a number. And if it's a typo, you have no idea.
Now, we get a little bit further with that last paper because they do have an
experimental section where they describe, you know, how they actually did the
experiment. And this is really -- this is the best that you can do in chemistry right
now in terms of finding out how a researcher actually carried out their
experiment. But this is not the lab notebook, okay? This contains summarized
information; it contains a level of abstraction. What actually happened when they
tried to do it is, you know, a little bit more complicated.
For example, here they say they sonicated this, but they don't say the power.
And then they diluted and filtered. They don't really say how they filtered it. So I
can see some issues with this why the number might not match the company
catalog. But I don't have enough information to be able to assess which number
is more likely to be correct. So the reality is if I want to know the solubility of this
thing, it's probably easier for me to just do it because the literature isn't helping
as much.
Okay. Here's a second example, where we actually will show some notebook
information being produced: the sodium hydride oxidation controversy. How
many of you have heard about this? Some of the few people related to chemistry
here.
This is actually a really interesting story. You don't really need to understand
chemistry to understand how important this is: this was a paper, published in a
highly prestigious journal, that claimed to do something most chemists would
say is impossible to achieve. And so it generated a lot of controversy, okay?
Now, the way that I found out about it was through the blogosphere. As
Cameron mentioned, FriendFeed is an aggregator, and it's the one I prefer to use
as well; it's very, very efficient. And you see people basically just starting to make
comments about this. And then something really interesting happened. People
actually started to try to reproduce the experiments, but they also provided the
raw data so that others could evaluate whether or not what they were saying,
you know, was consistent.
So the Totally Synthetic blog tried to repeat one of the experiments and got a 15
percent yield. But again, very importantly, they published the NMR on their blog.
So this information is critical for being able to ascertain whether or not the 15
percent is a typo. If you're a chemist, you can go in and actually verify that
15 percent from the plot. So that's kind of interesting.
Now, the 15 percent was much, much lower than what the researchers
published originally. So there is still a conversion, but why is it so different?
So I was talking to my own students about it, and we were thinking about trying
to repeat this. And so again, this is a, you know, experimental section in a
chemistry journal, peer-reviewed chemistry journal. It does have information, but
it doesn't have lab notebook information.
So these are some pretty general terms. I don't know how they monitored the
reaction. I don't know exactly what they did. I have a rough idea, of course, but I
-- you know, I can't try to reproduce what they did and then see as I'm going
through if everything is matching up.
So the best we can do is just, you know, take a shot at this. So my grad
student Khalid and my undergrad Marshall Moritz basically just tried to repeat
this. And we also posted the raw NMR data directly. In this case, it wasn't on a
blog post but on a Wiki, as I'll show you shortly. And we found actually
zero percent conversion, okay? So we're getting really wildly divergent results
here.
And again, the blogosphere comes to the rescue. Someone found a paper from
1965 where this effect had actually been previously reported. And it turns out
that it's due to the particular material: sodium hydride can form a layer at its
surface that completely changes its chemistry.
And so this is actually really useful information because this reagent is used by a
lot of chemists, and even if you're not trying to repeat this, you want to look out
for possible side reactions. So this is great, and this all got sorted out. But the
final result, if you go back to the journal where this was published, all you find is
that the paper was retracted, and there's no reason given. So all of this
information, all the knowledge that was gained by sharing all the information, it's
still there. All right? And you'll find it very easily, just do a Google search.
Search for sodium hydride oxidation. You'll find our experiment at the top. The
second one is the explanation. And then the third one is the second open
notebook attempt. And so this is really what Cameron was talking about: you
know, the information can be found, and this is how people are likely to look
for it if they want to learn more about this.
So it's interesting to see the publisher's stance on this. You know, there's a lot of
useful information, but it's not being shared.
So a third example that I think is particularly fascinating is Alexander Graham
Bell's notebook. This is a recent book: Seth Shulman was interested in the
notebook and wanted to see when the telephone was invented in Bell's
notebook. So he actually looked at it and didn't find the invention before the
submission of the patent. And, you know, he ended up writing a book about
the whole sordid affair: it's likely that Bell actually stole the telephone from
Elisha Gray by visiting the patent office that day. And the fact
that he did it for love is a very interesting twist to the story. So I would strongly
recommend this. And it basically shows what just providing a lab notebook can
do. By the way, the notebook wasn't available to the public until, I think, 1990,
which is why people really hadn't looked at it in detail. And now you can find it
on the web; I think it was put on the web for free in 1999.
So again, you know, something immense that we all thought we knew about
isn't quite what it seemed.
Okay. So what I'm talking about is Open Notebook Science. And if you want to
see more examples, or you want to see articles that have been written about this,
Wikipedia is a pretty good place to start. So what I've been talking
about so far is Open Notebook Science in the sense that we make all of our
information available immediately, okay, and that's how I'll be talking about it
here. But it turns out that you might not want to expose your work in quite
that way, so Andy Lang and Shirley Wu developed these logos to
explicitly express what it is that people can expect when they're looking at your
notebook.
So the top one is what I've been discussing: all content, immediate. And here,
as we were discussing before the talk, you could have all content but delayed,
either because of a publication you're waiting on or possibly an intellectual
property issue. So there are different ways of doing it. But I think it's important to
be explicit about what you're doing. Because if you're doing the top one, what it
means is that people are looking at your notebook, and if they don't find
something, they can assume that you haven't done it, and then they can go and
possibly do it or make decisions based on that. So it is important to be clear
about what you're doing.
Okay. So this is really sort of a philosophy. You know, the question was brought
up earlier about how you can trust data. Data is not all created equal, and I think
a big problem is the way in which it is presented, right? We try to tell our
students that certain things are facts, but there really aren't facts. There are
certain things, like the melting point or the boiling point of water, that have
been measured so many times by so many people that you can use them as facts.
But the reality is that most measurements in science, and certainly in
chemistry, may have only been made once or twice. And I just showed you
an example of how hard it is to actually evaluate these data points.
Okay. So what we want to do here with Open Notebook Science is maintain the
integrity of the data provenance by making all the assumptions explicit so each
person can evaluate the source of the information as they wish.
So we're moving away from an environment of trust to one of proof. Okay. So
the point is, if you see two data points, it shouldn't really matter where they came
from, whether from the most prestigious journal or from something you found
on Google with no idea who the person is. If you can see the evidence
they're providing, that's all you need to be able to assess that particular data
point.
Okay? So let -- I want to go through a very specific example. So this talk is
about using free and hosted tools to do this kind of stuff. So I'll be showing you
exactly what kind of tools.
Here is a table. Okay, again, I don't want to focus too much on the chemistry, but
just note that as a chemist you see that there's a trend there and there's one
number that's totally out of whack. So if you were to see this in a traditional
paper, there's not much that you could really do about it: is the number a
typo, or is it really an outlier? You usually have no way of drilling
down and actually finding out where it comes from.
So in an open notebook, what you do -- and we use a Wiki, Wikispaces,
which is a very nice free hosted service -- is record the log of what the
students actually did. So we can investigate and find that at exactly this time
point the samples were vortexed, but the exact amount of time was not
recorded. Remember, I was talking about assumptions being explicit? Often it's
very useful to know what you don't know. And in this case, you know, the
students just didn't measure it. And it turns out that's actually important.
So how are you going to find out? Well, you're going to redo this experiment, but
now you're going to record this.
And that's really how science can, you know, get much better as you keep going
through these iterations.
Okay. So you can make all kinds of things available. You can also make the
rationale, the findings explicit. So you could make a statement but then you can
link to various pieces of the puzzle, various parts of raw data that support your
statement, which doesn't mean that everyone will agree with you, but that's okay
as long as they can debate the raw data.
So there's all kinds of raw data these days. We've used images, and we've used
short videos that we upload to YouTube. This is actually a very convenient way:
take a 15-second YouTube video of an experimental setup, and it saves the
student from having to write a long paragraph about how they did things.
But more importantly, the video doesn't hide things that the students would forget
to write like where the thermometer was or exactly what was going on. So this is
actually a very efficient way of doing it.
Now, we make very extensive use of Google spreadsheets. We're
reporting solubilities in this case, so we have numbers. But in a Google
spreadsheet these are not just numbers: if you click on these cells, you'll actually
see the formulas used to calculate them. And oftentimes
when a number doesn't make sense, if you go back to the Google spreadsheet
you'll find that the student made a mistake in the calculation. And that's why it's
important to, you know, not just have the numbers but be able to track how the
numbers were computed.
Google spreadsheets are sort of like a Wiki in the sense that they have different
versions. So you can go down and if you want to see if an error is corrected or if
a student made a change of any kind, you can pretty easily just go back to a
previous version. The Wiki where we actually write our experiments is exactly
the same way. You can hit the revision history and you can see who made the
change and exactly what was the change by comparing two versions.
As a specific implementation, I really like Wikispaces, because when you
compare two versions the new text is in green and the deleted text is in
red. So I interact with my students a lot this way. As soon as I see them
recording something about the experiment they did, I'll go in and make
comments, in bold usually, and then they can respond. So it's a way of
interacting very, very quickly with students who are in the lab.
Because experiments are typically pretty complicated, you want to make sure
you have all the information and analyze it correctly. So I really like a Wiki for that.
Now, another pet peeve of mine in traditional chemistry publication concerns
these NMR spectra I've been talking about. These are normally stored as PDFs. So
even when you download them from the supplementary material, it's really just an
image. And you can't blow it up, okay, because the information's not there. But it
turns out that when you blow up these peaks there's a lot of useful information in
the impurities, a lot of useful information that is simply absent once you
convert it into a PDF.
So we use an open format called JCAMP-DX, and we use the open source
JSpecView viewer that was developed by Robert Lancashire. This gives us a
web interface: people come in with a browser, they don't have to know what's
running underneath, they just take their mouse, expand peaks, and they can
interact with the data in a way that's much more useful.
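A note on format, since JCAMP-DX may be unfamiliar: it is a plain-text format,
with ##-labelled header fields followed by the numerical data. As a minimal
sketch only, the following Python reads the header and simple uncompressed XY
pairs; real spectra often use compressed encodings (SQZ/DIF/DUP) that need a
full reader like JSpecView, and the filename here is hypothetical.

    # Minimal, illustrative JCAMP-DX reader: header fields plus simple
    # uncompressed "##XYPOINTS=(XY..XY)" data. Not a full implementation.
    def read_jcamp_xy(path):
        header, points, in_data = {}, [], False
        with open(path) as f:
            for raw in f:
                line = raw.strip()
                if line.startswith("##"):
                    key, _, value = line[2:].partition("=")
                    key = key.strip().upper()
                    header[key] = value.strip()
                    in_data = key == "XYPOINTS"  # data lines follow this label
                    if key == "END":
                        break
                elif in_data and line:
                    for pair in line.split(";"):
                        xy = pair.split(",")
                        if len(xy) == 2:
                            points.append((float(xy[0]), float(xy[1])))
        return header, points

    header, points = read_jcamp_xy("EXP099_HNMR.jdx")  # hypothetical file
    print(header.get("TITLE"), len(points), "points")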
So we try to leverage as much as possible what's out there. Tony is here, and he'll
be talking about ChemSpider. ChemSpider is a way of keeping track of
molecules, manipulating them, and searching them. And we can also upload our
data directly to ChemSpider as well.
Okay? So like I said, we can upload spectra. Now, the interesting thing is that
when you upload a spectrum, ChemSpider asks you whether or
not you want it to be open. And if you make it open, there are a lot of interesting
consequences. One of the consequences that we didn't foresee at the
time was that, if it's open, we can use all the spectra to build a game, for example.
So Andy Lang, Tony Williams, Robert Lancashire and I all came together and
collaborated on this project. We now have a game, and every NMR
spectrum uploaded to ChemSpider that's marked as open data automatically
goes into it.
Okay. So I want to go through more of the items that I was going through.
This is an example of a chemical reaction that we do where we mix components
together, and sometimes we get a precipitate. So what we want to do is try to
understand whether we can predict when this is going to happen, okay? So that's
the connection to the solubility data that I was talking about.
And what we decided to do is try this interesting approach of using
crowdsourcing, for anyone in the world to come in and contribute a solubility
measurement in a non-aqueous solvent. So any solvent that's not water.
We got some funding from Submeta; they funded ten $500 awards. We got
some chemicals donated by Sigma-Aldrich. Nature contributed some magazine
subscriptions for the winners. And the concept was very simple: just, you know,
submit your measurements, put them in an open notebook, and we will
basically judge them.
We've just completed the first round, where all the awards were made. So these
are all students, either graduates or undergrads, and I think it was a very good
experience for them. We had six judges, many of whom are actually in the room
here: Bill is here, Tony and Cameron are here. And basically these judges would
interact with the students on the Wiki just as I was mentioning; they would make
a comment and then the students would respond, or not. We didn't award the
prize to the student who made the most measurements. We awarded the prizes
to the students who were the most responsible scientists, who actually interacted
and responded. And so I'm very happy with the way that was
designed, because it wasn't just number crunching.
Okay. And we had other teachers actually use this in their own labs, which is
kind of an interesting approach in a teaching lab to try to get students to
contribute to science.
Now, I talked to you about searching. So, all right, this is how we put the stuff up
there. That's all great. But how are you going to find it? Well, if you're part of
the project, you know, this is a common way of finding it: a table of
contents that has all the experiment numbers, a brief description, and who did
it.
So if you already know about it, this is probably not a bad way to do it. But that's
not how most people are going to find the information obviously, right, because
they don't know about it.
So what we do is we have another Google spreadsheet that aggregates all of the
results from all the experiments. So over that year we have, I think, around 700
different measurements, and they're all in one Google spreadsheet. And the nice
thing about this is that there's a nice API that enables you to query the Google
spreadsheet like a real database. And we can do things like this. So people
like Rajarshi Guha can come in and collaborate, because the project's open, right?
They can find it, they can collaborate with us and make their work open, and then
we can, you know, have drop-downs, for example, where you search
for vanillin in any solvent. And then it gives you a little table showing you all the
different measurements.
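The talk doesn't show the exact query mechanism, but the Google Visualization
API is one real way to run SQL-like queries against a published Google
spreadsheet. In this sketch, the sheet key and the column layout (A = solute,
B = solvent, C = concentration) are hypothetical stand-ins, and the sheet would
have to be publicly readable for the request to succeed.

    # Sketch: query a published Google spreadsheet like a database via the
    # Google Visualization API, asking for the result as CSV.
    import csv, io, urllib.parse, urllib.request

    SHEET_KEY = "YOUR_SHEET_KEY"  # hypothetical
    tq = urllib.parse.quote("select A, B, C where A = 'vanillin'")
    url = ("https://docs.google.com/spreadsheets/d/%s/gviz/tq?tqx=out:csv&tq=%s"
           % (SHEET_KEY, tq))

    with urllib.request.urlopen(url) as resp:
        rows = list(csv.reader(io.TextIOWrapper(resp, encoding="utf-8")))

    for solute, solvent, conc in rows[1:]:  # skip the header row
        print(solute, solvent, conc)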
And you can click on these and actually end up looking at the actual
lab notebook pages. Okay? So how can the scientific process become more
automated? One of the longer-term benefits that I can see, and certainly
Cameron sort of introduced this concept earlier, is that, you know, it's very
important for machines to be able to understand the information just as much as
humans.
And we're at a very interesting time right now, where if we make the
information available in some format readable by both humans and machines,
they can start to collaborate with each other. And so one of the things that we're
trying to do is basically have bots interact with our data and make
a useful contribution.
So a quick example of this involves the NMR spectra we've been talking about. So
normally a chemist would manually read them and make the calculations. But
we actually have code that automatically pulls from the Google
spreadsheet as a web service, integrates the spectrum, does the calculation, and
returns the final value right here. This is the final solubility value.
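He doesn't spell out the calculation itself, but a common NMR solubility method
compares the solute's peak integral against an internal standard of known
concentration, scaled by the number of protons each peak represents. The sketch
below shows only that arithmetic; the numbers are hypothetical, and this is not
the project's actual bot code.

    # Hedged sketch: solubility from NMR integrals vs. an internal standard.
    def solubility_from_nmr(integral_solute, protons_solute,
                            integral_std, protons_std,
                            conc_std_molar, mol_weight_solute):
        """Return solubility in g/L from relative NMR peak integrals."""
        # Moles scale with (integral / protons represented by the peak)
        ratio = ((integral_solute / protons_solute)
                 / (integral_std / protons_std))
        return ratio * conc_std_molar * mol_weight_solute

    # Hypothetical values: a vanillin sample against a 0.05 M standard.
    print(round(solubility_from_nmr(3.2, 1, 1.0, 3, 0.05, 152.15), 2), "g/L")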
So this has been very helpful, because the students will make mistakes. That's
going to happen; it's just a question of how easy it is to find those mistakes.
And if you have bots that can go in and take up a lot of the drudgery, a
lot of the steps where it's very easy to make mistakes, then they can contribute
that, and the humans can contribute what they do very well. So as you go
through this and see how the information is represented and the APIs that we
have, we really want to move towards more and more automation.
Okay. So the last part of this, in terms of what we're going to do with the data, is
that we've started to build models. So if you're looking for a solubility measurement,
it might already have been measured; but if it hasn't been measured, we're trying
to come up with ways of predicting the number for you. So if you want to do a
reaction that hasn't been done, maybe you can take a guess at what a good
solvent might be. So that's something that we're currently working on.
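The talk gives no detail on the modeling itself, so the following is only a generic
sketch of the idea: fit a regression from solute/solvent descriptors to measured
log-solubility, then predict for an unmeasured pair. The descriptor choice and all
numbers are made-up placeholders, not project data or the project's actual model.

    # Generic sketch: regression on hypothetical descriptors.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Rows: [solute melting point, solute logP, solvent polarity] (made up)
    X = np.array([[ 81.5, 1.21, 0.76],
                  [114.8, 1.05, 0.65],
                  [ 81.5, 1.21, 0.10],
                  [159.0, 0.43, 0.76]])
    y = np.array([0.12, -0.35, -1.40, -0.88])  # measured log10 solubility

    model = LinearRegression().fit(X, y)
    print("predicted log S:", model.predict([[114.8, 1.05, 0.10]])[0])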
Now, one of the criticisms of this project is often that, you know, we're just
empirically collecting numbers and putting them together. But actually there's
some good science that we've discovered from this as well. Because we're using
this NMR method, we can actually see all of the chemicals that are being
produced, if any, during the making of the solution. And what we found by
accident is that some compounds like this one, in alcoholic solvents like methanol,
actually form what's called a hemiacetal very quickly.
And in this case, the solubility of this compound was actually
reported back in 1982, and they did not find that there was a chemical reaction
between the solvent and the solute. And, you know, maybe the information was in
there, but the amount of detail provided means we can't go back and see whether
they missed it or whether the method they used, you know, wasn't going to detect it
anyway. So this is kind of interesting. And it turns out this is pretty general
for a whole class of compounds. So solubilities that have been reported in the
past are not really solubilities; they're actually reactions.
Okay. So, finding the data. I showed you the Google spreadsheets where we store
everything. Again, if you don't know that spreadsheet exists, how are you likely
to find it? We get about 100 queries a day on specific solubility requests. And
most of them come from a Google search or a Wikipedia search.
On Wikipedia, if you look up the molecules that we've done, you'll see there's
this chem infobox. And we have the reported solubility, and then there's a link.
You'll notice that none of the other properties typically have links in Wikipedia.
So what we're trying to do, again, is to bring in that concept of proof as opposed to
trust, okay. So if you want to find out where these came from, click on that link.
It takes you to a table of all the different solvents. Clicking on one of these
shows you the individual measurements.
So different experiments provided different numbers for the same solute and
solvent. You can then drill down again: click on one of these, it takes you to the
lab notebook page, and then you click through to the associated Google
spreadsheet that has all the calculations and everything. So if you want to use the
number quickly, that's great. If you find there's something wrong with the number,
you can go in and try to see what it's based on.
Okay. So a couple more issues. How does this affect publication? Well,
because you're making all of this available in real time,
some publishers will consider it a preprint, and if their preprint policy is
nonexistent, then, you know, you won't be able to go to that journal. But it turns
out that there are enough journals out there that are peer reviewed and that
will accept preprints.
So we've gone through and not only used the lab notebook for a
particular project, but we've actually written the paper on a public Wiki. So all of
the drafts are available as well. And the idea here is that at all times the
world can know exactly what our state of knowledge of that particular project
is.
Now, this, of course, is a classic preprint. It just happens to be on a Wiki. But it's
just the paper that's already been made public. And again, I was talking about
citation in my very first slide. This is a very convenient thing: if your
lab notebook is public, you can actually use a specific page as a citation.
Okay?
So you'll notice here that the melting point, for example, was taken from
experiment 99, whereas the NMR was taken from experiment 203. Normally in a
paper you don't see that distinction. But it turns out that these are different
batches that were made, and maybe they're not identical. So if you're trying to
find out why your melting point isn't corresponding, you would have to look at
the specific batch where, you know, that compound was made to see if there
might be an issue.
So we published this in the Journal of Visualized Experiments, okay? And this one
actually has a video component as well. But that's sort of beside the point.
There's also, you know, the actual text that we wrote on the Wiki.
And another tool that we made use of was Nature Precedings. So while the
paper was under peer review, we submitted it to Nature Precedings, and this has
the advantage that it has a DOI, it has a standard author list, and it's archived by
Nature, so it is something that I've found does tend to get cited. And, you
know, you have nothing to lose if the journal accepts preprints.
Now, when the paper comes out, the Journal of Visualized Experiments is open
access. So we've actually really grown to value this tremendously, because you
retain the copyright, and that lets you repurpose that exact same content for
whatever purpose you want.
Tony's going to show this a little bit later, but you can take that same paper and
run it through ChemSpider's journal viewer; it will automatically find the molecules,
and when you hover over them, it will show you the image. Okay?
So if we had published in a non-open-access journal, we'd have some issues
getting permission to actually repurpose it. And there are opportunities that just
keep coming and coming when you retain that copyright. So this same paper
was turned into an application note with [inaudible], because we had borrowed
their robot to actually do the experiment. So again, you have this way of, you
know, redundantly distributing your message.
Okay. And here's where a little issue comes up, okay, because if you're
repurposing the same content, trying to reach a wider audience, it's going
to affect the number of times that each one of those sources is actually looked at.
So while I'm a huge fan of article-level metrics, I think you have to be careful how
you interpret your success or failure based on them. Okay? The number
gives you some very useful information. But, you know, don't put your
self-esteem into any specific article-level metric.
Okay. So there are other approaches to these open notebooks. Cameron
showed briefly that he uses a modified blog engine.
We have Steve Koch here; his student uses OpenWetWare, which is another
Wiki system. Okay?
And so there are other approaches to this; if you're interested in learning more
about them, we can actually discuss them with you.
So the final piece of this is what we do in terms of archiving the material,
having a place to cite it, taking a snapshot. So I think there's
a huge opportunity right now for libraries in how they manage this kind of
information.
What we've been doing recently is actually coming up with a way of archiving
these so that, you know, people can actually see what the state of knowledge
was.
People often ask why we don't use the Wayback Machine. Basically, it
doesn't work very well for our project. I don't know if you've ever tried to look at
some of your own pages on there. There are actually entries, okay, but they
are not taken every month; they are not taken very often. And this is what they
look like: even though Wikispaces is not protected by a password, for some
reason when they try to archive it, every page looks like this. So you can't really
rely on these default mechanisms. You have to take a proactive
approach, I think.
So with Andrew Lang we've basically gone through and written some code to
address specific kinds of archiving, specific kinds of backups. So we have this
ONSPreserver, for example, that will go through a Google spreadsheet that has
a list of all the items that are high priority, that we want to, you know, keep backed
up. So it will make a copy whenever it runs. We use Windows Scheduler; it just
goes in once a day and executes it.
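The actual ONSPreserver code isn't shown here, so the following is only a sketch
of the same idea under stated assumptions: walk a list of high-priority
spreadsheet keys and save each one into a date-stamped folder as an Excel file,
with the script scheduled to run once a day (for example, via Windows Scheduler,
as he describes). The export URL is Google's current endpoint and assumes the
sheets are publicly readable; the keys are hypothetical. Why Excel matters is
explained just below.

    # Sketch of a daily spreadsheet backup in the spirit of ONSPreserver.
    import datetime, pathlib, urllib.request

    PRIORITY_KEYS = ["SHEET_KEY_1", "SHEET_KEY_2"]  # hypothetical keys
    day_dir = pathlib.Path("ons_backups") / datetime.date.today().isoformat()
    day_dir.mkdir(parents=True, exist_ok=True)

    for key in PRIORITY_KEYS:
        url = ("https://docs.google.com/spreadsheets/d/%s/export?format=xlsx"
               % key)
        urllib.request.urlretrieve(url, day_dir / ("%s.xlsx" % key))
        print("backed up", key)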
Okay, and luckily Google spreadsheets have a very nice option: they can be
downloaded as Excel. And the reason that's important is that I was telling you
about the calculations, and I also told you about the web services we were calling.
When you store them as Excel, it retains all the calculations, and even
though it captures the number from a web service call, it will give you the link of
the web service that it actually used. So again, if you want to track back to see
exactly where the information comes from, this is extremely useful.
So where we are right now in terms of this whole archiving issue: we
have a service that, again, runs once a day and simply backs up the actual Google
spreadsheet that summarizes everything. Then we take periodic snapshots
where we actually make copies of all the relevant files and the lab notebook.
And then we can put them in citable storage. So we have them available
from lulu.com, and you can actually buy the CD at cost. It's like five dollars. And
that will have the archive for a particular day.
We also published it as a book. And if you're interested in seeing this, it basically
takes the Google spreadsheets and puts them in a human-readable format so
you can browse through the book. And this book corresponds to this data
archive. Okay? So that is the concept here.
Okay. So, from a slightly more technical standpoint, the way this works is that
Wikispaces has a way of exporting the entire Wiki as HTML. So we start with
that, and it makes local references to any images or files that were
uploaded to the Wiki. And then we have all the spectral files, the NMRs;
that's actually a very short manual step. And then Andy has written code that
goes through and identifies how many Google spreadsheets are cited
from each page. It downloads them and puts them in a file so you have a
local reference when you look at the archive.
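Andy's actual code isn't shown here. As a sketch of that localization step,
assuming the export lives in a local wiki_export/ folder and the spreadsheet links
follow the current docs.google.com pattern, something like this would download
each cited spreadsheet and rewrite the page links to the local copies, so the
archive is self-contained:

    # Sketch: localize Google spreadsheet links in an exported Wiki archive.
    import pathlib, re, urllib.request

    SHEET_RE = re.compile(
        r'https://docs\.google\.com/spreadsheets/d/([\w-]+)[^"\']*')
    export_dir = pathlib.Path("wiki_export")  # Wikispaces HTML export
    (export_dir / "sheets").mkdir(parents=True, exist_ok=True)

    for page in export_dir.glob("*.html"):
        html = page.read_text(encoding="utf-8")
        for key in set(SHEET_RE.findall(html)):
            local = "sheets/%s.xlsx" % key
            url = ("https://docs.google.com/spreadsheets/d/%s/export"
                   "?format=xlsx" % key)
            urllib.request.urlretrieve(url, export_dir / local)
            # Point every link to this spreadsheet at the local copy
            html = re.sub(
                r'https://docs\.google\.com/spreadsheets/d/%s[^"\']*' % key,
                local, html)
        page.write_text(html, encoding="utf-8")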
So if you go into this archive, for example, from February 11th, you'll notice that
this is all local. And when you click on any of these links, it will just
redirect you locally within that archive. So again, this is the concept of the snapshot:
on this particular day, what did all the data sources look like? Okay?
So this is the way it looks on Lulu when you publish these. We've also used
DSpace at Drexel as a zip archive, okay? The difference is that when you
download this, there are certain functions that you can't do, like viewing the
interactive spectra. Okay? But otherwise, everything else is basically the same.
So we have made use of these. And we have this data book, which is available.
So this is kind of interesting: they charge a fairly minimal amount to simply, you
know, print out the book and publish it, and then you simply pay for shipping.
And in the end this is what it looks like. Okay? So that's basically it for the
archive.
Let me just step through for a minute here the initial questions that I
asked. Is our current system working? I showed you three examples where, you
know, it isn't working the way that chemists would like it to.
Is it difficult or expensive to implement ONS? Not necessarily. I showed you that
Wikispaces is free and hosted, and Google spreadsheets are completely free and
hosted. And the code that we've written for all this archiving, of course, is available
to anyone. So it's certainly possible.
Does it prevent peer-reviewed publication? No, but you do have to look at the
publisher to make sure that they allow it.
Is it discoverable? Yes; like I said, Google and Wikipedia are the main ways
people find it.
Can it be easily archived? This is a very recent thing, and I think we're pretty
good on archiving right now. But in terms of intellectual property protection, I
think this is very problematic. If you intend on patenting anything, you have to,
you know, understand that this is a disclosure, every day, every hour
that you make this available. Although Steve Koch actually has a very interesting
strategy, because you do retain the ability to file for a US patent for a year after you
make it public. So even though you don't have international rights, there are still
some ways that you can get some intellectual property coverage. But for the
work that we're doing, that's not an issue. We're just not going to pursue that.
So I'd like to thank you very much for your time.
[applause].
>> Lee Dirks: Are there any questions for Jean-Claude?
>>: On the Google spreadsheets, is there any plan to make the SMILES into chemical
structures -- you know, actual chemical structures so that you could search that
way, or do you ultimately have to rely on exporting to Excel and using some plugin
to do chemically aware searching?
>> Jean-Claude Bradley: Yeah. I mean, the idea is you have the API -- yeah,
you can query it in sophisticated ways, in terms of only give me the
compounds within this solubility range, or do a substructure search. So, yeah.
So those do exist, but they don't exist as an inherent property of the Google
spreadsheet. That would come through the API. And we have those things
written. So that's not an issue. But yeah, we do use SMILES; it's probably
the most convenient way to store the information in a spreadsheet like this.
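As an illustration of the kind of chemically aware search he mentions, done
outside the spreadsheet itself: export the sheet as CSV and run RDKit over the
SMILES column. The column names and the phenol substructure query here are
hypothetical, and this is not the project's actual tooling.

    # Sketch: substructure search over a SMILES column with RDKit.
    import csv
    from rdkit import Chem

    pattern = Chem.MolFromSmarts("c1ccccc1O")  # query: a phenol substructure

    with open("solubility_export.csv") as f:   # hypothetical CSV export
        for row in csv.DictReader(f):
            mol = Chem.MolFromSmiles(row["smiles"])
            if mol is not None and mol.HasSubstructMatch(pattern):
                print(row["solute"], row["solvent"], row["concentration_M"])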
>>: Well, one thing that's happened -- and I was doing a search on John Wilbanks --
is that he had a column that said the Journal of Visualized Experiments, JoVE,
started as open access and last year decided, arbitrarily on their part, I guess, not
to be open access. So is that a problem if you're predicating everything on an
open notebook and then suddenly one of your sources says, oh, by the way, it's
not going to be? Can you get access to your own data and extract it and then
post it yourself somewhere else?
>> Jean-Claude Bradley: Well, you can do that. But I mean, the other thing is
JoVE has not -- like, what was open access is still open access. So our
paper is still open access.
What they were saying applies going forward, yeah. And you can still make a
paper open access, you know, the gold approach; you can pay for it. So that's still
available. They haven't, you know, done a bait and switch where it was open
access, or we thought it was, and then they removed it. It still is open access.
>>: So the whole system you have is really, really impressive, and a demonstration
of how well open science is working. I'm curious: if a chemist just like you who is
not doing any of this wanted to start doing it, how difficult do you think it would be?
>> Jean-Claude Bradley: So the question is how, you know, you can get
involved or start. I went through very quickly and showed you the best of
what we've been doing. You certainly don't have to implement every single thing
that we have. I have actually taken a chemist through this, and, you know, it's just
a question of creating the Wiki, showing a little bit how to create the Google
spreadsheets, how to connect them.
There's also a blog component to this where we kind of talk about our progress
and then link to the actual notebook pages because we certainly don't expect
people to read the notebook like a magazine.
But I mean in terms of some of the archiving and, you know, some of the more
sophisticated stuff like that, I don't think you need that initially to get started. I
think you just simply need to have the Wiki created and, you know, learn a little
bit about how it works.
I like Wikispaces too because they have a pretty nice visual editor, so you
don't really need to know any Wiki markup to get going, which I like for
my students.
But, yeah, I think the whole point is there are a lot of people in the open science
community who are very happy to collaborate and help. So what I would suggest
is to talk to someone who is doing it and then have them help you set something
up to get started. And that does seem to work well.
>>: [inaudible] Lulu, do I get to specify which day's snapshot is --
>> Jean-Claude Bradley: Yes. Yeah. This is the third edition. And each edition
has its own entry on Lulu.
>>: Okay. So editions aren't every day; there could be a new edition, depending
on [inaudible] spreadsheet?
>> Jean-Claude Bradley: Probably every couple of months we want to put out a
new edition. As we implement new things -- the archive, for example, wasn't
part of edition two -- we have a preface where we explain the
new features, and then we just happen to take a new snapshot. But, yeah, that's
actually a good question, how often we do the archiving.
I think, you know, every couple of weeks or months makes sense at this point.
[applause].
>> Lee Dirks: If only all scientists were this thoughtful and diligent in their
approach. Some amazing best practices there.
All right. Well, let's take a break. Let's start again at 11:30 a.m. sharp. That gives
us about 18 minutes. Grab some coffee and chat with your colleagues. Thank
you.