>> Lee Dirks: Thank you very much for coming back on time. We will go ahead
and try and stick on schedule. And I'd like to turn it over now to Tony Williams to
come talk to us about ChemSpider and other things going on at Royal Society of
Chemistry.
>> Antony Williams: Is this active? So first of all, thank you for the invitation. I
love it in this part of the world. I actually got engaged out here a few years ago,
and life has been great since then. I'm going to tell you a little bit about a
pragmatic vision. So the people who know me would say that I just get down to
it, I don't let anything stand in the way, and I'll do silly things that are against most
recommendations. This is an example of that. This is actually a hobby that
started off in a basement.
It started off with a vision of building a structure-centric community for chemists.
So I was watching what was going on with PubChem and things like eMolecules
and databases online of chemical structures. And I was working in a company
where we dealt with prediction algorithms for structures, we drew them, we
databased them.
I was working a lot of hours, and I decided to do something that was creative and
fun. Hard work is not always creative. That's one of the challenges of having a
career. So started off with a hobby project in December of 2006, and it was to
integrate chemical data on the web. I liked what PubChem was doing. They
were very focused on dealing with assay data. I started having a concept about
how could you do something Wikipedia-like around chemical structures
specifically. So we integrated chemical structure on the web. That was the
vision. Make it a hub to information. Because the web is just covered in
information about chemical structures, but there's no easy way into it and out to
all of the data.
We wanted to provide access to structure-based algorithms, so because I was in
the world of chemical informatics, I knew a lot of people who were building
algorithms who might be willing to share them.
We wanted to enable chemists to contribute their own data, because there's a lot of
data that get generated that sit in notebooks, sit in pads, and never find their
way anywhere, so let them put it somewhere. Wikipedia's not the place for
everything that's getting run because it's an encyclopedia.
Google spreadsheets could work, for example. It certainly ends up in Excel
spreadsheets and Word documents, but then it gets limited, it doesn't get shared. So
we wanted to share it. And we wanted this ability to allow the community to
create and correct data. Because of course, everything they write about Tom
Cruise on the web is true, so therefore everything about chemical structures on
the web must be true. There is so much stuff out there it's scary how bad it is.
Chemistry online today? Where is it? I'm not going to read this list out. It's all
there. It's encyclopedias and property databases and drug discovery data and
publications, chemical structure information is all over the place.
Searches on the Internet today are primarily limited to text-based searching.
That's probably what you would know about it if you were trying to search for
information about a particular compound.
The data are certainly dirty. I'll show you some examples. It's very difficult to
figure out what's good and what's bad. You just don't know. Who can you trust?
Who are the authorities out on the Internet? And too many searches are
required to source data. You do a search on any of the search engines, a
text-based search for a chemical compound, and you can get anywhere from a
few tens of hits to millions. And now how do you figure out what's behind it?
So humans do not want to deal with lots of interfaces. They actually want to deal
with something that's rather simple to use. That's why they mostly go to a single
search engine today. They want as few interfaces as possible, but to get as
much integrated information as they can.
What would the future look like? And actually how far away is the future?
Because it's coming very, very quickly. The Semantic Web for chemistry is in
place. That's our future. Whatever this thing, this Semantic Web is going to end
up to be, it's going to be there.
Crowd sourced contributions. Well, crowd sourcing is here. When you read
Amazon, book reviews, movie reviews, I mean that's the crowd sourcing right
there. Wikipedia, that's crowd sourcing.
Chemists will be searching the Internet based on structure and substructure, not
just text. There's no reason they shouldn't work in their own language.
Chemistry articles are indexed and searchable, both text based as well as
structure and substructure. Why not? It just makes sense for a chemist.
Again, reduced number of searches. The data are integrated. So if you want to
find all publications about a chemical, that's easy, and all the patents and where
to buy the chemical and why not all the analytical data and all the melting points,
boiling points and properties that Jean-Claude and his group are putting up?
Why not. Makes sense.
And then we're going to be in this world of open access and open data. Peter's
almost certainly going to talk to you about open data. He's a better person to do
that than me. This is coming. In fact, in many ways it's already here. We're just
not naming it open data. Classical business models, they will have to change.
So we had a vision, and we went off to get it done. In March 2007, we went to an
international chemical society meeting in Chicago and we released this
searchable structure database online. We bought one computer, and we built
two with our hands, buying a motherboard from a company called Tiger Direct
and hard drive and chips and throwing them together. We plugged it into the
cable Internet that serves TV signals and voice over IP and started letting people
do searches.
We seeded it with 10 and a half million structures from the so called public
domain. We sourced them from PubChem. We released it with structure and
substructure searching. And it went live.
June 2007, a few months later, we realized that the data that we had released
out to the Internet was very dirty. Just in our lookups we were finding a lot of
errors. So we had to figure out how to make that clean. Well, 10 and a half
million structures with maybe 20 to 30 to 100 names, lots of properties
associated with them, leaves you open to a whole lot of data issues.
So we needed the crowd to help us clean it. So we put on a curation layer. We
said okay, 10 and a half million is a good start. And we will aggregate more data.
But we'd like you, the scientists out there in the world who are using our system
to add your data. You might not want to publish it in a classical
peer-reviewed sense, but you might want to share information, so why not?
So we added on a deposition interface. And so it continued.
This is the world of search engines today. This is what people are used to. Most
people are working with a single box that says enter something to search. In
the world of chemistry you can have chemical names, such as aspirin, erythromycin,
Vancomycin. Those are very common types of searches.
But then we have this thing called a SMILES string, which is a way to represent a
chemical structure. So we let you paste a SMILES string. We also have things
called registry numbers, and those exist in registry systems such
as the Chemical Abstracts Service.
And we also have InChIs. How many people in the room have not heard of InChI?
Okay. I will tell you a little bit about InChI shortly. Let me swap over here to -- if I
can figure this out. I'm trying to figure out how to get to the live Internet. Oh,
there we go.
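The input types just listed (chemical name, SMILES string, registry number, InChI) can be told apart with a few pattern checks before a search is dispatched. The following is a minimal sketch under stated assumptions, not ChemSpider's actual code: the function names are hypothetical, the SMILES test is a rough heuristic rather than a parser, and the registry-number test uses the published CAS check-digit rule.

```python
import re

# CAS Registry Numbers look like 50-78-2: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

# Rough SMILES token set: common atoms, aromatic atoms, bracket atoms,
# bonds, branches and ring closures. A heuristic only, not a SMILES parser.
SMILES_PATTERN = re.compile(
    r"^(?:Cl|Br|[BCNOPSFI]|[bcnops]|\[[^\]]+\]|[=#()\-+/\\@.%\d])+$"
)


def is_valid_cas(text: str) -> bool:
    """Check the CAS number layout and its check digit."""
    match = CAS_PATTERN.match(text)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def classify_query(text: str) -> str:
    """Guess which kind of identifier was typed into the single search box."""
    query = text.strip()
    if query.startswith("InChI="):
        return "inchi"
    if is_valid_cas(query):
        return "cas_number"
    if SMILES_PATTERN.match(query) and any(ch.isalpha() for ch in query):
        return "smiles"
    return "name"  # fall back to a text search on names


for example in ["aspirin", "50-78-2", "CC(=O)Oc1ccccc1C(=O)O", "InChI=1S/CH4/h1H4"]:
    print(example, "->", classify_query(example))
```

The heuristic will misjudge some inputs (a short word made only of letters such as "noon" passes the SMILES pattern), which is one reason a real system would confirm the guess against its structure index.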
So this is ChemSpider. ChemSpider.com. Free to everybody to access, no
charges to use it. If you search cholesterol, it's about .02 of a second to search
it. That's running now out of Cambridge because ChemSpider actually worked
and the Royal Society of Chemistry came and acquired it. There's a chemical
structure, there's some properties, there's a systematic name. This is a SMILES
string. That represents that molecule. These are InChIs. I will tell you more
about what they -- they're about shortly. But they are text strings. Often they are
text strings that represent this molecule.
There's a link to a Wikipedia article, but it doesn't make sense for us to pull the
entire Wikipedia article and database it, because it's changing all the time.
People are rewriting these things and adding information. So we let you read it.
Here's a list of patents. If there's a new patent available, then it's going to find it
on this search. These are linked out directly to patents. Down here we have list
of articles. These are articles on PubMed. PubMed is publishing multiple
articles a day. This is a web service on PubMed. So if there's a new
article out, it will show up. We don't cache this information for more than a few
hours. So a new article on cholesterol will show up here.
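Caching a remote lookup for a few hours, as described here, is commonly done with a time-to-live cache. The sketch below is an assumption about how such a cache could look, not the ChemSpider implementation; `TTLCache`, the `fetch` callable and the three-hour value in the usage note are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache results of a slow lookup and refetch once they are too old."""

    def __init__(self, fetch: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical call to the remote service
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh, serve the cached copy
        value = self._fetch(key)     # missing or stale, ask the service again
        self._entries[key] = (now, value)
        return value
```

A cache built as `TTLCache(fetch_articles, ttl_seconds=3 * 3600)` would serve the stored article list for up to three hours and pick up a newly published article on the first request after that.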
And this is only a subset of them, by the way. You can always click-through to
see the full list.
Here's a list of supplemental information where there's additional information
about it. These are articles that haven't found their way to PubMed
because not everything goes to PubMed. If it's not about medicine, then it
doesn't necessarily belong there.
There's some properties with direct links. You can see down here we have the
actual link out to the original data source who said that is the particular value.
And you can click-through, and you'll find your way out into the original data
source. And notice we've got melting points with units and without units. That's
because of what the data source provides.
Here's a whole set of chemical names, and there are a lot more. These are all
names for cholesterol. These are validated names, the bolded ones. These are
not validated. There's no highlights on, they're not bolded. These are different
types of database IDs. Here's some database -- there's a long list of database
IDs if I had enabled it. And it's just too much for this particular page.
So we've got all of this information showing up on ChemSpider in 0.02 seconds. It's
got 23 and a half million molecules today. And it's linked out to 300 data
sources. Let's go back here. So one of the other things we have on here are live
data, so spectral data that get submitted. We don't run any spectral data. We
can't afford a half million dollar spectrometer. So we'll let anybody who is running
the data submit it to the system.
It's fully interactive, similar to what Jean-Claude was showing. This is an open
source applet where we allow people to zoom in and download the data if it's
open data.
The data sources where we aggregate the data from, where possible we link out
to them directly. So we're building, if you like, a link file. In this case, this little
icon here says it's a chemical vendor. If you want to go buy the material, you can
go buy it. If you hover over the hyperlink that goes out, then it will show you a
snapshot of what's on that page. In this case, you get a material safety data
sheet in Chinese, English, and Korean for cholesterol. For some of these MSDS
sheets, I've seen up to 16 languages represented.
It's not sitting on our server. It's a link out to somebody else's server. And we
have 300 data sources that we're linking out to now. This is the Environmental
Protection Agency, EPA. DSSTox. A metabolome database, the Food and Drug
Administration; we are aggregating all of the data.
When we do that, however, we've been hitting some interesting issues I'm going
to tell you about. This is an example of how extensive this data can get. This is
KEGG, the Kyoto Encyclopedia of Genes and Genomes. These are massive
metabolic pathways, I mean, just massive. You're scrolling around the page
finding your way around.
In this case, we clicked in through the structure of aspirin, which is just up here on
the left hand side. Again, we can't host all this data because it's changing on
everybody's website. Patents and structures. This is actually the new way we're
going to be showing it in a couple weeks. We can see US Patent Office, EPAB,
European, Japanese patents are all there. The new way we're going to be
showing articles.
I showed you the simple way to search a name, a chemical name or a SMILES
string. In reality, chemists don't want to work that way only. That was magic.
Because if you have a chemical that is named in 300 ways, how would you do
that text search? You'd have to know all 300 names to do it. So you have to
disambiguate through the structure in most cases. So we allow you to do
structure searching, we allow you to do substructure searching. We allow you to
search based on properties, the presence of certain elements, the absence of
certain elements. We allow you to search a molecular weight range and on and
on and on. Incredible layers of complexity you can search by that people are
using, but most of us are used to Google and Bing type searches, so most
chemists come in and search based on name.
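The property filters mentioned above (molecular weight range, presence or absence of certain elements) can be illustrated on a handful of molecular formulas. This is a minimal sketch, not ChemSpider's search engine: the function names are hypothetical, the formula parser handles only simple formulas without parentheses, and the atomic weights are rounded standard values for a few elements.

```python
import re
from typing import Dict, Iterable, List, Optional

# Rounded standard atomic weights for the few elements used below.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

RECORDS = {
    "aspirin": "C9H8O4",
    "cholesterol": "C27H46O",
    "methane": "CH4",
    "thalidomide": "C13H10N2O4",
}


def parse_formula(formula: str) -> Dict[str, int]:
    """Count atoms in a simple formula such as C9H8O4 (no parentheses)."""
    counts: Dict[str, int] = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula: str) -> float:
    return sum(ATOMIC_WEIGHTS[s] * n for s, n in parse_formula(formula).items())


def search(records: Dict[str, str],
           min_weight: Optional[float] = None,
           max_weight: Optional[float] = None,
           must_contain: Iterable[str] = (),
           must_not_contain: Iterable[str] = ()) -> List[str]:
    """Return names whose formula passes every filter that was supplied."""
    hits = []
    for name, formula in records.items():
        elements = set(parse_formula(formula))
        weight = molecular_weight(formula)
        if min_weight is not None and weight < min_weight:
            continue
        if max_weight is not None and weight > max_weight:
            continue
        if not set(must_contain) <= elements:
            continue
        if elements & set(must_not_contain):
            continue
        hits.append(name)
    return hits


print(search(RECORDS, min_weight=150, max_weight=300))
print(search(RECORDS, must_contain={"N"}))
```

A production system would compute these properties once at deposition time and index them, rather than parsing formulas on every query.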
Once you come to a ChemSpider record, you're sitting on a structure that is
linked out to many things. I've shown you some of them, publications, analytical
data, related reactions, Wikipedia, patents. The question is where should it stop?
How big can this go when you have a structure centric hub and essentially it's all
about what you want to plug into this. And it can be blogs and Wikipedia -- Wiki
links, et cetera.
These are typical questions that a chemist would ask in their language, standard
English. What is phenolphthalein? What's the structure? What does it do?
What's the side effects of it, what's the toxicity of it? All of those questions there
can be answered today by ChemSpider. And it's getting better and better and
better in terms of what we integrate.
What's the compound? The top, a graphical depiction, is a chemical
structure. The one below it is a CAS number. The one below that Oximonam is
a trivial name. The one below that is a systematic name. They're all the same.
They're just -- they're just different because there are different ways that you
have to label things. If I say Oximonam, I'm probably a pharmacist. If I showed
you the picture of the molecule then you wouldn't know what to do then unless
you're a chemist. So that's why we have to work in these different interchanges.
So because we're a structure-centric hub and we're linking out to the Internet, I'm
going to bring together patents and publications and all of the data that's out
there that we can link. We're now making a structure-based searchable Internet.
The question is how -- how good can it be.
This is the type of linked data on the web that is showing up now with different
services. So DBpedia is out there, ChEBI, which is Chemical Entities of
Biological Interest from the European Bioinformatics Institute. We've got
PubChem, we've got PubMed. All of these things are linking together with
specific links such as names. But this InChI has shown up. I'm going to show
you a little more about that.
Aspirin, I'm sure most of you have heard of aspirin. Taxol. Anybody here not
heard of Taxol? Oh, interesting. So it's a Bristol-Myers drug. It's rather difficult.
It was isolated from the Pacific Yew Tree bark. It's a very powerful drug. It's a
natural product. It's very complex. Where would you go find what the chemical
structure of that is? If you went to Wikipedia to look for aspirin, it's very small, it's
easy, it's correct. The original structure of Taxol in Wikipedia was incorrect.
This one down on the left-hand side, you probably know that one -- nothing
personal.
>>: How would I know that?
>> Antony Williams: So a little blue pill from Pfizer. You got to question
everything online. Why are you galling? [laughter] So DHMO.org. So this is a
nice little website. Dihydrogen monoxide. You die if you ingest too much of it.
Long periods of immersion and you will also die because that's called drowning
because indeed dihydrogen monoxide, two hydrogens, one oxygen, is water. It's
a hoax. It's a wonderfully well-done hoax. I suggest you go read it. It's so well
done actually that if you read the Wikipedia article, you hear about the politicians
that tried to get DHMO banned from industrial processes. True stories.
Chemistry on the Internet is messy. You can imagine what might happen there
that might get rather messy. You know, it's probably methane. [laughter]. For
every action there's an equal opposite reaction. He should be flying this way,
right? [laughter].
So what's methane? Wikipedia. Simple organic molecule, one carbon, four
protons, that is correct. Wikipedia article has been validated by numerous
people.
If you go to PubChem, which is a government database housed by the
National Institutes of Health, that's the correct structure of methane. It's
actually labeled as charcoal. Now, if you throw methane on a barbecue in the
summer, rather than charcoal you'll get the type of cooking effect you were
seeing with the cow, but not quite what you would want. If you look at the long
list of names that are associated with methane on PubChem, you will see
diamond. That's not a particularly good one because diamond is not exactly a
gas, and you could imagine handing over a gold ring with [inaudible] it's not the
right thing to do really. [laughter].
And graphite, also not methane and also not diamond. And buckyball is also
listed. This is a database of chemistry. It was from PubChem. It's an excellent
platform. PubChem is a wonderful platform for data. However, they're not
responsible for curating it. As a result, data has been showing up on
PubChem for the past few years, and it's kind of public domain, so people have
been taking it and putting it in their own databases. And now you have this
proliferation of errors all over the place. It's actually quite shocking.
Is that the right structure of Vancomycin? Only some of you are chemists. But
even those of you who are chemists I would not imagine that you could check
every stereocenter very easily. That's a rather complex molecule. I'm sure a
number of you have had Vancomycin at some point. That is the correct
structure.
If you search PubChem then you end up with I think three or four pages of
molecules called Vancomycin. People are taking these data into models, they're
building models of prediction, they're using it to source information. So which
one's Vancomycin? Which one should you use?
Actually, the structure of Vancomycin is primarily an assertion that comes from
analytical data, it comes from who says it's what. We just released a publication
that shows how many articles are published with incorrect chemical structures
and it's absolutely shocking. Good science to the best of their abilities, but still a
lot of errors. So we've cleaned up a few hundred of late.
We had inherited all of those errors about Vancomycin on to ChemSpider, plus
many others from other sources. So we actually had to go clean it up. You do a
search on ChemSpider today there's one Vancomycin. It took three days and
multiple e-mail exchanges with scientists at the EBI to figure it out. Now you go to our
article, and it will tell you why we say this is Vancomycin. Direct links to original
publications. One would assume the experts would get this stuff right.
This is a web page about harmful algal blooms. That's domoic acid, which
kills people. You get shellfish poisoning. You would assume that's correct. In
fact, it's incorrect. Every stereocenter on that molecule from the experts is
wrong.
The bottom right hand side with the red arrow is domoic acid on Wikipedia, also
wrong. Top right hand side is the structure of domoic acid from the American
Chemical Society's C&E News article. Also wrong. Do you see a similarity
between the C&E News picture and the Wikipedia picture? That's because C&E
News is taking data from Wikipedia directly. Wikipedia is being used as an
encyclopedia and an authority.
One would hope in the future that you could trust all of that data. I believe you
will. You will be able to do that. Domoic acid's cleaned up now. We've been
working on curating every chemical structure that is sitting in Wikipedia. We
checked every stereo bond, every connectivity, we've cleaned up lots and lots of
errors. This is the correct structure of domoic acid on ChemSpider.
The InChI is a way of representing the structure, as I said, in alphanumeric text.
So here you see a couple of examples. It has formula in it, it has isotope details,
it has stereo layers, that whole string there can represent that full molecule. It's a
very, very good way to encourage structures to link together on the web.
The problem is search engines will truncate very, very long strings. So if it's a
very, very big complex molecule and you try to search it, it will just drop the end off.
So now you're stuck. So they had to come up with a way to make that a little
more able to happen on search engines. So they built a hash. It's an SHA-256
hash: you take the molecule, you create the InChI string at the top,
you convert it to a hash and now you have a fixed format for that molecule.
There's two issues with that. One is you cannot go from the hash directly back to
the molecule. And you also cannot go from the hash back to the string. You can
only do it by doing a lookup. There's no way to reverse that hash. So what do
you do?
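The trade-off just described, a fixed-length key that cannot be reversed and therefore needs a lookup table, can be shown with Python's standard `hashlib`. This is a sketch under loud assumptions: `toy_key` is NOT the real InChIKey algorithm (the real key uses its own encoding of a truncated SHA-256 digest), and `ToyResolver` is a hypothetical in-memory stand-in for a resolver service.

```python
import hashlib
from typing import Dict, Optional


def toy_key(inchi: str, length: int = 27) -> str:
    """Fixed-length key from SHA-256. Illustration only, not a real InChIKey."""
    digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
    return digest[:length].upper()


class ToyResolver:
    """Maps keys back to the strings they came from; the hash cannot do this."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}

    def register(self, inchi: str) -> str:
        key = toy_key(inchi)
        self._by_key[key] = inchi
        return key

    def resolve(self, key: str) -> Optional[str]:
        # Returns None for a key nobody has registered: there is no way to
        # compute the original string from the key alone.
        return self._by_key.get(key)


resolver = ToyResolver()
methane_key = resolver.register("InChI=1S/CH4/h1H4")   # methane
water_key = resolver.register("InChI=1S/H2O/h1H2")     # water
print(methane_key, "->", resolver.resolve(methane_key))
```

However long the input string is, the key has the same length, which is what makes it safe to paste into a search engine; the cost is that only a registry that has already seen the structure can translate the key back.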
This is Taxol. All the way back to that Pacific Yew Tree bark natural product. As
you can see, rather complex. Below it is the string. Again, rather long. And
below it is the hash.
If you search across the databases on the Internet for Taxol, you will find different
hashes. There are three different hashes for Taxol. Two of them are the same
structure. One of them is different. And yet all three hashes are different. Why?
Because you can have different settings when you generate the string. So that
was a problem. So what they came up with was a way to create a standard
InChI. It's a standard set of options that will always produce the same outcome
for any of the databases, as long as the input is the same, the molecules have to
be the same. Taxol was different by only one stereocenter. That's one position
in the molecule. Does one stereocenter matter? Here we have two molecules
differing in one stereo center only. Anyone know what that molecule is?
>>: Thalidomide.
>> Antony Williams: Thalidomide, yes. One stereocenter does matter. That's
what one stereo center does.
So who says what Taxol is? That's a challenge. It's assertions. If you look
across most of the publications that are out there, many of them have got that
structure drawn incorrectly. Timelines change, so a molecule published can be
revisited a year later, and the structure has changed.
The public data is full of these errors and yet chemists would love to have a
resource that they can trust. And the quality source today is the Chemical
Abstracts Service. But it's not easy to access. It's expensive to access. It's -- you
have to pay for the license fees.
This is Vancomycin. The correct structure of Vancomycin. Wouldn't it be nice to
be able to find all representations of Vancomycin on the Internet? Well, now
we go to this standard InChIKey. If there are databases being built, if there are
patents being issued, if there are publications that are being written where the
structures that are contained within them have standard InChIKeys associated
with them, you should be able to go search. So what we've done is we've put
this directly on ChemSpider and said if you find Vancomycin, you want to find
across the Internet what goes on, you can click on the first part; this piece right
here is the skeleton of the molecule. If you include this part, you click on this
part, then it includes all the stereochemistry. With all of that complex
stereochemistry I would always suggest you search on the skeleton, because people
mess up stereochemistry rather easily. So what do we get? If you search the
full molecule for Vancomycin, so click on the second part of the string, you find
four hits only. Two of them are on ChemSpider, one of them's on PubChem. But
don't forget there's three or four pages of them. So it finds one structure out of
many. And then it finds something else in the chemical register.
Vancomycin is a very, very common compound. So I would expect more than
four. So if I search the skeleton, I find 104. Find 100 more. And they're all
called Vancomycin. These are all sitting on a public compound database. The
top one is highly curated. And it's different.
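The difference between the full-key search and the skeleton search comes from the key's layout: the first hyphen-separated block encodes the connectivity (the skeleton), and the second block covers the remaining layers such as stereochemistry. The sketch below assumes that layout and uses made-up keys and source names purely for illustration; none of them are real records.

```python
from typing import Dict, List

# Hypothetical records: three share a skeleton block but differ in the
# second (stereo) block; the fourth is a different skeleton altogether.
RECORDS: List[Dict[str, str]] = [
    {"source": "database A", "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"},
    {"source": "database B", "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"},
    {"source": "database C", "inchikey": "AAAAAAAAAAAAAA-CCCCCCCCSA-N"},
    {"source": "database D", "inchikey": "ZZZZZZZZZZZZZZ-BBBBBBBBSA-N"},
]


def skeleton_block(inchikey: str) -> str:
    """First block of the key: connectivity only, no stereochemistry."""
    return inchikey.split("-")[0]


def search_full(records: List[Dict[str, str]], inchikey: str) -> List[str]:
    """Exact match: same skeleton and same stereochemistry."""
    return [r["source"] for r in records if r["inchikey"] == inchikey]


def search_skeleton(records: List[Dict[str, str]], inchikey: str) -> List[str]:
    """Looser match: same skeleton, stereochemistry ignored."""
    wanted = skeleton_block(inchikey)
    return [r["source"] for r in records if skeleton_block(r["inchikey"]) == wanted]


query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
print(search_full(RECORDS, query))      # only the exact stereo variant
print(search_skeleton(RECORDS, query))  # every stereo variant of the skeleton
```

Searching on the skeleton deliberately trades precision for recall, which is why it surfaces records whose stereochemistry was drawn wrongly as well as the correct one.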
So the Internet is a mess. Somebody has to take the responsibility to try and
connect it up and clean it up and feed all of the information back. By the way,
when we've made changes to the database when we found errors, we
communicate them back to the original sources and literally 95 percent of the
time they don't make any changes at all. The screen shots I showed of
PubChem with charcoal and methane have been there for three years. I've given
the public presentation 30 or 40 times, and they don't change it. There has to be
some changes I think in there.
This is something called the InChI Resolver. Because a hash needs to get back to
a structure, you can only do it through a lookup. So publishers are starting to
layer InChIKeys on to publications; however, you can't convert the InChIKey to
the structure. So we've had to build a resolver so you can search an InChIKey
and find out what the molecule is. It's a public resource.
These are people we're depending on to grow the resource, to link in more
information, scientists, students, and retired people. We've got retired curators
running on ChemSpider today. From all over the world. This is a curation
screen. It shows you some of the edits that are being made and suggested.
Anybody, anybody in this room can come to ChemSpider right now, suggest an
error, and click on comments and tell us what you think is wrong. And we have a
gentleman who is retired NMR spectroscopist in Germany, Heinz Cushone
[phonetic], second one down. The third one down is somebody in China. I've
never met these people. This is just examples of people who are contributing to
clean it up.
Multi-level curation. So I showed you what methane looked like on PubChem.
Here's a whole list of names removed from that list that came from PubChem.
It's still on the database but they've been scratched out.
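Keeping a rejected name in the database while striking it from display amounts to a status flag on each synonym rather than a delete. The following is a minimal sketch of such a data model; the class and field names are hypothetical and do not describe ChemSpider's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class NameStatus(Enum):
    UNVALIDATED = "unvalidated"   # shown, but not bolded
    VALIDATED = "validated"       # shown in bold
    REJECTED = "rejected"         # kept in the database, struck out


@dataclass
class Synonym:
    name: str
    source: str
    status: NameStatus = NameStatus.UNVALIDATED


@dataclass
class CompoundRecord:
    title: str
    synonyms: List[Synonym] = field(default_factory=list)

    def add_name(self, name: str, source: str) -> None:
        self.synonyms.append(Synonym(name, source))

    def curate(self, name: str, status: NameStatus) -> None:
        """Change a name's status; nothing is ever deleted."""
        for synonym in self.synonyms:
            if synonym.name == name:
                synonym.status = status
                return
        raise KeyError(name)

    def displayed_names(self) -> List[str]:
        return [s.name for s in self.synonyms if s.status is not NameStatus.REJECTED]


methane = CompoundRecord("methane")
for name in ["methane", "charcoal", "diamond"]:
    methane.add_name(name, source="PubChem")
methane.curate("methane", NameStatus.VALIDATED)
methane.curate("charcoal", NameStatus.REJECTED)
methane.curate("diamond", NameStatus.REJECTED)
print(methane.displayed_names())
```

Because rejected names stay on the record, a later import from the same source can be checked against them instead of reintroducing the same error.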
Citizens can become data sources. This gentleman, he's one of my colleagues
from the Royal Society of Chemistry, but he's building his own data source on
ChemSpider. So he's a little subset of 23 and a half million compounds.
He's got 72 of his own molecules. We were just having a discussion about you
can have a vanity site on ChemSpider. Myname.chemspider.com.
It's a multimedia resource, so we host videos and MP3s. This is Theodore Gray
blowing up titanium, making titanium. This is University of Nottingham professor
talking about titanium.
When we build rich resources of structures with dictionaries of names now, all
these trivial names, synonyms, systematic names, registry numbers, then what
you have is the ability to use it for semantic markup.
Peter, will you talk about OSCAR at all? Yes. Okay.
>>: Very [inaudible].
>> Antony Williams: Okay. Peter's been working on a project called OSCAR for a
number of years. The Royal Society of Chemistry uses it as the basis of their
semantic markup. It means finding chemical names inside text, and there could
be multiple other [inaudible] doesn't have to just be chemicals. Finding them,
labeling them, and linking them out. In this case, Project Prospect gives you the
ability to see the chemical structure directly in an article. It makes the data very
discoverable.
This is an example of marking up a Wikipedia article, again using an
entity-extraction system. We can see some names highlighted. However, it
misses a whole set of them, bosentan, fosphenytoin, diltiazem, erythromycin. I
mean, these are pretty common drugs, but it misses them because the dictionary
is incomplete. So you have to depend on good dictionaries.
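The dependence on dictionary quality is easy to reproduce with a toy dictionary-based tagger. This sketch is not OSCAR or ChemMantis; `mark_up`, the sample sentence and the `example.org` link targets are all hypothetical, and real entity extraction handles far more than exact name matches.

```python
import re
from typing import Dict, List, Tuple


def mark_up(text: str, dictionary: Dict[str, str]) -> Tuple[str, List[str]]:
    """Link every dictionary name found in text; keys must be lower case."""
    if not dictionary:
        return text, []
    # Try longer names first so a long name wins over a shorter one inside it.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    found: List[str] = []

    def link(match: "re.Match[str]") -> str:
        name = match.group(0)
        found.append(name.lower())
        return '<a href="%s">%s</a>' % (dictionary[name.lower()], name)

    return pattern.sub(link, text), found


TEXT = "The list included erythromycin, diltiazem and aspirin."

small = {"aspirin": "https://example.org/aspirin"}
full = dict(small,
            erythromycin="https://example.org/erythromycin",
            diltiazem="https://example.org/diltiazem")

print(mark_up(TEXT, small)[1])  # names found with the incomplete dictionary
print(mark_up(TEXT, full)[1])   # names found once the dictionary is extended
```

Nothing in the tagger changes between the two calls; only the dictionary does, which is the point being made about missed drug names.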
We built something called ChemMantis. You have a spider, ChemSpider, so
ChemMantis just made sense. By the way, ChemMantis is markup and
nomenclature transformation integrated system. We tried chem scrabble, but we
couldn't come up with it to mean anything. [laughter].
So in a couple of seconds you can go in and cross an entire chemistry article and
you can find all the chemical names and link them out to ChemSpider, which
takes you out now into the world of Google and Bing and publications and
patents and chemical vendors. From an article today you could figure out where
to buy the chemical, all the patents about the chemical, all PubMed articles about
that chemical. It's all linked up now.
Doesn't have to just be chemicals. It could be species. In this case, we'll link it
out to Wikipedia articles about species, and it could just as easily be hardware
vendors and software vendors. They're just dictionaries.
So what would you want to link it to? We go back to that list of things that we
were doing on ChemSpider. Once you got the link off of your publication and into
ChemSpider, you just made the entire Internet linked by structure.
What we're trying to do is help people get away from having to draw structures
themselves. Nobody should draw cholesterol again. If we've got it right, let them
reuse it. So those of you who know how to embed videos from YouTube, just
take a little piece of JavaScript, go to ChemSpider, find the molecule of interest
and copy the JavaScript code into your blogs and your wikis. JC's students do
this all the time. They never draw the molecules anymore, as long as we have
them on ChemSpider. So that's embed code.
By using the embed capabilities and the web services we built around spectra,
now they can play games. And the students are playing games looking at
spectral data and they find errors, they curate our data for us. We now provide a
game to clean up our data. Tricky. It's great.
So you come along, you choose which molecule fits that spectrum, it takes you to
the next one. 10 spectra later we make it three molecules and then four and then
five. Make it more and more complex and the students that are playing the game
win awards.
Computers don't want JavaScript, they want web services to integrate things
together. So we provided them. We're linking out in many ways now. So
Notebook Science, Open Notebook Science. They're using those structure
drawing packages, they're using those software offerings from billion dollar
organizations like Thermo, Waters, Agilent, Bruker. They're plugging those into
their systems. IPhone apps are linked up to this. What we don't deal with yet,
materials. Materials are tough. You can't draw a connection table. You can't
draw a molecule very easily.
Minerals are tough, polymers. And we don't intend to manage proteins. That's
done well enough by other organizations. We're just going to talk about open
data, likely open source. We're going to talk about open access. So
ChemSpider's not open source. I'm going to thank Microsoft for being very kind
to us and giving us MSDN licenses. It all runs on SQL Server. And while we've
had people suggest we move it on to MySQL, we could do that, but we can't
deliver things as quickly as we need to by moving to MySQL. We're on a
Microsoft platform.
We use open source components. There's some great open source components
out there. It's not an open access database, because open access is in most
cases a publishing term. It's free. It's free to use. You can take data, you can
use web services. It's not quote/unquote open access. We don't assume
copyright when you give us data. It's your data. We're not taking it from you.
And then this question is open data. Open data has been an interesting term for
a number of years now. Panton Principles we've already heard mentioned by
Cameron. Peter's going to talk about them again.
Who declares data as open? Everything that sits on ChemSpider cannot, by
default, be open. It can't. And that's because we have organizations giving
those algorithms, and if we gave all of their data away, we would harm their
business model. We have a pragmatic position. We're going to serve as a
community resource and provide value. We're not going to -- we cannot make
everything free, because we're not allowed to. So it's free but not open.
So it is today. 23 million compounds, 300 data sources, 7,000 users a day, half a
million transactions. While I've been sitting here twittering in the back, I've also
been flowing data and put 80,000 molecules in this morning I collected in San
Francisco yesterday.
Grows daily. We're providing a platform that other people can use for their own
needs. We have to keep cleaning; the data out there are filthy today. We've got
millions of data left to -- structures left to deposit, six million.
We're now integrating RSC content. A publication gets published by the RSC with
structures, the data goes into ChemSpider at the same time. So that's going to
flow out there together. We'd encourage all publishers to participate if they want
to.
The Semantic Web for chemistry we are trying our utmost to provide one of the
pillars to use.
Long list of people I could acknowledge. I'm out of time already. And
SyntheticPages, for those of you who care about chemical reactions, we're about
to release a public database of chemical reactions for others to contribute to.
And this is my contact information. And the slides are already up. I hit upload
one minute before I stood up to talk. So they're there if you need them. Thank
you.
[applause].
>> Lee Dirks: Perhaps just two questions so we can stay on schedule, if there's
any questions.
>>: So all this [inaudible] by the Royal Society of Chemistry, is it?
>> Antony Williams: Well, originally it was run out of the basement, and it was
self funded as a hobby. Now it's actually owned by the Royal Society of
Chemistry.
>>: [inaudible].
>> Antony Williams: Oh, they've been around a long, long time, yes.
>>: That caused quite [inaudible].
>> Antony Williams: My best estimate of what it took for us to build it is about
$25,000. And lots of sweat and tears. To sustain it, well, it's scaling. It's
growing bigger. We have an IT team now that is second to none really. There's
three of us that are full-time employees. There are different ways that we can
look at generating revenue from this, but it will always remain free. We can do
advertising and we can license web services.
But the RSC is a charity, so they have a publishing arm and they have a
charitable arm, so in many ways this is a giveaway back to the community
because they're a society that is a charity. So yes. Fully sustainable really.
>>: When you link out to a data source or to wherever you link from a structure,
do you have a way of coping with broken links? Because a lot of the times
they're going to break on you.
>> Antony Williams: Yeah, link decay?
>>: Yes, link decay. Yes.
>> Antony Williams: So Bill's question is when we link out to a particular --
from a data source out to a particular link and if that breaks what do we do about
that? So we're building systems so that we can actually go through and monitor
for link decay. But you'll probably have to check things three or four times
because sites can go off, you know, a day or a week type of thing. Right now we
don't have that fully under control at all.
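The "check things three or four times" idea can be sketched in a few lines. This is a minimal illustration, not ChemSpider's monitoring system: the `LinkMonitor` class, the threshold of three failures, and the injected `is_alive` checker are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class LinkMonitor:
    """Track consecutive failed checks per URL (hypothetical helper)."""

    is_alive: Callable[[str], bool]  # injected checker, e.g. an HTTP request
    max_failures: int = 3            # assumed threshold, not from the talk
    failures: Dict[str, int] = field(default_factory=dict)

    def check(self, url: str) -> str:
        """Run one check and return 'ok', 'suspect' or 'disabled'."""
        if self.is_alive(url):
            self.failures[url] = 0   # a site that comes back is forgiven
            return "ok"
        self.failures[url] = self.failures.get(url, 0) + 1
        if self.failures[url] >= self.max_failures:
            return "disabled"        # only now would the link be switched off
        return "suspect"
```

Running `check` once per scheduled pass means a site that is down for a day or a week is only marked suspect, and a link is disabled only after repeated failures in a row.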
It's okay for publications because we use DOIs, so we depend on CrossRef to do
that. Wikipedia is unlikely to change its domain name very easily. But
Jean-Claude Bradley, for example, I mean, tomorrow he might choose to stop
doing Open Notebook Science. It's very unlikely. We just heard the guy talk,
right?
But he's been kind enough to put up his entire archive on Lulu. So we just
bought a disk for five bucks and we'll set it up on our servers and all his links will
be safe in our world. But chemical vendors come and go and things like that. So
some of those links are going to decay. At which point we'll just disable them
really. It is a tricky thing to do.
>> Lee Dirks: Very good.
>> Antony Williams: Thank you.
>> Lee Dirks: Thank you very much.
[applause].
>> Lee Dirks: And we'll let Peter get set up here. I don't know about you guys,
basement of my house is full of boxes. I think I might have a couple of bottles of
wine down there. This guy's changing the face of chemistry. It's unbelievable.
It's a pretty amazing hobby.
I would like now to hand it over to Peter Murray-Rust, who is one of the -- one of
the signers of the Panton Principles and one of 3 I think that we have here today,
and to give us a presentation on I think a broad variety of topics of the work that
he's doing in and around this field. Over to you, Peter.
>> Peter Murray-Rust: Right. Well, I'm delighted to be here, and I'm also
delighted that this is being recorded because I don't use Power Point, and I don't
know what I'm going to say, and I need to know what I've said after I've said it.
So this I think is a very important meeting that coincides with a whole lot of things
which are coming together in terms of a release of openness. I'm also delighted
to see lots of people in the audience who I've known remotely people like Bill
Hooker and Heather Pivalol [phonetic] and so on, which is great. So you meet
up there.
I'm going to talk about open data. I'm going to go through quite quickly because
I've got three things to announce today quite apart from anything else. I'm going
to say something about the Panton Principles because Cameron didn't show the
pub. I'm going to talk about Is It Open from the Open Knowledge Foundation, and
I'm going to give a sneak preview of Chem4Word.
So lots of things that I might talk about, and I will come back to these later and
see if any of them I've missed. Linked Open Data is another word for the
Semantic Web, another approach. And what is key here is both open and linked.
And if one is going to have the machines running over the web, there must be
zero friction. And in my view, the biggest amount of friction at the moment on the
web is whether you are allowed to use that resource at the end without having
lawyers send you some sort of letter.
So what I believe is at the moment we can only do Linked Open Data if all the
data are absolutely certified completely open. And I'll say what I mean by that.
It's actually very easy. It's as if it has got an open data button from the Open
Knowledge Foundation. So that is my full definition of open data.
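The "zero friction" requirement can be sketched as a filter that a crawler applies before touching any record. This is only an illustration under stated assumptions: the `license` field, the record layout, and the contents of the `OPEN_LICENSES` allow-list are hypothetical and not taken from the talk.

```python
# Assumed allow-list of licence tags the machine treats as fully open.
OPEN_LICENSES = {"CC0", "PDDL", "CC-BY"}


def is_machine_usable(record: dict) -> bool:
    """True only when the record carries an explicit, recognised open licence.

    Missing or free-text licence statements are rejected, because the
    machine cannot interpret them without a human (or a lawyer).
    """
    return record.get("license") in OPEN_LICENSES


def usable_records(records: list) -> list:
    """Keep only the records a crawler may reuse without asking anyone."""
    return [r for r in records if is_machine_usable(r)]
```

The design choice is that anything not explicitly marked is treated as closed, so "almost open" or "free to read" material is skipped rather than guessed at.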
Rufus Pollock, Jordan Hatcher and John Wilbanks -- is John here? He is. Right.
Have spent two years talking about this. They have solved the problem for me.
They've gone into huge amounts of detail about this. I just accept they've got it
right. So I just go ahead and say this data is open. There is a difference
between open access and open data. You cannot take open access ideas and
relate them to data. You cannot take open source and relate it to open data.
And almost open, freely accessible are very valuable, but they are not good
enough for open data. Now, I want to talk about software as an agent of change.
We've seen how we can get things out with crowd sourcing, with communities,
with all sorts of ways of doing things. Software is also a major way that one can
push ideas. Because if everybody uses a piece of software and that software
has got embedded in it a political philosophy or a social philosophy, then that will
get out to zillions of people.
I also want to say something about web democracy. Now, you've probably seen
that the UK has torn its insides out over MP scandal and things like that. We do
this very well in the UK. We agonize but that agonizing is a process of
democracy which is being fueled by web tools. And I want to say something
about what mySociety has done here.
I also want to say something about a bottom up approach. I've been one of the
founder members of the Blue Obelisk and this is a community which creates
software data and other resources with no membership, no constitution, no
nothing. All that happens is it just meets from time to time and occasionally
people get a Blue Obelisk.
I want also to say something about text and data mining. I think understanding
human natural language is going to be the next great thing in information. At the
moment, Google, all the tools you've heard about at the moment can only
recognize things if they understand single words or stock phrases or if people
have worked very hard to program it into a template. I think that when we start
understanding what people communicate in normal language it will be a big
break through in our use of information.
And I want to say something about the fact that if you get the right system, it is near
zero cost to build it. Now, Tony's talked about ChemSpider. Near zero cost. I'm
going to talk about CrystalEye which runs at essentially zero cost at the moment.
You can build very, very cost effective tools in certain circumstances.
And finally I want to -- how many people here are from the library world? Yes.
Yes, I thought so. Right. Right? Okay. Well, I'm going to say something.
Libraries are not doing enough to make data open. Right? They are simply not
putting their heart into saying this data must be out there. I went to a meeting
last year on electronic thesis and dissertations and I said can I have your theses
and can I data mine those and I'm going to show you what data mining can do
and they said things like you've got to write to every author and you've got to
send it in on this form and all this sort of stuff.
That is not Web 2.0, it's Web 0. And so you have got to find out how to get
that data out there now when it's published. You know, there are no
qualifications, no nothing. Get that stuff out because theses are the biggest
resource that we are missing at the moment in science.
Okay. A few people to thank. My own colleagues in Cambridge, I'll just leave it
up there. I hoped to blog all this before I started, but I haven't been blogging for
a little while. I will resume. But these are the people who have done wonderful
things there. You've heard about Oscar. You've heard about understanding
chemical names. There's a lot of chemistry here. And I make no apology
because actually chemistry is the best subject to do the semantic scientific web
on. And then a whole lot of other people who have contributed here. And you're
going to hear about our involvement with Microsoft, which has been
tremendously productive.
So let me just say something -- show you a picture of Blue Obelisk. Much of the
software -- all the software I use is open source. It's not all written by me or my
colleagues, but it is part of this ecosystem that this community is providing.
Now, my view is this is enormously liberating because it is not only inexpensive
like zero but it is also something that you can take and modify and innovate with.
You cannot innovate with commercial software. You can innovate with free
software, free as in speech. Right.
So I'm going to build this on a project which I've been very honored to be part of,
which is Richard Whitby's Dial-A-Molecule from the University of Southampton.
Richard has got a grant from the EPSRC, one of the research councils in the
UK for a 20-year vision. It's not funded for 20 years, but the vision is 20 years to
build a system where machines can reliably 100 percent work out how to make a
molecule and then make it right so that if you think this would be a good drug or
this would interact with some part of the body or whatever it might be, you just tell the
machine, go off and do it, and it will do it for you. So that's the goal.
Now, I am running the strand which is the knowledge-driven approach. And this
ties in with the fourth paradigm, the idea that much science from now on is going
to be knowledge driven. What is out there already. So I'm going to show you
how we get at what's out there already.
And it was very clear that people didn't want simulations, they didn't want
cunning algorithms and so forth, they wanted to know what was actually out there
at the moment.
So you've seen a lot of chemistry. I'm going to talk about reactions, not
molecules. I don't know how many reactions are published either formally or
informally. Yeah, I'm guessing it's, you know, in the low millions, something of
that sort. Do you know, Antony? Tony?
>>: [inaudible].
>> Peter Murray-Rust: Well, how many new compounds are published a year?
>>: [inaudible].
>> Peter Murray-Rust: Several million.
>>: [inaudible] you mean by published.
>> Peter Murray-Rust: In Chemical Abstracts?
>>: I think Chemical Abstracts do different things now because they're
enumerating from patents --
>> Peter Murray-Rust: Well, anyway, it's an awful lot, right? [laughter]. And it
really doesn't matter, it's zillions, right, okay?
Many of these are repeated, which is very good because it takes us back to what
Jean-Claude does about the fact that, you know, you don't always get the same
answer each time. They come from three sources mainly, journals, theses and
patents.
And journals we've heard a bit about. Possibly the gold standard and possibly
not. But the main problem you have with journals is the fact that most journals in
chemistry are not free. So there is Wiley's statement about this: copyright Wiley.
Copyright on the tables, copyright on the molecules, copyright on the spectra.
The ACS are slightly less laid back about it. What is important? Subscription to
an STM journal is access to a database maintained by the publisher. In other
words, all our data are belong to us. Right? Okay.
So what you have here is the real problem that large organizations want to own
data. And there's little sign of that going away at the moment. So we have to
develop agents of cultural change, which are a mixture of sticks and carrots,
and you've heard of some of them already, but we've got to put in place that
future which will allow us to move away from this central gatekeeper control.
So back to Dial-A-Molecule. What's a patent look like? Well, this is a bit of a
patent. I'm guessing -- well, I know -- this I know, because I have a colleague
who's working on this. He downloads approximately 3,000 chemical patents a
year from the European Patent Office, and they contain about 100,000 syntheses
a year. And we can read them all. So here is a typical chemical reaction.
You've seen this sort of thing.
Now, I'm not going to executives this, but our software, our natural language
processing software can not only recognize the chemical names in that, it can understand the
language that the chemist has used so you can work out who did what to which,
when, and for how long and so forth. It understands every word in that sentence,
which means that we now have a complete capture of that type of information.
The patent itself is enormous, it's 270 pages. That's one patent. No human can
read that, you know, if they're doing lottery.
So here's a thesis. Now, one of the great things about theses is that there's
much more detail about what doesn't work than there is in the published
literature. If you publish a journal article and you say this doesn't work, they'll
reject it. If you publish a thesis and you don't say this didn't work, your examiner
will be on top of you. So there's a lot more in the thesis which is there.
So here, you've got a bit of a thesis. This is actually one of our post-docs in
Cambridge, Jurgen Harter with Steve Ley. And here you've got a reaction. Now,
this reaction is -- I'm going to show it. Right now -- anyone recognize -- oh, my
goodness. You don't want to see that yet. Right. That's a surprise.
Okay. So here's this -- anyone recognize the software that's being displayed
here? Well, don't laugh because I'm going to click on that. How many people
recognize that bit of software? Right. Because that reaction is actually a partially
intelligent object. And you can click on it and the reaction will burst into life. Oh,
dear. The server application source file or item cannot be found.
Now, what does that mean? That means that if you pay a certain vendor about
$250, you would be able to read that reaction. And so what we have here is
gatekeeping by commercial software. That thesis is crippled by the
non-availability of software. So we have to do something about it. That reaction
needs liberating. And the only way we can liberate it is to come up with open
software which does the same thing.
Wonder if anybody can see where this is going. [laughter]. Kill that one. No, I
don't want to save it. And I'll put that one just down there for the moment.
So anyway, here's a typical example of a failed reaction. It was not [inaudible].
Instead something happened in addition it was a somewhat unexpected product.
These are words which are in natural language processing called sentiment and
so we're doing sentiment analysis on this. We're trying to find out, you know,
what the motivation was, what happened and so forth.
So now we come down to what can we do about it. So here's the -- we want
hundreds of thousands of reactions, we want them for zero cost, and we don't
want any trouble, right? Okay. So here's our first effort. And Cameron didn't
show this, but this is actually the Panton Arms. And you can see the people.
That's Jenny Molloy who is a second year or third year Cambridge
undergraduate. She's done a huge amount of effort on that. Some of the others
you will recognize.
But here's Rufus, right? This is Rufus Pollock. You know, the indefatigable
progenitor to you know, and hustler in the Open Knowledge Foundation. So he
makes things happen. There's John and that's Jordan isn't it? I think. Anyway,
there it is. This is a highly historic picture. You know.
So that's the Panton. But also we wanted to know not only is the data there, you
know, can we get the open data, but how do we know it's open? So here I went
to something called whatdotheyknow.org. Now, this is a freedom of information
in the UK. And if we go to this -- and this is -- so this is actually a website
whatdotheyknow.org, and I can request a message from any organization who's
required to respond by the Freedom of Information Act.
And I've mailed the British Library. I mailed the British Library about why were
the British Library charging for open access publications which they were. Dear
sir or madam. Now, this is there in public. And I asked that question, and
because mySociety have created this tool, everybody can read that. So you can
go to this whatdotheyknow.org and see who's requested what of bodies which
are required by UK law to respond. And in fact I got this back, and somewhere in
the middle of this I've probably got the results of this.
What I thought is that wouldn't it be nice if we did the same for publishers? Is
your data open? Right? Yes, no, possibly. So we put together a tool which is
called Is It Open. And I think what I will do is probably go to another web browser.
I -- here we are. This is better. So here we are starting to ask authorities
whoever they are is your data open? So I have mailed David Wild and Christoph
Steinbeck who are editors of the Journal of Cheminformatics. And here I mailed
them early this morning.
I am writing to ask you about the openness of data published in Journal of
Cheminformatics, right? Now, it's an open access journal, CC-BY so, you know,
it's algorithmically open if you like. But I thought we'd start with the easy ones.
And what we've got back, here we are. Within, you know, an hour or two, the
editor has replied, asking his editor, can we technically -- can we put an open
data button on the Journal of Cheminformatics because that button is the key to
making data open. All we have to do is to put that button on to enough data
sets, enough software tools, we put it into tools so that tools create an open data
button, and then the problem is solved.
So that's what I mean by software being an agent of culture change. Now, I'm
going to show -- just before I show -- did I mention Chem4Word? Yes. Before I
show Chem4Word, I'm going to say just a very brief thing about CrystalEye. This
was a student project, a graduate student project built by Nick Day, and this is
built by one person over about nine months. And what this system does is it
trawls every publisher on the web who publishes crystal structures, brings them
down every night and aggregates them.
Now, it can't do every publisher because some of the publishers hide their stuff
behind firewalls. So it doesn't do Wiley, it doesn't do Springer, it doesn't do Elsevier, but
it does Royal Society of Chemistry, it does ACS, it does the International Union
of Crystallography.
This, you won't be surprised to know that this wonderful software on both sides
here came from the Blue Obelisk. And let's just go through and look at this data.
And what can you see at the top left of that page? Yes. You can see open data.
And when you click on that, it goes through probably in finite time -- well, it's gone
through I think -- there you go. So it goes through to the Open Knowledge
Definition. That is open data. The machine knows that it can use that data there for
any purpose whatsoever.
Now, of course reading the literature in this way is actually a bit of sticking
plaster. We ultimately want to create semantic data at source. We've heard lots
of ideas here. We've got some projects in house which are looking at that. But
the most important one here is Chem4Word. So I'm going to show you
Chem4Word. Now, I'm just going to start off by showing here's a bit of thesis
in ordinary Word. And you'll notice here things like bold numbers which don't mean
anything.
That is actually reference to compound 155. We don't know what it is because
it's about 50 pages elsewhere in the thesis. So it makes it incredibly difficult to
read. So what we've done here, and I don't have time to talk about Lee and
Pablo and Alex and Tony Hey and the others is over a year and a half built a
completely free open source add-in for Word. It fits into Word 2010. And if I now
go to -- here we are. This is it. Here's what it looks like. Now, this is the same
stuff here. But you'll notice now that when we move over anything in here it lights
up. And there's a whole list here of all of the molecules in the thesis. So if I want
to know where water is here, I can go there and click it, and it lights up there.
This one lights up there. That one lights up there and so on.
Now, this means that we know at each stage in the thesis exactly what it is. So
this is a tool of great -- much greater power than the current tools for a student
editing their thesis. So this is not only if you like going to cost zero, it's actually
going to do more. And what we're looking at is ways in which the community
can help develop it.
Now, that's actually wrong. So what I'm going to show you is how we can edit it.
Are we going to the editor? To do that, we have to have what's called a 2D
structure. Now, notice here -- you probably can't read it, C11H18O4. But
remember those three numbers. Right? And we go to 2D representation and
here it is. So that's a molecule. I go to edit 2D and now I realize that those
should be acids instead. So I click on that one. I'm not a very good clicker
sometimes. There we go. And we'll put an acid on that one and we'll go to that.
I'll put an a acid on that one. Right?
We'll save it. It tells us that some of the names have changed. It's inconsistent,
so as Tony told you all about how names didn't relate to the right things. This is
the sort of way that it happens. So this tool is able to tell you where your names
probably don't relate to whatever. But now if you look at that, that's gone up to
C11H8 -- what is it? It's increased the number of oxygens in that.
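The before-and-after comparison of the formula can be sketched as follows. This is a minimal illustration, not how Chem4Word works internally: the function names are hypothetical, the parser handles only flat formulas without brackets or charges, and the "after" formula in the usage note is made up for the example because the edited formula is not stated in the talk.

```python
import re
from collections import Counter


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C11H18O4' (no brackets)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def formula_change(before: str, after: str) -> dict:
    """Return the per-element difference between two formulas."""
    b, a = parse_formula(before), parse_formula(after)
    return {el: a[el] - b[el] for el in sorted(set(a) | set(b)) if a[el] != b[el]}
```

For instance, `formula_change("C11H18O4", "C11H18O6")` with that hypothetical second formula reports only a change in oxygen, which is the kind of check that lets a tool flag names that no longer match the edited structure.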
So what we have here is we have a tool which is open, which anybody can
download, which we hope the community is going to build new versions of
without necessarily our permission, which then becomes a way of creating this
chemical Semantic Web.
Finally, I want to say a little bit about two other projects. I think I started a bit late,
so I have two extra.
The next one is actually the -- probably the first example of a designed small part
of the chemical Semantic Web. It's relatively straightforward to put your stuff out
there. It's more difficult for two or three people to share various components of
that. And this is the OREChem project, which is sponsored by Microsoft. Lee's
program director manager, whatever. And he chose four participants here, Penn
State, Southampton, Cambridge, and Indiana University. And the whole thing is
coordinated from Cornell with Carl Lagoze, who has developed a Semantic Web
framework for academia which is called OAI-ORE.
How many people have heard of OAI-PMH? Well, this is its sort of little sister or
whatever, right? It's coming along. It will be very important. It is very important.
Right?
So the idea here is that Penn State reads stuff from the literature, does some
crawling, indexing, sends it down to Southampton who then turn it into partially
semantic form. That then comes down here. These little things are RSS feeds.
It comes down to Cambridge. And we turn it into Chemical Markup Language and
RDF, send it to Indiana, and Indiana then do high throughput computing on this.
So one of the outcomes of this will be that we can compute every molecule that
comes out on the web. Because there might be a million molecules a year,
whatever it is. They might take -- some of them might take a week. But what is a
million weeks on today's computers? It's almost nothing. You'll find places
anywhere to do that sort of thing, and of course some of the companies are very
keen to try this out.
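The "million weeks" estimate is easy to turn into arithmetic. This is a back-of-envelope sketch under two assumptions that go beyond the talk: every molecule takes the worst-case week, and each job occupies one core.

```python
WEEKS_PER_YEAR = 52


def cores_needed(molecules_per_year: int, weeks_per_molecule: float) -> float:
    """Cores that must run continuously to keep pace with one year's output."""
    total_core_weeks = molecules_per_year * weeks_per_molecule
    return total_core_weeks / WEEKS_PER_YEAR
```

Under those assumptions `cores_needed(1_000_000, 1)` comes to roughly 19,000 cores running all year, and far fewer if only some molecules take a full week.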
I will -- the last thing I'll talk about what we're doing here is because it came up in
relation to things that Jean-Claude was saying, is the release of data which is not
immediate. So Jean-Claude has got these categories which I hadn't heard about
before, and let's talk about the category which is delayed. So category either
complete or partial delayed.
Now, the trouble is if you delay something, it's awful easy actually to do nothing,
to say I'm going to do that later. So what our system here does, and you don't
need to know the details here, it's an embargo manager called EmMa, and it
controls when something is released.
So the idea is you publish directly into EmMa at the time that you do it. So it's
immediate but hidden. That means that you can take the same philosophy
whether you're going to ultimately release it at all or whether you want to release
it immediately.
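The "immediate but hidden" idea can be sketched as a record that is deposited straight away and becomes visible either on a fixed date or when a publication event is recorded. This is not EmMa's actual design; the `EmbargoedItem` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class EmbargoedItem:
    """A deposited data item whose visibility is controlled by an embargo."""

    name: str
    deposited: date
    release_on: Optional[date] = None  # fixed-date embargo, if any
    published: bool = False            # set when the publication event fires

    def is_visible(self, today: date) -> bool:
        """Visible once published, or once the release date has been reached."""
        if self.published:
            return True
        return self.release_on is not None and today >= self.release_on
```

An item with neither a release date nor a publication event stays hidden indefinitely, so the depositing step is the same whether or not the data is ever released.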
This manages the trust between the different components in the system so that
some of this, for example, goes into an internal repository or a private Atom feed.
Atom, by the way is nothing to do with chemical. It's an RSS protocol.
And so again, open source. Anybody who needs an embargo manager, and I
suspect most of you are going to need embargo managers, that will be available.
So I'm just going to come back and review over one or two minutes the things
that we've covered here. I hope that I made the case that if we are doing
completely machine collection and analysis of data then it has to be truly open
data along the lines of the Open Knowledge Foundation open data button.
Because a machine cannot make the decision about reading lawyers letters and
licenses and things like that. So it's got to be completely open.
We accept pragmatically that there will be many things which are not
completely open. It does allow some principle to come down to granularity where
you could even stick this on individual data items of some sort.
We've come very clearly to conclusions, particularly Cameron and myself, that
we want a very simple approach. An awful lot of time is spent in the open
access community trying to work out what open access is. You know, it's
actually what you do in many cases rather than huge amounts of discussion
about it. And if you come to the end, which is CC-BY, it becomes trivial. And I
would strongly suggest all of you if you've got open access material, strive however
you can just to make that CC-BY. It solves all your problems. You've actually
got time to do something else.
I'm very excited about the power of web democracy of building tools which will
change culture. You've seen what mySociety can do. You can ask questions of
people. Organizations are going to have to be more responsive to public interest
and so on. And that will cover all sorts of disciplines. And it means I think that
the publishers are going to have to step up and actually answer some of these
questions.
Publishers cannot hide at the moment. We know one publisher who has taken
over two years not to respond to a query. That's really not acceptable when they
are charging academia huge amounts of money for the privilege of reading our
material.
I think text and data mining can be critical but until we solve the legal problem it's
going to be problematic. We start with the easy bit. And I would therefore like to
thank the organizers for inviting me. I am very pleased to be associated with
the Science Commons and effort in this area. And also a great deal of thanks to
Microsoft over the last three years, who have moved an enormous amount.
When we started this project, there was no hint of open source. We are now at
the stage where there is enthusiastic pressure from Microsoft to release this as
open source, and we're really looking forward to this change in culture where
freedom means innovation and involvement. Thank you.
[applause].
>>: Is Chem4Word open source at the moment?
>> Peter Murray-Rust: Lee actually I think it -- Lee, if you answer that question,
it's probably a more accurate answer.
>> Lee Dirks: So we'll be releasing the beta for Chem4Word. It will be actually
released -- we've been working on it, kind of co-developing it, over the course of
the last year and a half. It will be announced that we'll release the beta next
month probably at ACS. The beta will be made available open source. We
haven't finalized the licensing terms but it's probably going to be an Apache
license. And then once we're -- at that point, we're going to hand it over to
Cambridge, and Cambridge is going to take that project on and move it forward.
And it won't be -- we'll probably have a seat at the table, but they will be the,
what, benevolent dictator of the -- of that project moving forward.
>> Peter Murray-Rust: And Lee is the 800 pound gorilla [inaudible].
>> Lee Dirks: Thank you, Peter. [laughter].
[applause].
>>: What is CC-BY?
>> Peter Murray-Rust: Right. Okay. This is a good time to go out on the web.
So this is one of the great change advantages of the web. So we'll go out and
see if we can find CC-BY. I do have a slide that is actually quicker just to do this.
All right. Creative Commons Attribution 2.0 Generic. It says you are free to
share it, copy, distribute and transmit it, and you're free to adapt it.
And you must acknowledge the work in the manner specified by the author or
licensor but not in any way that suggests that they endorse you or that. So, in
other words, it is you must attribute the source and that's it. Right? It's
wonderfully powerful. So, yeah? You -- sorry. Was there another question.
>>: It's pointing to you first.
>>: Will there be a Chem 4 OpenOffice?
>> Peter Murray-Rust: Good question. We have two technical problems here.
One is the question of language, right? So Chem4Word is written in C#, which is
an open language but is largely used on Microsoft platforms, right? Whether that
would work in an OpenOffice environment, I don't know.
Secondly, there are certain things which are called in OpenOffice -- in
Chem4Word which are probably not present in OpenOffice. Right? There is no
reason why it couldn't be reverse engineered for OpenOffice. We are not funded
to do that. I think you know personally I think it would be a good idea to do that if
there is a need for it.
>>: You were talking about the embargo. What sort of time periods have
you all been talking about?
>> Peter Murray-Rust: Well, normally it's actually until time of publication. So
normally what a chemist wants to do -- now, of course we do have some
problems here, but let me just go back to the picture of it. Normally what a
chemist wants to do is to keep their data safe, know that at some stage they're
going to do something with it. They don't necessarily know what they want to do.
They certainly don't want it released before publication because some publishers
will actually then regard the publication as invalid and probably reject their next
paper or whatever.
It's actually quite a difficult thing to know when a paper has been published. You
know, a young researcher may very well read every odd issue of whatever, scan
the web pages, but there's nothing that triggers this has been published, right?
There isn't a place, you know, where triggers come back to the author. So what I
think EmMa can probably do, although there's no immediate mechanism for this,
is to respond to that trigger when something's being published.
So this would also work for things like theses, so clearly when the thesis has
been accepted and put in the library, then that's the time at which data might be
released. So sometimes it might be, you know, six months, but it's more likely to
be until such an event has occurred.
>>: So are you talking about just the experiments [inaudible] the paper or the
associated experiments as well?
>> Peter Murray-Rust: It's whatever we can persuade people to do. I mean, we
have an approach in Cambridge where we have invested in a commercial
electronic lab notebook but we are managing independently our own NMR and
x-ray data and so on.
Now, it's my hope that all of the x-ray and NMR data goes into this anyway, right?
There's a question as to who -- some problems you have who owns it, how do
you identify people? It's not actually trivial to identify people, authorities, groups,
and things like that. So all of this is bringing out the -- you know, the problems in
the human metadata. The scientific metadata is straightforward, it's the human
metadata that's [inaudible].
>> Lee Dirks: Are people ready for lunch? [laughter].
>> Peter Murray-Rust: I'm sure they are.
>> Lee Dirks: All right. Thank you very much, Peter.
[applause]