>> Lee Dirks: Thank you very much for coming back on time. We will go ahead and try to stick to the schedule. And I'd like to turn it over now to Tony Williams to come talk to us about ChemSpider and other things going on at the Royal Society of Chemistry.

>> Antony Williams: Is this active? So first of all, thank you for the invitation. I love it in this part of the world. I actually got engaged out here a few years ago, and life has been great since then. I'm going to tell you a little bit about a pragmatic vision. So the people who know me would say that I just get down to it, I don't let anything stand in the way, and I'll do silly things that are against most recommendations. This is an example of that. This is actually a hobby that started off in a basement. It started off with a vision of building a structure-centric community for chemists. So I was watching what was going on with PubChem and things like eMolecules and databases online of chemical structures. And I was working in a company where we dealt with prediction algorithms for structures; we drew them, we databased them. I was working a lot of hours, and I decided to do something that was creative and fun. Hard work is not always creative. That's one of the challenges of having a career. So it started off as a hobby project in December of 2006, and it was to integrate chemical data on the web. I liked what PubChem was doing. They were very focused on dealing with assay data. I started having a concept about how you could do something Wikipedia-like around chemical structures specifically. So we integrated chemical structures on the web. That was the vision. Make it a hub to information. Because the web is just covered in information about chemical structures, but there's no easy way into it and out to all of the data. We wanted to provide access to structure-based algorithms, so because I was in the world of chemical informatics, I knew a lot of people who were building algorithms.
They might be willing to share them. We wanted to enable chemists to contribute their own data, because there's a lot of data that gets generated that sits in notebooks, sits in pads; it never finds its way anywhere, so let them put it somewhere. Wikipedia's not the place for everything that's getting run because it's an encyclopedia. Google spreadsheets could work, for example. It certainly ends up in Excel spreadsheets and Word documents, but then it gets limited, it doesn't get shared. So we wanted to share it. And we wanted this ability to allow the community to create and correct data. Because of course, everything they write about Tom Cruise on the web is true, so therefore everything about chemical structures on the web must be true. There is so much stuff out there it's scary how bad it is. Chemistry online today? Where is it? I'm not going to read this list out. It's all there. It's encyclopedias and property databases and drug discovery data and publications; chemical structure information is all over the place. The searches on the Internet today are primarily limited to text-based searching. That's probably what you would know about if you were trying to search for information about a particular compound. The data are certainly dirty. I'll show you some examples. It's very difficult to figure out what's good and what's bad. You just don't know. Who can you trust? Who are the authorities out on the Internet? And too many searches are required to source data. You do a search on any of the search engines, that's a text-based search for a chemical compound, and you can get anywhere from a few tens of hits to millions. And now how do you figure out what's behind it? So humans do not want to deal with lots of interfaces. They actually want to deal with something that's rather simple to use. That's why they mostly go to a single search engine today. They want as few interfaces as possible, but to get as much integrated information as they can.
What would the future look like? And actually, how far away is the future? Because it's coming very, very quickly. The Semantic Web for chemistry is in place. That's our future. Whatever this thing, this Semantic Web, is going to end up being, it's going to be there. Crowd-sourced contributions. Well, crowd sourcing is here. When you read Amazon book reviews, movie reviews, I mean that's crowd sourcing right there. Wikipedia, that's crowd sourcing. Chemists will be searching the Internet based on structure and substructure, not just text. There's no reason they shouldn't work in their own language. Chemistry articles are indexed and searchable, both by text and by structure and substructure. Why not? It just makes sense for a chemist. Again, a reduced number of searches. The data are integrated. So if you want to find all publications about a chemical, that's easy, and all the patents and where to buy the chemical, and why not all the analytical data and all the melting points, boiling points and properties that Jean-Claude and his group are putting up? Why not? Makes sense. And then we're going to be in this world of open access and open data. Peter's almost certainly going to talk to you about open data. He's a better person to do that than me. This is coming. In fact, in many ways it's already here. We're just not naming it open data. Classical business models, they will have to change. So we had a vision, and we went off to get it done. In March 2007, we went to an international chemical society meeting in Chicago and we released this searchable structure database online. We bought one computer, and we built two with our hands from a company called Tiger Direct, buying a motherboard and hard drive and chips and throwing them together. We plugged it into the cable Internet that serves TV signals and voice over IP and started letting people do searches. We seeded it with 10 and a half million structures from the so-called public domain.
We sourced them from PubChem. We released it with structure and substructure searching. And it went live. In June 2007, a few months later, we realized that the data that we had released out to the Internet was very dirty. Just in our lookups we were finding a lot of errors. So we had to figure out how to make that clean. Well, 10 and a half million structures with maybe 20 to 30 to 100 names each, and lots of properties associated with them, leaves you open to a whole lot of data issues. So we needed the crowd to help us clean it. So we put on a curation layer. We said okay, 10 and a half million is a good start. And we will aggregate more data. But we'd like you, the scientists out there in the world who are using our system, to add your data. You might not want to publish it in the classical peer-reviewed sense, but you might want to share information, so why not? So we added on a deposition interface. And so it continued. This is the world of search engines today. This is what people are used to. Most people are working with a single box that says enter something to search. In the world of chemistry you can have chemical names, such as aspirin, erythromycin, vancomycin. Those are very common types of searches. But then we have this thing called a SMILES string, which is a way to represent a chemical structure. So we let you paste a SMILES string. We also have things called registry numbers, and those exist in industry registry systems such as the Chemical Abstracts Service. And we also have InChIs. How many people in the room have not heard of InChI? Okay. I will tell you a little bit about InChI shortly. Let me swap over here to -- if I can figure this out. I'm trying to figure out how to get to the live Internet. Oh, there we go. So this is ChemSpider. ChemSpider.com. Free to everybody to access, no charges to use it. If you search cholesterol, it's about .02 of a second to search it.
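As an illustration of the single search box described above, here is a minimal sketch of how one text query might be routed to a name, SMILES, registry-number, InChI or InChIKey search. It is not ChemSpider's implementation; the function names and the SMILES heuristic are assumptions made for this example. The CAS Registry Number check-digit rule and the 14-10-1 letter layout of an InChIKey are the publicly documented formats.

```python
import re

CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# Characters that may appear in a SMILES string (crude, illustrative only).
SMILES_CHARS_RE = re.compile(r"^[A-Za-z0-9@+\-\[\]()=#$%/\\.:*]+$")
# Hints that a string is SMILES rather than a name: bond/branch/bracket
# symbols, or a ring-closure digit directly after an atom letter (e.g. c1).
SMILES_HINT_RE = re.compile(r"[=#()\[\]@/\\]|[A-Za-z]\d")


def cas_checksum_ok(query):
    """Validate a CAS Registry Number using its check digit."""
    match = CAS_RE.match(query)
    if not match:
        return False
    # Weight the digits (check digit excluded) 1, 2, 3, ... from the right.
    digits = (match.group(1) + match.group(2))[::-1]
    total = sum((i + 1) * int(d) for i, d in enumerate(digits))
    return total % 10 == int(match.group(3))


def classify_query(query):
    """Guess which kind of identifier was typed into the search box."""
    q = query.strip()
    if q.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(q):
        return "inchikey"
    if cas_checksum_ok(q):
        return "cas"
    if SMILES_CHARS_RE.match(q) and SMILES_HINT_RE.search(q):
        return "smiles"
    return "name"  # fall back to a text (name) search


for example in ["aspirin", "CC(=O)Oc1ccccc1C(=O)O", "50-78-2",
                "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "InChI=1S/CH4/h1H4"]:
    print(example, "->", classify_query(example))
```

The SMILES test is deliberately rough: a short SMILES such as CCO would fall through to a name search, which is why a real system would hand the candidate to a proper SMILES parser instead of a regular expression.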
That's running now out of Cambridge, because ChemSpider actually worked and the Royal Society of Chemistry came and acquired it. There's a chemical structure, there's some properties, there's a systematic name. This is a SMILES string. That represents that molecule. These are InChIs. I will tell you more about what they're about shortly. But they are text strings, often long text strings, that represent this molecule. There's a link to a Wikipedia article, but it doesn't make sense for us to pull the entire Wikipedia article and database it, because it's changing all the time. People are rewriting these things and adding information. So we let you read it. Here's a list of patents. If there's a new patent available, then it's going to find it on this search. These are linked out directly to patents. Down here we have a list of articles. These are articles on PubMed. PubMed is publishing multiple articles a day. This is a web service on PubMed. So if there's a new article out, it will show up. We don't cache this information for more than a few hours. So a new article on cholesterol will show up here. And this is only a subset of them, by the way. You can always click through to see the full list. Here's a list of supplemental information where there's additional information about it. These are articles that haven't found their way to PubMed, because not everything goes to PubMed. If it's not about medicine, then it doesn't necessarily belong there. There's some properties with direct links. You can see down here we have the actual link out to the original data source that said that is the particular value. And you can click through, and you'll find your way out into the original data source. And notice we've got melting points with units and without units. That's because of what the data source provides. Here's a whole set of chemical names, and there are a lot more. These are all names for cholesterol.
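The remark that article lists are not cached for more than a few hours describes a time-to-live cache in front of an external web service. The sketch below shows that idea only; the class name, the three-hour lifetime and the `fetch` callback are assumptions for illustration, not details given in the talk.

```python
import time


class TTLCache:
    """Keep fetched results for a limited time, then fetch them again."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the cache can be tested
        self._store = {}        # key -> (time_fetched, value)

    def get(self, key, fetch):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]     # still fresh: reuse the cached value
        value = fetch(key)      # missing or stale: ask the remote service
        self._store[key] = (now, value)
        return value


# Usage with a stand-in for the remote article service (hypothetical).
def fake_article_service(compound):
    return [f"article about {compound}"]


cache = TTLCache(ttl_seconds=3 * 3600)
print(cache.get("cholesterol", fake_article_service))
```

A short lifetime trades a little extra traffic to the remote service for article lists that are never more than a few hours out of date.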
These are validated names, the bolded ones. These are not validated. There's no highlight on them, they're not bolded. These are different types of database IDs. There's a long list of database IDs if I had enabled it. And it's just too much for this particular page. So we've got all of this information showing up on ChemSpider. It's .02 of a second, and it's got 23 and a half million molecules today. And this is linked out to 300 data sources. Let's go back here. So one of the other things we have on here are live data, so spectral data that get submitted. We don't run any spectral data. We can't afford a half-million-dollar spectrometer. So we'll let anybody who is running the data submit it to the system. It's fully interactive, similar to what Jean-Claude was showing. This is an open source applet where we allow people to zoom in and download the data if it's open data. The data sources where we aggregate the data from, where possible we link out to them directly. So we're building, if you like, a link file. In this case, this little icon here says it's a chemical vendor. If you want to go buy the material, you can go buy it. If you hover over the hyperlink that goes out, then it will give you a snapshot of what's on that page. In this case, you get a material safety data sheet in Chinese, English, and Korean, for cholesterol. Some of these MSDS sheets, I've seen up to 16 languages represented. It's not sitting on our server. It's a link out to somebody else's server. And we have 300 data sources that we're linking out to now. This is the Environmental Protection Agency, EPA: DSSTox. A metabolome database, the Food and Drug Administration; we are aggregating all of the data. When we do that, however, we've been hitting some interesting issues I'm going to tell you about. This is an example of how extensive this data can get. This is KEGG, the Kyoto Encyclopedia of Genes and Genomes. These are massive metabolic pathways, I mean, just massive.
You're scrolling around the page finding your way around. In this case, we clicked in through the structure of aspirin, which is just up here on the left hand side. Again, we can't host all this data because it's changing on everybody's website. Patents and structures. This is actually the new way we're going to be showing it in a couple of weeks. We can see US Patent Office, EPAB, European, Japanese patents are all there. The new way we're going to be showing articles. I showed you the simple way to search a name, a chemical name or a SMILES string. The reality is, chemists don't want to work that way only. That was magic. Because if you have a chemical that is named in 300 ways, how would you do that text search? You'd have to know all 300 names to do it. So you have to disambiguate through the structure in most cases. So we allow you to do structure searching, we allow you to do substructure searching. We allow you to search based on properties, the presence of certain elements, the absence of certain elements. We allow you to search a molecular weight range and on and on and on. There are incredible layers of complexity you can search by that people are using, but most of us are used to Google and Bing type searches, so most chemists come in and search based on name. Once you come to a ChemSpider record, you're sitting on a structure that is linked out to many things. I've shown you some of them: publications, analytical data, related reactions, Wikipedia, patents. The question is where should it stop? How big can this go when you have a structure-centric hub? Essentially it's all about what you want to plug into this. And it can be blogs and Wikipedia, wiki links, et cetera. These are typical questions that a chemist would ask in their language, standard English. What is phenolphthalein? What's the structure? What does it do? What are the side effects of it, what's the toxicity of it? All of those questions there can be answered today by ChemSpider.
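The property-based searches mentioned above (presence of certain elements, absence of certain elements, a molecular weight range) can be sketched as a filter over molecular formulas. This is a toy illustration under stated assumptions: the three records, the function names and the four-element mass table are chosen for the example and say nothing about how ChemSpider stores or indexes its data.

```python
import re

# Standard atomic masses for the few elements used in this example.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Turn a simple formula such as 'C9H8O4' into {'C': 9, 'H': 8, 'O': 4}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula):
    return sum(ATOMIC_MASS[s] * n for s, n in parse_formula(formula).items())


def search(records, must_have=(), must_not_have=(), mw_range=None):
    """Return names of records that satisfy every supplied constraint."""
    hits = []
    for name, formula in records:
        elements = set(parse_formula(formula))
        if not set(must_have) <= elements:
            continue            # a required element is missing
        if set(must_not_have) & elements:
            continue            # an excluded element is present
        if mw_range is not None:
            low, high = mw_range
            if not low <= molecular_weight(formula) <= high:
                continue
        hits.append(name)
    return hits


records = [("aspirin", "C9H8O4"),
           ("cholesterol", "C27H46O"),
           ("caffeine", "C8H10N4O2")]

print(search(records, must_have={"N"}))
print(search(records, must_not_have={"N"}, mw_range=(150, 200)))
```

Each constraint only narrows the result set, so the layers can be combined freely, which is the sense in which such searches stack "on and on".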
And it's getting better and better and better in terms of what we integrate. What's the compound? The top is a graphical depiction; it's a chemical structure. The one below it is a CAS number. The one below that, Oximonam, is a trivial name. The one below that is a systematic name. They're all the same. They're just different because there are different ways that you have to label things. If I say Oximonam, I'm probably a pharmacist. If I showed you the picture of the molecule, then you wouldn't know what to do unless you're a chemist. So that's why we have to work in these different interchanges. So because we're a structure-centric hub and we're linking out to the Internet, I'm going to bring together patents and publications and all of the data that's out there that we can link. We're now making a structure-based searchable Internet. The question is how good can it be. This is the type of linked data on the web that is showing up now with different services. So DBpedia is out there; ChEBI, which is Chemical Entities of Biological Interest from the European Bioinformatics Institute. We've got PubChem, we've got PubMed. All of these things are linking together with specific links such as names. But this InChI has shown up. I'm going to show you a little more about that. Aspirin, I'm sure most of you have heard of aspirin. Taxol. Anybody here not heard of Taxol? Oh, interesting. So it's a Bristol-Myers drug. It's rather difficult. It was isolated from the Pacific Yew Tree bark. It's a very powerful drug. It's a natural product. It's very complex. Where would you go find what the chemical structure of that is? If you went to Wikipedia to look for aspirin, it's very small, it's easy, it's correct. The original structure of Taxol in Wikipedia was incorrect. This one down on the left hand side, you probably know that one -- nothing personal.

>>: How would I know that?

>> Antony Williams: So a little blue pill from Pfizer.
You've got to question everything online. Why are you galling? [laughter] So DHMO.org. So this is a nice little website. Dihydrogen monoxide. You die if you ingest too much of it. Long periods of immersion and you will also die, because that's called drowning, because indeed dihydrogen monoxide, two hydrogens, one oxygen, is water. It's a hoax. It's a wonderfully well-done hoax. I suggest you go read it. It's so well done actually that if you read the Wikipedia article, you hear about the politicians that tried to get DHMO banned from industrial processes. True stories. Chemistry on the Internet is messy. You can imagine what might happen there that might get rather messy. You know, it's probably methane. [laughter]. For every action there's an equal and opposite reaction. He should be flying this way, right? [laughter]. So what's methane? Wikipedia. Simple organic molecule, one carbon, four protons, that is correct. The Wikipedia article has been validated by numerous people. If you go to PubChem, which is a government database housed by the National Institutes of Health, that's the correct structure of methane. It's actually labeled as charcoal. Now, if you throw methane on a barbecue in the summer rather than charcoal, you'll get the type of cooking effect you were seeing with the cow, but not quite what you would want. If you look at the long list of names that are associated with methane on PubChem, you will see diamond. That's not a particularly good one because diamond is not exactly a gas, and you could imagine handing over a gold ring with [inaudible]; it's not the right thing to do really. [laughter]. And graphite, also not methane and also not diamond. And buckyball is also listed. This is a database of chemistry. It was from PubChem. It's an excellent platform. PubChem is a wonderful platform for data. However, they're not responsible for curating it.
As a result, data has been showing up on PubChem for the past few years, and it's kind of public domain, so people have been taking it and putting it in their own databases. And now you have this proliferation of errors all over the place. It's actually quite shocking. Is that the right structure of vancomycin? Only some of you are chemists. But even for those of you who are chemists, I would not imagine that you could check every stereocenter very easily. That's a rather complex molecule. I'm sure a number of you have had vancomycin at some point. That is the correct structure. If you search PubChem then you end up with, I think, three or four pages of molecules called vancomycin. People are taking these data into models, they're building models of prediction, they're using it to source information. Which one's vancomycin? Which one should you use? Actually, the structure of vancomycin is primarily an assertion that comes from analytical data; it comes from who says it's what. We just released a publication that shows how many articles are published with incorrect chemical structures, and it's absolutely shocking. Good science to the best of their abilities, but still a lot of errors. So we've cleaned up a few hundred of late. We had inherited all of those errors about vancomycin on to ChemSpider, plus many others from other sources. So we actually had to go clean it up. You do a search on ChemSpider today, there's one vancomycin. It took three days and multiple e-mail exchanges with scientists at the EBI to figure it out. Now you go to our article, and it will tell you why we say this is vancomycin. Direct links to original publications. One would assume the experts would get this stuff right. This is a web page about harmful algal blooms. That's domoic acid, which kills people. You get shellfish poisoning. You would assume that's correct. In fact, it's incorrect. Every stereocenter on that molecule from the experts is wrong.
The bottom right hand side with the red arrow is domoic acid on Wikipedia, also wrong. Top right hand side is the structure of domoic acid from the American Chemical Society's C&E News article. Also wrong. Do you see a similarity between the C&E News picture and the Wikipedia picture? That's because C&E News is taking data from Wikipedia directly. Wikipedia is being used as an encyclopedia and an authority. One would hope in the future that you could trust all of that data. I believe you will. You will be able to do that. Domoic acid's cleaned up now. We've been working on curating every chemical structure that is sitting in Wikipedia. We checked every stereo bond, every connectivity; we've cleaned up lots and lots of errors. This is the correct structure of domoic acid on ChemSpider. The InChI is a way of representing the structure, as I said, in alphanumeric text. So here you see a couple of examples. It has the formula in it, it has isotope details, it has stereo layers; that whole string there can represent that full molecule. It's a very, very good way to enable structures to link together on the web. The problem is search engines will truncate very, very long strings. So if it's a very, very big complex molecule and you try to search it, it will just drop the end off. So now you're stuck. So they had to come up with a way to make that a little more able to happen on search engines. So they built a hash, an SHA-256 hash. You take the molecule, you create the InChI string at the top, you convert it to a hash, and now you have a fixed format for that molecule. There's two issues with that. One is you cannot go from the hash directly back to the molecule. And you also cannot go from the hash back to the string. You can only do it by doing a lookup. There's no way to reverse that hash. So what do you do? This is Taxol. All the way back to that Pacific Yew Tree bark natural product. As you can see, rather complex. Below it is the string.
Again, rather long. And below it is the hash. If you search across the databases on the Internet for Taxol, you will find different hashes. There are three different hashes for Taxol. Two of them are the same structure. One of them is different. And yet the three hashes are different. Why? Because you can have different settings when you generate the string. So that was a problem. So what they came up with was a way to create a standard InChI. It's a standard set of options that will always produce the same outcome for any of the databases, as long as the input is the same; the molecules have to be the same. Taxol was different by only one stereocenter. That's one position in the molecule. Does one stereocenter matter? Here we have two molecules differing in one stereocenter only. Anyone know what that molecule is?

>>: Thalidomide.

>> Antony Williams: Thalidomide, yes. One stereocenter does matter. That's what one stereocenter does. So who says what Taxol is? That's a challenge. It's assertions. If you look across most of the publications that are out there, many of them have got that structure drawn incorrectly. Timelines change, so a molecule published can be revisited a year later, and the structure has changed. The public data is full of these errors, and yet chemists would love to have a resource that they can trust. And the quality source today is the Chemical Abstracts Service. But it's not easy to access. It's expensive to access. You have to pay for the license fees. This is vancomycin. The correct structure of vancomycin. Wouldn't it be nice to be able to find all representations of vancomycin on the Internet? Well, now we go to this standard InChIKey. If there are databases being built, if there are patents being issued, if there are publications that are being written where the structures that are contained within them have standard InChIKeys associated with them, you should be able to go search.
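The two properties of the hash described above, a fixed length however long the input string is and no way back to the string except by lookup, can be shown with a small sketch. This is a toy stand-in and not the real InChIKey algorithm: the real key is built from truncated SHA-256 output encoded as uppercase letters in a 14-10-1 block layout, whereas `toy_key` below simply cuts a hex digest to 27 characters. The `Resolver` class and its method names are hypothetical.

```python
import hashlib

KEY_LENGTH = 27  # same overall length as an InChIKey; the encoding is NOT the real one


def toy_key(identifier):
    """Map an arbitrarily long identifier string to a fixed-length key."""
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    return digest[:KEY_LENGTH].upper()


class Resolver:
    """Lookup table from key back to the original string.

    A hash cannot be reversed, so the only way from key to structure
    is to have stored the pair when the key was generated.
    """

    def __init__(self):
        self._by_key = {}

    def register(self, identifier):
        key = toy_key(identifier)
        self._by_key[key] = identifier
        return key

    def resolve(self, key):
        return self._by_key.get(key)  # None if the key was never registered


resolver = Resolver()
methane = "InChI=1S/CH4/h1H4"
key = resolver.register(methane)
print(len(key), resolver.resolve(key))
```

Because the key is a pure function of the input string, two databases only produce the same key for the same molecule if they generate the same string first, which is the motivation for fixing one standard set of generation options.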
So what we've done is we've put this directly on ChemSpider and said if you find vancomycin and you want to find across the Internet what goes on, you can click on the first part; this piece right here is the skeleton of the molecule. If you include this part, you click on this part, then it includes all the stereochemistry. With all of that complex stereochemistry I would always suggest you search on the skeleton, because people mess up stereochemistry rather easily. So what do we get? If you search the full molecule for vancomycin, so click on the second part of the string, you find four hits only. Two of them are on ChemSpider, one of them's on PubChem. But don't forget there are three or four pages of them. So it finds one structure out of many. And then it finds something else in the chemical register. Vancomycin is a very, very common compound. So I would expect more than four. So if I search the skeleton, I find 104. I find 100 more. And they're all called vancomycin. These are all sitting on a public compound database. The top one is highly curated. And it's different. So the Internet is a mess. Somebody has to take the responsibility to try and connect it up and clean it up and feed all of the information back. By the way, when we've made changes to the database when we found errors, we communicate them back to the original sources, and literally 95 percent of the time they don't make any changes at all. The screen shots I showed of PubChem with charcoal and methane have been there for three years. I've given the public presentation 30 or 40 times, and they don't change it. There have to be some changes, I think, in there. This is something called the InChI Resolver, because to get from a hash back to a structure you can only do it through a lookup. So publishers are starting to layer InChIKeys on to publications; however, you can't convert the InChIKey to the structure.
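The difference between searching on the first part of the key (the skeleton) and on the full key can be sketched with a small index. The keys and database names below are invented placeholders, not real InChIKeys for vancomycin or any other compound; the only fact relied on is that the first block of an InChIKey encodes connectivity while the second block carries stereochemistry and other layers.

```python
from collections import defaultdict


def skeleton(inchikey):
    """First block of an InChIKey: the connectivity ('skeleton') part."""
    return inchikey.split("-")[0]


def build_index(records):
    """Index (source, inchikey) pairs by full key and by skeleton."""
    by_full = defaultdict(list)
    by_skeleton = defaultdict(list)
    for source, key in records:
        by_full[key].append(source)
        by_skeleton[skeleton(key)].append(source)
    return by_full, by_skeleton


# Placeholder keys: same first block means same skeleton; a different
# second block means different stereochemistry (or none drawn at all).
records = [
    ("database A", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("database B", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("database C", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    ("database D", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    ("database E", "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N"),
]
by_full, by_skeleton = build_index(records)

query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
print("full key:", by_full[query])
print("skeleton:", by_skeleton[skeleton(query)])
```

The skeleton search returns the entries whose stereochemistry was drawn differently or left out, which is why it finds many more records for a stereochemically complex molecule than the full-key search does.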
So we've had to build a resolver so you can search an InChIKey and find out what the molecule is. It's a public resource. These are the people we're depending on to grow the resource, to link in more information: scientists, students, and retired people. We've got retired curators working on ChemSpider today, from all over the world. This is a curation screen. It shows you some of the edits that are being made and suggested. Anybody, anybody in this room can come to ChemSpider right now, suggest an error, and click on comments and tell us what you think is wrong. And we have a gentleman who is a retired NMR spectroscopist in Germany, Heinz Cushone [phonetic], second one down. The third one down is somebody in China. I've never met these people. These are just examples of people who are contributing to clean it up. Multi-level curation. So I showed you what methane looked like on PubChem. Here's a whole list of names removed from that list that came from PubChem. They're still in the database but they've been scratched out. Citizens can become data sources. This gentleman, he's one of my colleagues from the Royal Society of Chemistry, but he's building his own data source on ChemSpider. So his is a little subset of the 23 and a half million compounds. He's got 72 of his own molecules. We were just having a discussion about how you can have a vanity site on ChemSpider. Myname.chemspider.com. It's a multimedia resource, so we host videos and MP3s. This is Theodore Gray blowing up titanium, making titanium. This is a University of Nottingham professor talking about titanium. When we build rich resources of structures with dictionaries of names, all these trivial names, synonyms, systematic names, registry numbers, then what you have is the ability to use it for semantic markup. Peter, will you talk about Oscar at all? Yes. Okay.

>>: Very [inaudible].

>> Antony Williams: Okay.
Peter's been working on a project called Oscar for a number of years, and the Royal Society of Chemistry uses it as the basis of their semantic markup. It means finding chemical names inside text, and there could be multiple other [inaudible]; it doesn't have to just be chemicals. Finding them, labeling them, and linking them out. In this case, Project Prospect gives you the ability to see the chemical structures in an article. It makes the data very discoverable. This is an example of marking up a Wikipedia article, again using an entity-extraction system; we can see some names highlighted. However, it misses a whole set of them: bosentan, fosphenytoin, diltiazem, erythromycin. I mean, these are pretty common drugs, but it misses them because the dictionary is incomplete. So you have to depend on good dictionaries. We built something called ChemMantis. You have a spider, ChemSpider, so ChemMantis just made sense. By the way, ChemMantis is Markup And Nomenclature Transformation Integrated System. We tried Chem Scrabble, but we couldn't come up with it to mean anything. [laughter]. So in a couple of seconds you can go in and go across an entire chemistry article, and you can find all the chemical names and link them out to ChemSpider, which takes you out now into the world of Google and Bing and publications and patents and chemical vendors. From an article today you could figure out where to buy the chemical, all the patents about the chemical, all PubMed articles about that chemical. It's all linked up now. It doesn't have to just be chemicals. It could be species. In this case, we'll link it out to Wikipedia articles about species, and it could just as easily be hardware vendors and software vendors. They're just dictionaries. So what would you want to link it to? We go back to that list of things that we were doing on ChemSpider. Once you've got the link off of your publication and into ChemSpider, you've just made the entire Internet linked by structure.
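The dictionary-driven markup described above, finding known names in text and linking them out, can be sketched in a few lines. This is not how Oscar or ChemMantis works internally; it is a minimal illustration, and the record IDs and the example.org URL template are invented for the purpose. It also reproduces the failure mode from the talk: a name missing from the dictionary is simply not marked up.

```python
import re


def build_matcher(dictionary):
    """Compile one pattern matching any dictionary name, longest names first."""
    names = sorted(dictionary, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(n) for n in names) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def mark_up(text, dictionary, url_template):
    """Wrap every dictionary name found in `text` in a link to its record."""
    lookup = {name.lower(): record_id for name, record_id in dictionary.items()}
    matcher = build_matcher(dictionary)

    def link(match):
        record_id = lookup[match.group(0).lower()]
        url = url_template.format(id=record_id)
        return f'<a href="{url}">{match.group(0)}</a>'

    return matcher.sub(link, text)


# Hypothetical dictionary: several names can point at the same structure ID.
names_to_id = {"aspirin": 1, "acetylsalicylic acid": 1, "erythromycin": 2}
template = "https://example.org/compound/{id}"  # placeholder URL, not a real service

text = "Aspirin, also called acetylsalicylic acid, and diltiazem."
print(mark_up(text, names_to_id, template))
```

Because synonyms map to one record ID, every name for a compound leads to the same structure page, and swapping in a dictionary of species or vendors changes what gets linked without changing the code.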
What we're trying to do is help people get away from having to draw structures themselves. Nobody should draw cholesterol again. If we've got it right, let them reuse it. So for those of you who know how to embed videos from YouTube, just take a little piece of JavaScript: go to ChemSpider, find the molecule of interest and copy the JavaScript code into your blogs and your wikis. JC's students do this all the time. They never draw the molecules anymore, as long as we have them on ChemSpider. So that's embed code. By using the embed capabilities and the web services we built around spectra, now they can play games. And the students are playing games looking at spectral data, and they find errors; they curate our data for us. We now provide a game to clean up our data. Tricky. It's great. So you come along, you choose which molecule fits that spectrum, it takes you to the next one. Ten spectra later we make it three molecules and then four and then five. We make it more and more complex, and the students that are playing the game win awards. Computers don't want JavaScript, they want web services to integrate things together. So we provided them. We're linking out in many ways now. So Notebook Science, Open Notebook Science. They're using those structure drawing packages, they're using those software offerings from billion-dollar organizations like Thermo, Waters, Agilent, Bruker. They're plugging those into their systems. iPhone apps are linked up to this. What we don't deal with yet: materials. Materials are tough. You can't draw a connection table. You can't draw a molecule very easily. Minerals are tough, polymers. And we don't intend to manage proteins. That's done well enough by other organizations. We're just going to talk about open data, likely open source. We're going to talk about open access. So ChemSpider's not open source. I'm going to thank Microsoft for being very kind to us and giving us MSDN licenses. It all runs on SQL Server.
And while we've had people suggest we move it on to MySQL, we could do that, but we can't deliver things as quickly as we need to by moving to MySQL. We're on a Microsoft platform. We use open source components. There's some great open source components out there. It's not an open access database, because open access is in most cases a publishing term. It's free. It's free to use. You can take data, you can use web services. It's not quote/unquote open access. We don't assume copyright when you give us data. It's your data. We're not taking it from you. And then this question is open data. Open data has been an interesting term for a number of years now. The Panton Principles we've already heard mentioned by Cameron. Peter's going to talk about them again. Who declares data as open? Everything that sits on ChemSpider cannot, by default, be open. It can't. And that's because we have organizations giving us those algorithms, and if we gave all of their data away, we would harm their business model. We have a pragmatic position. We're going to serve as a community resource and provide value. We cannot make everything free, because we're not allowed to. So it's free but not open. So where it is today: 23 million compounds, 300 data sources, 7,000 users a day, half a million transactions. While I've been sitting here twittering in the back, I've also been flowing data, and put 80,000 molecules in this morning that I collected in San Francisco yesterday. It grows daily. We're providing a platform that other people can use for their own needs. We have to keep cleaning; the data out there are filthy today. We've got millions of structures left to deposit, six million. We're now integrating RSC content. When a publication gets published by the RSC with structures, the data goes into ChemSpider at the same time. So that's going to flow out there together. We'd encourage all publishers to participate if they want to.
The Semantic Web for chemistry, we are trying our utmost to provide one of the pillars to use. Long list of people I could acknowledge. I'm out of time already. And SyntheticPages, for those of you who care about chemical reactions, we're about to release a public database of chemical reactions for others to contribute to. And this is my contact information. And the slides are already up. I hit upload one minute before I stood up to talk. So they're there if you need them. Thank you. [applause]. >> Lee Dirks: Perhaps just two questions so we can stay on schedule, if there's any questions. >>: So all this [inaudible] by the Royal Society of Chemistry, is it? >> Antony Williams: Well, originally it was run out of the basement, and it was self funded as a hobby. Now it's actually owned by the Royal Society of Chemistry. >>: [inaudible]. >> Antony Williams: Oh, they've been around a long, long time, yes. >>: That caused quite [inaudible]. >> Antony Williams: My best estimate of what it took for us to build it is about $25,000. And lots of sweat and tears. To sustain it, well, it's scaling. It's growing bigger. We have an IT team now that's second to none really. There's three of us that are full-time employees. There are different ways that we can look at generating revenue from this, but it will always remain free. We can do advertising and we can license web services. But the RSC is a charity, so they have a publishing arm and they have a charitable arm, so in many ways this is a giveaway back to the community because they're a society that is a charity. So yes. Fully sustainable really. >>: When you link out to a data source or to wherever you link from a structure, do you have a way of coping with broken links? Because a lot of the time they're going to break on you. >> Antony Williams: Yeah, link decay? >>: Yes, link decay. Yes. 
>> Antony Williams: So Bill's question is when we link out to a particular -- from a data source out to a particular link and if that breaks what do we do about that? So we're building systems so that we can actually go through and monitor for link decay. But you'll probably have to check things three or four times because sites can go off, you know, a day or a week type of thing. Right now we don't have that fully under control at all. It's okay for publications because we use DOIs, so we depend on CrossRef to do that. Wikipedia is unlikely to change its domain name very easily. But Jean-Claude Bradley, for example, I mean, tomorrow he might choose to stop doing Open Notebook Science. It's very unlikely. We just heard the guy talk, right? But he's been kind enough to put up his entire archive on Lulu. So we just bought a disk for five bucks and we'll set it up on our servers and all his links will be safe in our world. But chemical vendors come and go and things like that. So some of those links are going to decay. At which point we'll just disable them really. It is a tricky thing to do. >> Lee Dirks: Very good. >> Antony Williams: Thank you. >> Lee Dirks: Thank you very much. [applause]. >> Lee Dirks: And we'll let Peter get set up here. I don't know about you guys, basement of my house is full of boxes. I think I might have a couple of bottles of wine down there. This guy's changing the face of chemistry. It's unbelievable. It's a pretty amazing hobby. I would like now to hand it over to Peter Murray-Rust, who is one of the -- one of the signers of the Panton Principles and one of three I think that we have here today, and to give us a presentation on I think a broad variety of topics of the work that he's doing in and around this field. Over to you, Peter. >> Peter Murray-Rust: Right. 
Well, I'm delighted to be here, and I'm also delighted that this is being recorded because I don't use PowerPoint, and I don't know what I'm going to say, and I need to know what I've said after I've said it. So this I think is a very important meeting that coincides with a whole lot of things which are coming together in terms of a release of openness. I'm also delighted to see lots of people in the audience who I've known remotely, people like Bill Hooker and Heather Pivalol [phonetic] and so on, which is great. So you meet up there. I'm going to talk about open data. I'm going to go through quite quickly because I've got three things to announce today quite apart from anything else. I'm going to say something about the Panton Principles because Cameron didn't show the pub. I'm going to talk about Is It Open from the Open Knowledge Foundation, and I'm going to give a sneak preview of Chem4Word. So lots of things that I might talk about, and I will come back to these later and see if any of them I've missed. Linked Open Data is another word for the Semantic Web, another approach. And what is key here is both open and linked. And if one is going to have the machines running over the web, there must be zero friction. And in my view, the biggest amount of friction at the moment on the web is whether you are allowed to use that resource at the end without having lawyers send you some sort of letter. So what I believe is at the moment we can only do Linked Open Data if all the data are absolutely certified completely open. And I'll say what I mean by that. It's actually very easy. It's as if it has got an open data button from the Open Knowledge Foundation. So that is my full definition of open data. Rufus Pollock, Jordan Hatcher and John Wilbanks -- is John here? He is. Right. Have spent two years talking about this. They have solved the problem for me. They've gone into huge amounts of detail about this. I just accept they've got it right. 
So I just go ahead and say this data is open. There is a difference between open access and open data. You cannot take open access ideas and relate them to data. You cannot take open source and relate it to open data. And almost open, freely accessible are very valuable, but they are not good enough for open data. Now, I want to talk about software as an agent of change. We've seen how we can get things out with crowd sourcing, with communities, with all sorts of ways of doing things. Software is also a major way that one can push ideas. Because if everybody uses a piece of software and that software has got embedded in it a political philosophy or a social philosophy, then that will get out to zillions of people. I also want to say something about web democracy. Now, you've probably seen that the UK has torn its insides out over MP scandals and things like that. We do this very well in the UK. We agonize but that agonizing is a process of democracy which is being fueled by web tools. And I want to say something about what mySociety has done here. I also want to say something about a bottom up approach. I've been one of the founder members of the Blue Obelisk and this is a community which creates software, data and other resources with no membership, no constitution, no nothing. All that happens is it just meets from time to time and occasionally people get a Blue Obelisk. I want also to say something about text and data mining. I think understanding human natural language is going to be the next great thing in information. At the moment, Google, all the tools you've heard about at the moment can only recognize things if they understand single words or stock phrases or if people have worked very hard to program it into a template. I think that when we start understanding what people communicate in normal language it will be a big breakthrough in our use of information. 
And I want to say something about the fact that if you get the right system, it's near zero cost to build it. Now, Tony's talked about ChemSpider. Near zero cost. I'm going to talk about CrystalEye which runs at essentially zero cost at the moment. You can build very, very cost effective tools in certain circumstances. And finally I want to -- how many people here are from the library world? Yes. Yes, I thought so. Right. Right? Okay. Well, I'm going to say something. Libraries are not doing enough to make data open. Right? They are simply not putting their heart into saying this data must be out there. I went to a meeting last year on electronic theses and dissertations and I said can I have your theses and can I data mine those and I'm going to show you what data mining can do and they said things like you've got to write to every author and you've got to send it in on this form and all this sort of stuff. That is not Web 2.0, it's Web 0. And so you have got to find out how to get that data out there now when it's published. You know, there are no qualifications, no nothing. Get that stuff out because theses are the biggest resource that we are missing at the moment in science. Okay. A few people to thank. My own colleagues in Cambridge, I'll just leave it up there. I hoped to blog all this before I started, but I haven't been blogging for a little while. I will resume. But these are the people who have done wonderful things there. You've heard about Oscar. You've heard about understanding chemical names. There's a lot of chemistry here. And I make no apology because actually chemistry is the best subject to do the semantic scientific web on. And then a whole lot of other people who have contributed here. And you're going to hear about our involvement with Microsoft, which has been tremendously productive. So let me just say something -- show you a picture of Blue Obelisk. Much of the software -- all the software I use is open source. 
It's not all written by me or my colleagues, but it is part of this ecosystem that this community is providing. Now, my view is this is enormously liberating because it is not only inexpensive, like zero, but it is also something that you can take and modify and innovate with. You cannot innovate with commercial software. You can innovate with free software, free as in speech. Right. So I'm going to build this on a project which I've been very honored to be part of, which is Richard Whitby's Dial-A-Molecule from the University of Southampton. Richard has got a grant from the EPSRC, one of the research councils in the UK, for a 20-year vision. It's not funded for 20 years, but the vision is 20 years to build a system where machines can reliably, 100 percent, work out how to make a molecule and then make it, right, so that if you think this would be a good drug or this would interact with some part of the body or whatever it might be, you just tell the machine, go off and do it, and it will do it for you. So that's the goal. Now, I am running the strand which is the knowledge-driven approach. And this ties in with the fourth paradigm, the idea that much science from now on is going to be knowledge driven. What is out there already. So I'm going to show you how we get at what's out there already. And it was very clear that people didn't want simulations, they didn't want cunning algorithms and so forth, they wanted to know what was actually out there at the moment. So you've seen a lot of chemistry. I'm going to talk about reactions, not molecules. I don't know how many reactions are published either formally or informally. Yeah, I'm guessing it's, you know, in the low millions, something of that sort. Do you know, Antony? Tony? >>: [inaudible]. >> Peter Murray-Rust: Well, how many new compounds are published a year? >>: [inaudible]. >> Peter Murray-Rust: Several million. >>: [inaudible] you mean by published. >> Peter Murray-Rust: In Chemical Abstracts? 
>>: I think Chemical Abstracts do different things now because they're enumerating from patents -- >> Peter Murray-Rust: Well, anyway, it's an awful lot, right? [laughter]. And it really doesn't matter, it's zillions, right, okay? Many of these are repeated, which is very good because it takes us back to what Jean-Claude does about the fact that, you know, you don't always get the same answer each time. They come from three sources mainly, journals, theses and patents. And journals we've heard a bit about. Possibly the gold standard and possibly not. But the main problem you have with journals is the fact that most journals in chemistry are not free. So there is Wiley's statement about this, copyright Wiley. Copyright on the tables, copyright on the molecules, copyright on the spectra. The ACS are slightly less laid back about it. What is important? Subscription to an STM journal is an access to a database maintained by the publisher. In other words, all our data are belong to us. Right? Okay. So what you have here is the real problem that large organizations want to own data. And there's little sign of that going away at the moment. So we have to develop agents of cultural change, which are a mixture of sticks and carrots, and you've heard of some of them already, but we've got to put in place that future which will allow us to move away from this central gatekeeper control. So back to Dial-A-Molecule. What's a patent look like? Well, this is a bit of a patent. I'm guessing -- well, I know -- this I know, because I have a colleague who's working on this. He downloads approximately 3,000 chemical patents a year from the European Patent Office, and they contain about 100,000 syntheses a year. And we can read them all. So here is a typical chemical reaction. You've seen this sort of thing. 
Now, I'm not going to go through this, but our software, our natural language processing software can not only recognize the chemical names in that, it can parse the language that the chemist has used so you can work out who did what to which, when, and for how long and so forth. It understands every word in that sentence, which means that we now have a complete capture of that type of information. The patent itself is enormous, it's 270 pages. That's one patent. No human can read that, you know, if they're doing a lot of them. So here is a thesis. Now, one of the great things about theses is that there's much more detail about what doesn't work than there is in the published literature. If you publish a journal article and you say this doesn't work, they'll reject it. If you publish a thesis and you don't say this didn't work, your examiner will be on top of you. So there's a lot more in the thesis which is there. So here, you've got a bit of a thesis. This is actually one of our post-docs in Cambridge, Jurgen Harter with Steve Ley. And here you've got a reaction. Now, this reaction is -- I'm going to show it. Right now -- anyone recognize -- oh, my goodness. You don't want to see that yet. Right. That's a surprise. Okay. So here's this -- anyone recognize the software that's being displayed here? Well, don't laugh because I'm going to click on that. How many people recognize that bit of software? Right. Because that reaction is actually a partially intelligent object. And you can click on it and the reaction will burst into life. Oh, dear. The server application source file or item cannot be found. Now, what does that mean? That means that if you pay a certain vendor about $250, you would be able to read that reaction. And so what we have here is gatekeeping by commercial software. That thesis is crippled by the non-availability of software. So we have to do something about it. That reaction needs liberating. 
And the only way we can liberate it is to come up with open software which does the same thing. Wonder if anybody can see where this is going. [laughter]. Kill that one. No, I don't want to save it. And I'll put that one just down there for the moment. So anyway, here's a typical example of a failed reaction. It was not [inaudible]. Instead something happened. In addition, it was a somewhat unexpected product. These are words which are in natural language processing called sentiment and so we're doing sentiment analysis on this. We're trying to find out, you know, what the motivation was, what happened and so forth. So now we come down to what can we do about it. So here's the -- we want hundreds of thousands of reactions, we want them for zero cost, and we don't want any trouble, right? Okay. So here's our first effort. And Cameron didn't show this, but this is actually the Panton Arms. And you can see the people. That's Jenny Molloy who is a second year or third year Cambridge undergraduate. She's put a huge amount of effort into that. Some of the others you will recognize. But here's Rufus, right? This is Rufus Pollock. You know, the indefatigable progenitor of, you know, and hustler in the Open Knowledge Foundation. So he makes things happen. There's John and that's Jordan isn't it? I think. Anyway, there it is. This is a highly historic picture. You know. So that's the Panton. But also we wanted to know not only is the data there, you know, can we get the open data, but how do we know it's open? So here I went to something called whatdotheyknow.org. Now, this is freedom of information in the UK. And if we go to this -- and this is -- so this is actually a website, whatdotheyknow.org, and I can request a message from any organization that is required to respond by the Freedom of Information Act. And I've mailed the British Library. I mailed the British Library about why the British Library were charging for open access publications, which they were. 
Dear sir or madam. Now, this is there in public. And I asked that question, and because mySociety have created this tool, everybody can read that. So you can go to this whatdotheyknow.org and see who's requested what of bodies which are required by UK law to respond. And in fact I got this back, and somewhere in the middle of this I've probably got the results of this. What I thought is that wouldn't it be nice if we did the same for publishers? Is your data open? Right? Yes, no, possibly. So we put together a tool which is called Is It Open. And I think what I will do is probably go to another web browser. I -- here we are. This is better. So here we are starting to ask authorities, whoever they are, is your data open? So I have mailed David Wild and Christoph Steinbeck who are editors of the Journal of Cheminformatics. And here I mailed them early this morning. I am writing to ask you about the openness of data published in Journal of Cheminformatics, right? Now, it's an open access journal, CC-BY so, you know, it's algorithmically open if you like. But I thought we'd start with the easy ones. And what we've got back, here we are. Within, you know, an hour or two, the editor has replied, asking his editor, can we technically -- can we put an open data button on the Journal of Cheminformatics, because that button is the key to making data open. All we have to do is to put that button on to enough data sets, enough software tools, we put it into tools so that tools create an open data button, and then the problem is solved. So that's what I mean by software being an agent of culture change. Now, I'm going to show -- just before I show -- did I mention Chem4Word? Yes. Before I show Chem4Word, I'm going to say just a very brief thing about CrystalEye. This was a student project, a graduate student project built by Nick Day, and this is built by one person over about nine months. 
And what this system does is it trawls every publisher on the web who publishes crystal structures, brings them down every night and aggregates them. Now, it can't do every publisher because some of the publishers hide their stuff behind firewalls. So it doesn't do Wiley, it doesn't do Springer, it doesn't do Elsevier, but it does Royal Society of Chemistry, it does ACS, it does the International Union of Crystallography. This, you won't be surprised to know that this wonderful software on both sides here came from the Blue Obelisk. And let's just go through and look at this data. And what can you see at the top left of that page? Yes. You can see open data. And when you click on that, it goes through probably in finite time -- well, it's gone through I think -- there you go. So it goes through to the Open Knowledge Definition. That is open data. The machine knows that it can use that data there for any purpose whatsoever. Now, of course reading the literature in this way is actually a bit of sticking plaster. We ultimately want to create semantic data at source. We've heard lots of ideas here. We've got some projects in house which are looking at that. But the most important one here is Chem4Word. So I'm going to show you Chem4Word. Now, I'm just going to start off by showing here's a bit of thesis in ordinary Word. And you'll notice here things like bold numbers which don't mean anything. That is actually a reference to compound 155. We don't know what it is because it's about 50 pages elsewhere in the thesis. So it makes it incredibly difficult to read. So what we've done here, and I don't have time to talk about Lee and Pablo and Alex and Tony Hey and the others, is over a year and a half built a completely free open source add-in for Word. It fits into Word 2010. And if I now go to -- here we are. This is it. Here's what it looks like. Now, this is the same stuff here. But you'll notice now that when we move over anything in here it lights up. 
And there's a whole list here of all of the molecules in the thesis. So if I want to know where water is here, I can go there and click it, and it lights up there. This one lights up there. That one lights up there and so on. Now, this means that we know at each stage in the thesis exactly what it is. So this is a tool of great -- much greater power than the current tools for a student editing their thesis. So this is not only, if you like, going to cost zero, it's actually going to do more. And what we're looking at is ways in which the community can help develop it. Now, that's actually wrong. So what I'm going to show you is how we can edit it. Are we going to the editor? To do that, we have to have what's called a 2D structure. Now, notice here -- you probably can't read it, C11H18O4. But remember those three numbers. Right? And we go to 2D representation and here it is. So that's a molecule. I go to edit 2D and now I realize that those should be acids instead. So I click on that one. I'm not a very good clicker sometimes. There we go. And we'll put an acid on that one and we'll go to that. I'll put an acid on that one. Right? We'll save it. It tells us that some of the names have changed. It's inconsistent, so as Tony told you all about how names didn't relate to the right things, this is the sort of way that it happens. So this tool is able to tell you where your names probably don't relate to whatever. But now if you look at that, that's gone up to C11H8 -- what is it? It's increased the number of oxygens in that. So what we have here is we have a tool which is open, which anybody can download, which we hope the community is going to build new versions of without necessarily our permission, which then becomes a way of creating this chemical Semantic Web. Finally, I want to say a little bit about two other projects. I think I started a bit late, so I have two extra. 
The next one is actually the -- probably the first example of a designed small part of the chemical Semantic Web. It's relatively straightforward to put your stuff out there. It's more difficult for two or three people to share various components of that. And this is the OREChem project, which is sponsored by Microsoft. Lee's program director manager, whatever. And he chose four participants here, Penn State, Southampton, Cambridge, and Indiana University. And the whole is coordinated from Cornell with Carl Lagoze, who has developed a Semantic Web framework for academia which is called OAI-ORE. How many people have heard of OAI-PMH? Well, this is its sort of little sister or whatever, right? It's coming along. It will be very important. It is very important. Right? So the idea here is that Penn State reads stuff from the literature, does some crawling, indexing, sends it down to Southampton who then turn it into partially semantic form. That then comes down here. These little things are RSS feeds. It comes down to Cambridge. And we turn it into Chemical Markup Language and RDF, send it to Indiana, and Indiana then do high throughput computing on this. So one of the outcomes of this will be that we can compute every molecule that comes out on the web. Because there might be a million molecules a year, whatever it is. They might take -- some of them might take a week. But what is a million weeks on today's computers? It's almost nothing. You'll find places anywhere to do that sort of thing, and of course some of the companies are very keen to try this out. I will -- the last thing I'll talk about that we're doing here, because it came up in relation to things that Jean-Claude was saying, is the release of data which is not immediate. So Jean-Claude has got these categories which I hadn't heard about before, and let's talk about the category which is delayed. So category either complete or partial delayed. 
Now, the trouble is if you delay something, it's awful easy actually to do nothing, to say I'm going to do that later. So what our system here does, and you don't need to know the details here, it's an embargo manager called EmMa, and it controls when something is released. So the idea is you publish directly into EmMa at the time that you do it. So it's immediate but hidden. That means that you can take the same philosophy whether you're going to ultimately release it at all or whether you want to release it immediately. This manages the trust between the different components in the system so that some of this, for example, goes into an internal repository or a private Atom feed. Atom, by the way, is nothing to do with chemistry. It's an RSS protocol. And so again, open source. Anybody who needs an embargo manager, and I suspect most of you are going to need embargo managers, that will be available. So I'm just going to come back and review over one or two minutes the things that we've covered here. I hope that I made the case that if we are doing completely machine collection and analysis of data then it has to be truly open data along the lines of the Open Knowledge Foundation open data button. Because a machine cannot make the decision about reading lawyers' letters and licenses and things like that. So it's got to be completely open. We accept pragmatically that there will be many things which are not completely open. It does allow some principle to come down to granularity where you could even stick this on individual data items of some sort. We've come very clearly to conclusions, particularly Cameron and myself, that we want a very simple approach. An awful lot of time is spent in the open access community trying to work out what open access is. You know, it's actually what you do in many cases rather than huge amounts of discussion about it. And if you come to the end, which is CC-BY, it becomes trivial. 
And I would strongly suggest all of you, if you've got open access material, strive however you can just to make that CC-BY. It solves all your problems. You've actually got time to do something else. I'm very excited about the power of web democracy, of building tools which will change culture. You've seen what mySociety can do. You can ask questions of people. Organizations are going to have to be more responsive to public interest and so on. And that will cover all sorts of disciplines. And it means I think that the publishers are going to have to step up and actually answer some of these questions. Publishers cannot hide at the moment. We know one publisher who has taken over two years not to respond to a query. That's really not acceptable when they are charging academia huge amounts of money for the privilege of reading our material. I think text and data mining can be critical but until we solve the legal problem it's going to be problematic. We start with the bit that's easy. And I would therefore like to thank the organizers for inviting me. I am very pleased to be associated with the Science Commons and effort in this area. And also a great deal of thanks to Microsoft over the last three years who have moved an enormous amount. When we started this project, there was no hint of open source. We are now at the stage where there is enthusiastic pressure from Microsoft to release this as open source, and we're really looking forward to this change in culture where freedom means innovation and involvement. Thank you. [applause]. >>: Is Chem4Word open source at the moment? >> Peter Murray-Rust: Lee actually I think it -- Lee, if you answer that question, it's probably a more accurate answer. >> Lee Dirks: So we'll be releasing the beta for Chem4Word. It will be actually released -- we've been working on it, kind of co-developing it, over the course of the last year and a half. It will be announced that we'll release the beta next month probably at ACS. 
The beta will be made available open source. We haven't finalized the licensing terms but it's probably going to be an Apache license. And then once we're -- at that point, we're going to hand it over to Cambridge, and Cambridge is going to take that project on and move it forward. And it won't be -- we'll probably have a seat at the table, but they will be the, what, benevolent dictator of the -- of that project moving forward. >> Peter Murray-Rust: And Lee is the 800 pound gorilla [inaudible]. >> Lee Dirks: Thank you, Peter. [laughter]. [applause]. >>: What is CC-BY? >> Peter Murray-Rust: Right. Okay. This is a good time to go out on the web. So this is one of the great advantages of the web. So we'll go out and see if we can find CC-BY. I do have a slide that is actually quicker just to do this. All right. Creative Commons Attribution 2.0 Generic. It says you are free to share it, copy, distribute and transmit it, and you're free to adapt it. And you must acknowledge the work in the manner specified by the author or licensor but not in any way that suggests that they endorse you or that. So, in other words, it is you must attribute the source and that's it. Right? It's wonderfully powerful. So, yeah? You -- sorry. Was there another question? >>: It's pointing to you first. >>: Will there be a Chem4OpenOffice? >> Peter Murray-Rust: Good question. We have two technical problems here. One is the question of language, right? So Chem4Word is written in C#, which is an open language but is largely used on Microsoft platforms, right? Whether that would work in an OpenOffice environment, I don't know. Secondly, there are certain things which are called in OpenOffice -- in Chem4Word which are probably not present in OpenOffice. Right? There is no reason why it couldn't be reverse engineered for OpenOffice. We are not funded to do that. I think, you know, personally I think it would be a good idea to do that if there is a need for it. 
>>: You were talking about the embargo. What sort of time periods have you all been talking about? >> Peter Murray-Rust: Well, normally it's actually until time of publication. So normally what a chemist wants to do -- now, of course we do have some problems here, but let me just go back to the picture of it. Normally what a chemist wants to do is to keep their data safe, know that at some stage they're going to do something with it. They don't necessarily know what they want to do. They certainly don't want it released before publication because some publishers will actually then regard the publication as invalid and probably reject their next paper or whatever. It's actually quite a difficult thing to know when a paper has been published. You know, a young researcher may very well read every odd issue of whatever, scan the web pages, but there's nothing that triggers this has been published, right? There isn't a place, you know, where triggers come back to the author. So what I think EmMa can probably do, although there's no immediate mechanism for this, is to respond to that trigger when something's being published. So this would also work for things like theses, so clearly when the thesis has been accepted and put in the library, then that's the time at which data might be released. So sometimes it might be, you know, six months, but it's more likely to be until such an event has occurred. >>: So are you talking about just the experiments [inaudible] the paper or the associated experiments as well? >> Peter Murray-Rust: It's whatever we can persuade people to do. I mean, we have an approach in Cambridge where we have invested in a commercial electronic lab notebook but we are managing independently our own NMR and x-ray data and so on. Now, it's my hope that all of the x-ray and NMR data goes into this anyway, right? There's a question as to who -- some problems you have, who owns it, how do you identify people? 
It's not actually trivial to identify people, authorities, groups, and things like that. So all of this is bringing out the -- you know, the problems in the human metadata. The scientific metadata is straightforward, it's the human metadata that's [inaudible]. >> Lee Dirks: Are people ready for lunch? [laughter]. >> Peter Murray-Rust: I'm sure they are. >> Lee Dirks: All right. Thank you very much, Peter. [applause]
