>> Lucy Vanderwende: Okay, so today it's my great pleasure to introduce Anita de Waard from Elsevier. I know that you have a new title, but your previous title was Disruptive Technologies, which appealed to me very much. >> Anita de Waard: Yes. >> Lucy Vanderwende: It's just intriguing to me to be able to work on disruptive technologies at Elsevier. I got to know Anita while we were slowly working on defining a shared task for summarization of scientific literature. I come from the summarization world, and Anita comes from a very deep understanding of scientific literature and really brings so much expertise to this task that we're defining together. Anita's background is varied. She first worked in experimental physics, and that's all I can say on the topic, and I won't try to say anything more. But you have worked as a publisher of physics and other literature since 1988, which gives you a lot of background on this topic, and today you're going to be talking about supporting scientific sense-making. I'm very excited to introduce Anita. Thank you. >> Anita de Waard: Thank you. Wow, beautiful introduction, thank you, Lucy. And thank you so much for having me over. I think this has been in the works for a while and I'm really, really happy to now be here with David. Okay. So do I need to wait at all for any online -- >>: Nope. >> Anita de Waard: -- people, or just venture onwards? All right. >>: Would you like questions throughout? >> Anita de Waard: That's fine with me. Please do, because I'm touching on a bunch of different topics, and otherwise people might lose their thread -- the thought thread. So I really want to talk about three main topics. But hopefully it's all one coherent story; at least in my head it is, so if that doesn't come through, please ask me how these topics are related. I'd like to tell you a little bit about a model of scientific sense-making that I've developed over the years, which I call stories that persuade with data.
So essentially what scientists do when they write papers is they write stories which are intended to persuade, and they do so using data. So I'll explain the different components of that. I'll talk a little bit about the work with the University of Utrecht linguistics group that I'm working with to analyze the minutiae of scientific discourse, in particular looking at verb tense. A lot of this work, to me, goes into formulating a way to view collections of scientific reports -- scientific documents, scientific papers -- and kind of look over them as what I'm calling claim evidence networks. It's called that in the literature, although it's also called different things. But essentially, finding ways to make sense of large collections of scientific papers as a representation in claim evidence networks. And I'll talk a bit about an essential component of that, which touches on the project that Lucy was talking about: the topic of hedging, of wrapping your claims into a certain linguistic representation of uncertainty. And I'll briefly touch on some projects that we've been involved in to create claim evidence networks. And lastly, I'll talk for a bit about data. This is how most scientific arguments are validated, how they're motivated: with data. I'll say a little bit about the little understanding that I have about biology, which is that it's really, really, really complicated, and just mention a couple of ways in which I've understood it to be. So I'm not a biologist, but I'll tell you a little bit about that, and some thoughts on collecting biological experiments into collaboratories. So these are sort of three headers, but like I said, hopefully they interconnect, and I'll start with the first one.
So if you look at a scientific paper and you start reading it -- and actually a large part of my motivation to look into scientific papers was that I was working with somebody called Kierz [phonetic] a long time ago, and he said, well, you know, scientific papers are essentially just stories. They're fairy tales. Once upon a time there was an experiment, there was a research question, and the research question needed answering, and so the research question knocked upon a door. And I got a grant in 2006 from the Dutch NWO, the foundation for scientific research, to really investigate the proposition of what is the structure of scientific papers. And I started reading a lot about narrative theory, and particularly the concept of the story grammar, which came out of AI in the 70s. I don't know how many of you know it -- maybe Lucy knows about it -- but you know, Rumelhart, Thorndyke, these kinds of folks. And then the very influential work by Propp in Russia, who has models of fairy tales. And essentially they all come up with this very same setting for a story or a fairy tale. And this is only the beginning of it; it goes on, there's more on the page there. But generally, you start with some sort of setting, then you have a theme, then you have a series of episodes -- in stories or fairy tales these are generally between two and six episodes -- then there's some resolution at the end. So -- anybody not know Goldilocks? So Goldilocks: girl walks into an empty house, the bears have just left, she tastes the porridge, she eats the little bear's porridge, she tries the chairs, she breaks the little bear's chair. She goes upstairs and ends up falling asleep in the little bear's bed. And then the bears come home, and they're like, ah, who ate my porridge? Ah, who broke my chair? Ah, who's sleeping in my bed? And she runs out. So it's a typical fairy tale.
And I think what's really an interesting exercise, if you think about the fairy tale and you think about knowledge representation: what would it look like if you were to represent Goldilocks in a database? So take a moment and imagine the database schema, right? It's not self-evident -- unless somebody immediately can picture that. But it's not self-evident because there is a point to the sequence. There is a point to the fact that you see this empty house, this little innocent girl. Then you have three episodes that occur -- the eating, the sitting, the sleeping -- and it kind of all piles on top of each other, and at the end when the bears come, it's dramatic. You know, if she just came in and ate the porridge and the bears came home, it wouldn't be such a big deal, but it's the porridge and the chairs and the bed that kind of adds to the suspense. It's very hard to say what the story is really about, but there is something exciting to it that humans at every level grasp. And so a paper is a lot like that. Right? In a research paper, you generally have a background. The protagonist, I believe, is either the object of study or the research question. So you can actually build a story frame around a research paper where you say it's really the research question looking for an answer. And the research question then tests itself, like a protagonist in a fairy tale. In scientific papers -- in biology; I'm talking now specifically about biology, and I can talk a bit more about the difference between biology and other subjects, but I've focused only on biology -- there are generally between two and six experiments. And these are all places where the research question does something in the world, is somehow tested. It's like the research question tries all the different porridges and finds which one it likes, right?
So you have a similar buildup, and you have these little mini-stories that are the little episodes that describe every experiment. And in a paper, as in a fairy tale, there is a story that's woven throughout. The protagonist, the research question, starts in one state and ends in another. And the answer at the end comes because the protagonist has actually been changed by its interactions inside these experimental setups. And a good, convincing paper says not only did we find -- you know, whatever, overexpressed dAtx flies -- but we also found this and we also found that. And it kind of whips you around the head with all these different little stories that happen, that all add up in the end. So I think this is sort of an interesting thing to look at. Another model that I think is fascinating to look at is the model of Aristotelian rhetoric, or Quintilian's rhetoric. Quintilian was an orator in Rome, and he has a beautiful, very clear description of what are the parts of a speech. And if you look at the different parts of the rhetoric as both Aristotle and Quintilian put them forth, again you really recognize them, both in scientific talks but also in scientific papers: how there are different parts that, for instance, appeal to different aspects of the audience's brain, right? So you have an appeal to ethos in the beginning; you want to establish credibility. Typically in the beginning of a paper, you will just make it clear that you know what you're talking about by dropping a lot of references, by using the right type of jargon. You will position yourself, but in doing so also show that you know what you're talking about. Then there is the appeal to logos, where you really make your logical case in the middle, and in the end you have an appeal to pathos, where you appeal to why this truly matters and try to appeal to the emotions of your reader as well.
So I think there are a lot of similarities, and at one point I thought, it's really funny, it's kind of like with viruses. The goal of the paper is to be published, and it essentially uses the author and the journal as the host. And essentially the format of the paper has co-evolved; it's in a predator-prey relationship with its reviewers, right? So in areas where there's heavy hunting, where there's a lot of pressure on the prey, there are very, very tight guidelines for what a paper looks like. If it's a very sparsely populated area, not such a busily occupied niche, like in paleontology, papers have very, very different formats. But if you're in a high-throughput area like bioinformatics or microbiology, you have a very clear way in which the paper must be. And like in nature, only the speediest gazelles get through all the hoops that they need to jump through to get into Nature. So the format is much stricter in these very aggressive fields, which is kind of interesting. And I think essentially the formats of these stories and these narratives -- there's something innately human about them. We have always been telling stories in this format, and I'll get back to one component of that a little later on. So, data. I'll talk more about data later on, but I just think it's fascinating that in biology, generally -- I started for a while to try to make argumentative representations of biological text, but it always ended up that in the end, what was the way to convince somebody? It was figure 2A. Right? It's always figure 2A. So there's some statement, and the statement, like here, dAtx1 contains the AXH domain but not a polyQ tract -- that's the main claim. What's your evidence? Figure 1A. Right? Or, as figures 2A and 2C show -- again, a claim -- this is necessary.
This shows that this is expressed, figure 3A, right? So essentially you jump to a non-textual representation. The argumentation turns out to be very simple. It's always figure 2A, or data not shown, which is also a great one. It's really great: there are many, many claims in biology that are motivated by data not shown, if you look through the literature. But the point, I suppose the idea, is: if I could show you what it actually looks like. And so you're jumping to a representation of the knowledge that's outside the text itself. I guess that's really the point I wanted to make, and that's really, really key in biological argumentation. So, I'll talk a little bit about my more fine-grained work. I was interested in this story grammar, but then I thought, yeah, but there's so much happening inside a sentence. And coming from physics, and not knowing anything about linguistics, I was amazed at what happens inside even a single sentence. I should have said this beforehand, but my goal was to represent scientific papers in some way that would be more easily digestible by computers, that could be more easily summarized, more easily concatenated, exchanged, etc. And I foolishly thought, well, you know, you just get the nouns and the verbs, like many people do. And then I started to realize that it's not that simple. And in fact, as I went on, my own units of study, the units of discourse I was looking at, became smaller and smaller and ended up at the clause. So they didn't become as small as a word, or a phrase, but ended up as the clause. And here's my little tiny defense of the clause. Oh, let me go back for a second. These are a couple of sentences from a paper that I've analyzed, from Voorhoeve in 2006. These are four contiguous sentences from that text.
But if you look, you can see that there's more than one thought unit per sentence. You see that the verb tense changes, and I'll get to that in more detail in a moment. And there are all kinds of things happening: there are attributions, there are actions, there are propositions, all within the sentence. But if you look at the clause level, I think it already starts to make a lot more sense. So this is from, I don't even know what linguistics textbook I got this from, but generally the theme, the main idea -- you can probably name the author, probably Halliday and that whole area, right? SFG, probably, I think? >>: Yeah, it's Halliday -- >> Anita de Waard: Right. Right. I should have put the reference there, sorry. But Lucy could probably tell you more about this. Anyway -- who said that essentially, the head of a sentence carries the premise, the motivation, and the attribution, and in scientific writing it's often a matrix clause. So this is a clause of the form "these results show that," and then you get the rest of the sentence. Or, you know, "Vanderwende et al. have demonstrated that," and then you get whatever follows the that, right? So you see that here. Typically, you have the premise or the motivation, like in sentence two, "to shed more light on this aspect" -- so this is a goal clause: why did we do this thing? And then at the end of the sentence, you have the interpretation and the elaboration and the attribution. References come at the end of the sentence; that's often where we give attributions -- ", said Lucy," you know, that kind of thing. And in the middle, what you generally find is your main biological statement; your main assertion happens in the middle of this. These are just four examples, but you do find it a lot. What I've done specifically is to look at these clauses and assign specific clause types to them.
I made a little taxonomy of different clause types and looked at how these clauses are linguistically similar. And I'll show you a little bit about verb tense in a moment, but just to show you: you have these to-infinitive clauses, which are goal clauses that you see a lot. Generally they're at the beginning of the sentence, often in the beginning of the paper. They indicate the motivation behind the experiment. There are typical regulatory clauses -- and again, I'll get back to those in a moment because they're incredibly important for sense-making aspects -- "our results indicate that" or "figure 4A shows that." These have a lot to do with attribution: how do I know what I'm about to tell you? And then there are methods, of course, which is what we did. This is where you see the first-person pronouns, always in the plural in biology -- you can never say I, it's always we -- but it's what actually took place in the lab. And they're generally in the past tense or in a gerund form, like you see in sentence four. And then I'm differentiating between results and implications. For those of you who are aware of Gully Burns' work -- anybody? -- his KEfED model: I think he makes a good distinction between observational assertions and interpretational assertions. The observational assertions are what you directly see, and the interpretational assertions are what you think that means, and I think you can find those quite clearly. So what I essentially did was look at sentences -- these are different sentences from the same paper.
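The clause types described here can be sketched as a small surface-cue tagger. This is a toy illustration, not the talk's actual taxonomy or tooling: the cue patterns and labels are my own stand-ins for the goal, matrix/attribution, method, and conceptual-fact clause types.

```python
import re

# Toy rules for the clause types in the talk: goal ("to investigate..."),
# matrix/attribution ("these results show that..."), method ("we did...").
# Anything unmatched defaults to "fact", i.e. a conceptual statement.
CLAUSE_RULES = [
    ("goal",   re.compile(r"^\s*to\s+\w+", re.I)),  # to-infinitive clause
    ("matrix", re.compile(r"\b(?:results?|figures?\s*\w*|data)\s+"
                          r"(?:shows?|showed|suggests?|indicates?|demonstrates?)"
                          r"\s+that\b", re.I)),
    ("method", re.compile(r"^\s*we\s+\w+ed\b", re.I)),  # "we generated..."
]

def tag_clause(clause: str) -> str:
    """Assign one of the illustrative clause types using surface cues."""
    for label, pattern in CLAUSE_RULES:
        if pattern.search(clause):
            return label
    return "fact"

print(tag_clause("To investigate this matter"))               # goal
print(tag_clause("These results suggest that miR-372 acts"))  # matrix
print(tag_clause("we generated cells expressing the gene"))   # method
```

A real system would of course use a parser rather than regular expressions, but the point is that a handful of very regular surface forms carry a lot of the discourse structure.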
I parsed them into clauses and looked at the verb tense -- again, these are contiguous sentences from the Voorhoeve paper -- and gave them a clause type from this little taxonomy I came up with. And essentially what you can see, or the way that I see it, is that there are two types of knowledge that are conveyed, and in the text you're constantly going back and forth between these two types of knowledge. One part is the conceptual knowledge. These are statements about how the universe is perceived to be, like "seminomas and the EC component of nonseminomas share features with ES cells." These are things that are presumed to be true and presumed to be known; they don't need any attribution. And, like at the bottom, "miR-371 expression is a selective tumorigenesis" -- if you ignore the rest of the sentence, that's again a statement about the conceptual realm. Right. And then there is the other type of knowledge, which is the experimental knowledge. This is "what we did," "we tested," "was undetectable" -- things that happened or were directly perceived by the authors. And then there are specific sentences to go from one realm to the other. So if we take the second sentence but turn the picture on its head -- and I'll go through this quickly -- essentially you can see how these two realms differ: the conceptual sentences are all in the present tense and the experimental sentences are all in the past tense. And that is, I believe, cognitively also how we keep them apart. We did a small user experiment where we gave people result sentences and fact sentences, and we changed the tense, and they would interpret a result as a fact and a fact as a result when we changed the tenses. Quite obvious. So the tense indicates whether this was something that was experienced directly by the researcher -- so it was an observational assertion --
something they either did or perceived -- or whether it's something that pertains to this conceptual representation of the field. And then there are specific verb forms, like the to-infinitive, which essentially gets you from the conceptual to the experimental. So: this is the case. To investigate this matter, we did this. Past tense, right? And then you can go back, and generally you have a matrix clause, which can be with a gerund or a present tense: "these results suggest that." And then you go back up into the conceptual realm. And essentially, these little cycles happen over and over and over again inside biological texts. They also happen at a macro scale: the introduction is generally more conceptual, the methods and results obviously are more experimental, and the discussion is more conceptual. So they happen at a macro level, but they also really happen locally. Here we have, I believe, three sentences, and within the span of those sentences -- you can even have one sentence in which this occurs. So these little cycles of taking your conceptual statement, dipping it into experimental experience, and going back up to make your conceptual claim happen over and over again in biological text. And what's kind of funny -- I talked about this with Ed Hovy, in fact, and I said, you know, is there any way we can figure out whether this is deeply, innately true? Is there something that says that conceptual things are always in the present and experiential things are always in the past? And he said, if you wanted to make a real statement about that, you'd have to find some other type of corpus that would also have different tense uses, and then maybe you could say that, yes, this is how people think, or something. So I've only dipped my toe into this a very, very small bit, but in fact, if you look at mythology, it's very similar.
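The present-tense-conceptual versus past-tense-experimental split lends itself to a very simple classifier sketch. This is a minimal illustration of the idea, not a real implementation: a production system would use a POS tagger to detect tense, while here a few hand-listed verbs stand in.

```python
import re

# Hand-listed verbs standing in for real tense detection: past-tense
# verbs signal the experimental realm, present-tense verbs the
# conceptual realm, per the talk's observation.
PAST = {"generated", "tested", "found", "spent", "was", "were", "showed"}
PRESENT = {"is", "are", "regulate", "regulates", "share", "shares",
           "suggest", "suggests", "indicate", "indicates"}

def realm(clause: str) -> str:
    """Label a clause 'experimental' (past tense) or 'conceptual' (present)."""
    for token in re.findall(r"[a-z]+", clause.lower()):
        if token in PAST:
            return "experimental"
        if token in PRESENT:
            return "conceptual"
    return "unknown"

print(realm("We generated cells expressing the gene"))  # experimental
print(realm("Small RNAs regulate gene expression"))     # conceptual
```

Even this crude cue captures the back-and-forth cycle the talk describes: a run of clauses alternating between the two labels traces the dip from conceptual claim into experimental evidence and back.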
So you have facts in the eternal present -- it's called the gnomic present; in fact, in classical Greek, there's a specific present tense that has to do with the qualities of the gods. So, "Hera, queen of the immortals, is she; she is sister of loud-thundering Zeus." This is a statement about how concepts are arranged. The concepts in these cases are of course the gods and their properties, similar to "small RNAs regulate gene expression by mechanisms conserved across metazoans." Everybody knows this, right? This is our representation of truth that we live with. Then you have events that you've experienced, and they are in the simple past. "Vehicle-treated animals spent equivalent time investigating the juvenile," or, from the Odyssey, "now the wooers turned to dance and to gladsome song and made merry." These are experiences that happen on earth, as it were. This is direct human experience. This is our event past. You can have events with embedded facts. So, "we generated BP [phonetic] cells expressing the gene, which is only active when tamoxifen is added" -- maybe I should point to this, I don't know if that's necessary, but let me see how that works. So you have a little embedded clause that refers to the realities that we all share. And similarly, here's something about Hera, I think: "she took her mighty spear, wherewith she vanquishes the ranks of men, of warriors." So again, you have this knowledge about Hera and about Hera's characteristics embedded within an event, just like you have the knowledge about the chimeric gene, which is embedded within the event, the actual experimental event that you had. Similarly, attribution, which is kind of interesting, is in the present perfect.
If you look at the verb tense for attribution sentences, like "has been shown," that is typically in the present perfect. And interestingly enough, this is Snorri Sturluson, from the Norwegian sagas: "I have had old stories written down as I have heard them told." And then implications, generally, are again in the present tense: "it is said that whenever the camel sees a place where ashes have been scattered, he wants to get revenge." This is from Mongolian mythology. And similarly here: "these results indicate that," "this confers complete protection." So you see also in the verb tense that after telling the stories that happen, sort of, on the ground, we know more about what rules govern, you know, life on earth, so to speak. So I won't go so far as to say that concepts are the mythology of scientists -- you could offend a lot of scientists that way -- but I think cognitively we have similar systems, where we talk about conceptual realms that we presume to be true, and we're trying to make sense of how those laws that govern our lives work by interpreting the experiences we have on earth. And you see that in tense use quite clearly. All right. Anyway, how does all this relate to science publishing? So I think one small lesson for those of you who are text miners: please, please, please do not throw away tense. Do not ignore tense. And if at all possible, look at the clause level -- I think it is very important, so do so. And I guess that's my take-home from this part: tense matters a great deal. So one thing that's very interesting is if you look across collections of documents. So now you have this claim that has been made within a document: after you've taken your knowledge of facts and your questions, and then you described your experiments, now you have a claim. Some sort of claim.
Like Voorhoeve et al., this paper I was just showing you: they have a claim about microRNAs 373 and 372, and they say they neutralize the p53-mediated CDK inhibition, whatever that means, but they don't know exactly what the process is. But they're saying possibly this happens through direct inhibition of LATS2, some entity. And "possibly" is a hedge. So they cannot say "we have shown unequivocally" -- and there are different reasons for that; I'll get to the dynamics of hedging in a moment. This is just to explain what a hedge is, a typical hedge. What you see if you look through the literature, when this paper gets cited: Kloosterman and Plasterk -- for those of you who know Holland, Plasterk is now our minister of education, I believe, so it's quite interesting; it's actually one of the papers that he wrote on the topic before he was the minister of science and education. So he refers to the Voorhoeve paper and he copies this "possibly," right? This is in the same year, in 2006. Then there are a number of other papers. But in 2011, for instance, you see that now this has become a factual statement: "two oncogenic microRNAs directly inhibit the expression of LATS2," period. There's no "possibly" there anymore. And I happen to have looked into this case, and there is no other paper investigating this relationship between these microRNAs and LATS2; in fact, I think we did a little study on this. So, simply by being cited -- and hence the quote from Bruno Latour, from Laboratory Life: you can transform fact into fiction by adding or subtracting references. If a claim gets cited enough -- the claim is hedged at the beginning, but as it gets cited, this hedging erodes, and simply by being cited, the claim therefore turns into a fact. And Bruno has this other beautiful, beautiful quote, which is: a fact is a claim that has been agreed upon by a committee.
And so many, many scientific facts are simply based on being cited, and not on other experiments validating a certain statement. All right, so that's just kind of the point. Just a little excursion here about hedging, and I'll zip over this -- I have a whole bunch of references for those of you who are interested in this. Why do authors hedge? This is from the area of genre studies: authors hedge to make a claim that is pending acceptance in the community. Creating a research space -- this is Swales -- essentially, you need to show that there is a problem and insert yourself into the discourse. And you need hedging to do this. You can't just go and say everybody else is crazy. You need some hedges; there's politeness, but there's also kind of a knock on the door saying, perhaps there might be an issue here. And then Myers says it's the strongest claim a careful researcher can make: you cannot actually really know that your conceptual interpretations of experimental findings are true. You simply do not know that, so you need to hedge for this. There's a lot of work in computational linguistics -- and you might know a lot about this -- about finding hedging cues and speculative language and modality and negation, and a number of workshops. I think Light et al. is the first reference that's generally mentioned, on finding speculative language. Wilbur et al. have five forms of hedging that they identify, and then Thompson and Ananiadou also looked at levels of speculation, types of source of the evidence, and level of certainty. And again, it was Ed Hovy who pointed me towards the similarity to sentiment detection, right? In product reviews and such, there's often a lot of sentiment detection: there's the holder of the opinion, the strength, and the polarity. And what I really got from Ed -- and it was very insightful to me -- was this concept of the hedge as a mathematical function that's acting on the key assertion.
So essentially there is a claim, but it's wrapped around with the hedge. Well, I looked a little bit at how you could actually model this. And this is more or less borrowed from sentiment detection, but using some of the taxonomies in the other papers. If you have some sort of proposition, some sort of statement in biology, it's epistemically marked as an evaluation, where you have essentially three dimensions of epistemic modality, the truth value of it. One is the value, which can be assumed true, probable, possible, or unknown. This is taken largely from Anyaru's [phonetic] work; they have this four-part scale, I suppose. Then you could also have the negative ones, but the interesting thing is that in biology it is largely unstated if something is not true, so you find very few instances of this. Then there's the basis: you can say something is based on reasoning, or it's based on data, or -- there should be a zero category with that as well -- it's just not indicated. And then there's the source: the speaker can be the author, explicit or implicit. "We found that," "we believe that," that kind of thing -- or it can be implicit: "these results show." So there can be an explicit or implicit author, and again there should be a zero category added to that. Oh, clicker. And actually, with Jodie Schneider from DERI, Jodie made a very nice little ontology of this. And the purpose -- let me go ahead one, nope, sorry. I wanted to show you this; we'll get to the others in a moment. The purpose of doing this is what you can now do if you have this ontology: you can take statements that are biological statements -- and this is with a BEL markup, for those of you who know BEL, from Selventa; it's a project out of Boston, and they have a formal representation of biological statements -- let me see, here.
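The three dimensions just described can be sketched as a small data model wrapping a proposition in its epistemic evaluation. This is an illustrative sketch only: the field names and value vocabularies are my own, not those of Schneider's actual ontology or the BEL schema.

```python
from dataclasses import dataclass

@dataclass
class EpistemicEvaluation:
    value: str   # "assumed_true" | "probable" | "possible" | "unknown"
    basis: str   # "data" | "reasoning" | "unstated"
    source: str  # "author_explicit" | "author_implicit" | "cited"

@dataclass
class AnnotatedClaim:
    # The proposition would be a formal statement (e.g. a BEL-style
    # triple); it is kept as plain text in this sketch.
    proposition: str
    evaluation: EpistemicEvaluation

claim = AnnotatedClaim(
    proposition="miR-372 inhibits LATS2",
    evaluation=EpistemicEvaluation("possible", "data", "author_explicit"),
)
print(claim.evaluation.value)  # possible
```

The point of the shape is exactly the talk's argument: the triple alone loses the modality, while the wrapper records how strongly, on what basis, and by whom the claim was asserted.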
So this is the BEL representation of this statement. But what BEL and other formal representations of biological knowledge never model is: what is the modality of this? Right? And here it says that this happens, and this is possibly how it is being done. Right? So the proposal of the ontology is that you could wrap these types of statements -- which are generally what the output of a text-mining program is -- with these types of epistemic evaluations. And I have another example. This is using a tool called MedScan, which is used by a company called Ariadne, which is now part of Elsevier as well. Similarly, they have a formal representation of the biology in this statement. And you can layer on -- so it says "we present evidence that." So the value is probable: if they're presenting evidence, they say it's probable. The source is the author, who is explicitly mentioned, and the basis is data, because they say that they have evidence. So you can imagine layering biological statements with this epistemic modality shell, so to speak. And this would give you a much better representation of biological texts than simply showing the triples -- that is my argument. Just to get back to where we were -- right. So if you look -- and this was a manual corpus study that I did -- if you look at how this hedging occurs, by far the most prevalent form is a clause of the form "these results suggest that." And this is from one paper: I think there were 42 instances where the hedging happened in this form, and this is all the content in those instances. So you have a matrix clause, a clause of this form, and these are all the values; there are no more in a particular paper. So it's very, very limited. So "these results suggest that" -- and instead of "suggest," there are different verbs that are used; there are specific verbs for indicating a lack of knowledge.
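Since most hedging lives in that one clause shape, a first-pass detector only needs to find the matrix clause and map its verb to a certainty level. This is a toy sketch: the verb-to-strength mapping below is my own illustrative stand-in for the talk's manually derived verb taxonomy.

```python
import re

# Illustrative mapping from matrix-clause verb to certainty level.
VERB_STRENGTH = {
    "show": "certain",
    "demonstrate": "certain",
    "indicate": "probable",
    "suggest": "probable",
}

# The dominant hedging shape from the corpus study:
# "these results <verb> that ..."
MATRIX = re.compile(
    r"\b(?:results|data|findings|figures?\s*\w*)\s+"
    r"(show|demonstrate|indicate|suggest)s?\s+that\b", re.I)

def hedge_strength(sentence: str):
    """Return the certainty level signalled by a matrix clause, if any."""
    m = MATRIX.search(sentence)
    return VERB_STRENGTH[m.group(1).lower()] if m else None

print(hedge_strength("These results suggest that miR-372 inhibits LATS2."))
# probable
print(hedge_strength("These data show that the domain is required."))
# certain
```

Because the form is so regular, even this narrow pattern recovers a large share of the hedged claims in a paper; extending the verb list is where the real corpus work goes.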
And this is based on a manual corpus analysis that I did; by now I've done ten full-text papers, about three thousand clauses, and gone through them, and by far the most prevalent -- you get 75 percent, pretty much. If you look simply at these clauses, this form, there are certain verbs used specifically for hypotheses, for probability, and for presumed true, and "show" and "demonstrate" are very strong among those. With [indiscernible] of Xerox, I did a small project to look at claim knowledge updates, and this was a very small, tentative proof of concept. Using their Xerox parser, she looked at detecting these claim knowledge updates where the authors present the CKU as factual: the strength is certainty, it's derived from experimental work so the basis is data, and the ownership is explicitly attributed to the author of the article. So if you do this, you can pull out the statements, the propositions inside the paper, that the authors attribute to their data, which gives you, I think, a good summary of what the authors are saying -- surely a lot better than just finding all the triples in the paper. All right. So the idea is that this will all lead to the creation of networks of claims and evidence, and I want to quickly introduce two examples of claim-evidence networks that we've been working on. This is a project called Data2Semantics that was funded by the Dutch government; we're currently working on it with the Free University (VU) in Holland, and the goal was to improve the speed of integration of medical research into medical practice. There what we're looking at is to go from patient records to clinical guidelines, and then from guidelines into how each guideline is motivated. So here we're going the other way around, as it were: this is really the claim, and here it has become fact.
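A minimal sketch of detecting these matrix clauses could look like the following. The verb lists are an illustrative subset grouped by the epistemic value they typically signal, not the full inventory from the corpus study:

```python
import re

# Verbs grouped by the epistemic value they typically signal
# (illustrative subset; "show"/"demonstrate" mark presumed true,
# per the corpus study discussed above).
HEDGE_VERBS = {
    "probable": ["suggest", "indicate", "imply"],
    "possible": ["hypothesize", "speculate", "propose"],
    "assumed_true": ["show", "demonstrate", "reveal"],
}

def classify_matrix_clause(sentence):
    """Look for a clause of the form 'these results <verb> that ...'
    and return the epistemic value its verb signals, else None."""
    for value, verbs in HEDGE_VERBS.items():
        pattern = r"\b(?:results?|data|findings?|we)\s+(?:%s)s?\s+that\b" % "|".join(verbs)
        if re.search(pattern, sentence, re.IGNORECASE):
            return value
    return None

print(classify_matrix_clause("These results suggest that miR-372 inhibits LATS2."))
# probable
```

A real system would of course parse rather than pattern-match, but the narrowness of the matrix-clause inventory is exactly what makes even this crude approach plausible.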
But the point is that the facts are often used downstream in a medical process, in patient records and treatment decision trees and such. There's a great time lag between the two -- it can take ten years, or seven years, terrible time periods during which patients could have been treated, because the knowledge was there; it just hadn't gone through the system quickly enough. So this project is to link the patient data and diagnosis to the guideline recommendation, and then to ground the guideline recommendation in evidence, but also to go the other way around: to try to find such claims using some of these linguistic analysis tools, to see if we can pre-populate guidelines and then connect them to the patient records more quickly. There's another project with the University of Pittsburgh where we look at drug-drug interactions and define a model of drug-drug interactions. There's an underlying conceptual model that we're trying to detect, and then we automate the process and store it as linked data, where again, at the step where you're going from here to there, if you can distinguish between known drug-drug interactions and newly argued drug-drug interactions, you can really build improved systems. Okay. Now this is actually our work, Lucy's and mine. I didn't want to go into it too long because I wasn't sure to what extent people were aware of it, but I'll briefly discuss this project. >> Lucy Vanderwende: Please do, because -- >> Anita de Waard: Okay. Well, we can talk about this more later on -- okay. I just had this one little slide. So together with Lucy, and Kevin Cohen at Colorado, and I believe we're now at the stage where we're -- >> Lucy Vanderwende: Getting close. >> Anita de Waard: Getting close [laughter]. >> Lucy Vanderwende: Funding for the data, yeah. >> Anita de Waard: Yes. Yes, yes, yes. The goal is to make -- I'll preface this by something that's not on the slide.
So if you go back to the fact coming from the claims through the documents: conversely, you can say that a citation of a document presents a certain type of summary of the cited document. This is well known in the computational linguistics community -- there's a lot of work done on that by [indiscernible] and others -- and these are called citances. So the citing sentence of a paper says something about the cited paper, and you can consider the collection of sentences that cite a specific paper as a possible summary of that paper. The project that Lucy and I are engaged in is to build a corpus and run a challenge that has to do with summarization where you're looking at the citing text. So here are two citations -- the top ones there -- by other people, of this Voorhoeve paper, and we're trying to find what part of the original text is being cited, and specifically what aspect of the original text is being cited. So here you have, for instance, "Voorhoeve employed a novel strategy by combining miRNA vector library," etc., or "Agami and coworkers performed a cell based screen." These pertain to the method of the paper, so you can try to look for the most relevant sentence, in this case, or section of the paper. Similarly, other citations have to do with the results: "Voorhoeve et al. identified these microRNAs," "it was described as an oncogene," "they were found to permit proliferation," etc. These have to do with the experimental results, and this is a representation of that. Or you can look at the interpretation -- this is the clause I was talking about before: "through direct inhibition of LATS2, might reduce the selective pressure for p53 inactivation" -- so this has to do with the implication.
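As a toy sketch of grouping citances by facet, here is a keyword-scoring classifier. The cue words are my own illustrative guesses, not the project's annotation scheme, and a real system would use trained models rather than substring counts:

```python
# Cue words per facet (illustrative guesses only).
FACET_CUES = {
    "method": ["strategy", "screen", "assay", "performed", "employed"],
    "results": ["identified", "found", "observed", "measured"],
    "implication": ["might", "implies", "reduce the selective pressure"],
}

def facet_of(citance):
    """Assign a citance to the facet with the most cue hits."""
    scores = {
        facet: sum(cue in citance.lower() for cue in cues)
        for facet, cues in FACET_CUES.items()
    }
    return max(scores, key=scores.get)

citances = [
    "Voorhoeve et al. employed a novel strategy combining a miRNA vector library.",
    "These microRNAs were found to permit proliferation.",
    "This might reduce the selective pressure for p53 inactivation.",
]
print([facet_of(c) for c in citances])
# ['method', 'results', 'implication']
```

The grouped output is the raw material for the faceted summaries described next.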
And our premise is that you can build enhanced, enriched summaries of scientific publications by identifying these citations and then grouping them according to what we're calling the facet of the text: the method, the interpretation, the results, these types of things. That's what the project with Lucy is about, so if you want to hear more about it, I'm happy to talk more about it. All right. So lastly, let's get to data. I still have fifteen minutes, right? Okay, good, good. Great. So all of this in the end -- when you speak to scientists, they say, yeah, well, you're talking about linguistics, but really what we do has nothing to do with language. It's very interesting, and if you ever read it, I highly recommend Charles Bazerman's Shaping Written Knowledge. It's a beautiful book, and one of the things he says at the beginning is: I started all these interviews with scientists, and I said, you as a writer -- and he said, I'm not a writer, I just observe, you know. And he said, but you've written ten books and two hundred papers. Yeah, but I'm not a writer; I just put on paper what I see. So the self-perception of scientists is that they're not committing rhetoric and they're not persuading anybody; they just say what they observed, right? So this is very interesting. And then you show this to biologists, and they kind of chuckle and say, yeah, ha ha, but really I only look at the figures. Actually, if you do a user experiment and you take away all the text and only show them the figures, there's no way they know what the paper is about, right? In fact -- and there's been psychology research on this -- they simply parse the structure of the paper so quickly that they don't become aware of the fact that they're reading or looking through it, and it allows them to focus on the parts that are new to them.
They have a schema in their mind of what the paper will look like; they compare it with the known schema and they only pick up the differentials. That's why scientists can process -- any scientist, you give them a stack of papers and they'll sift through them in no time flat and pick out only the ones that are interesting or novel to them, because the schema is so embedded, and that's how we think anyway. Anyway, that's a little digression. So biologists will all say, well, none of that matters. All this talk is totally irrelevant. What I care about is the data. And so the current project that I'm working on with David is about data sharing in biology and why that is so important. I only started to realize this when I read a little bit about biology -- and I have a longer talk about this, but I'll zip through it right now because I'm also not really the one to talk about it. But essentially, what biologists are finding all over the place is that biology is really, really, really complicated. It's not like physics. In physics you can take something apart, study the parts, put it back together, and you have an understanding of the whole. But biology doesn't work that way, right? If you take a mouse apart, you can't put it back together. It just doesn't work. So for one thing there's inter-species variability. And within a species, one human is drastically different from another human, which we're finding out now of course with genetics and personalized medicine: the drug that works for me will not work for you, and how it all works is very different. The gene expression -- so what genes do and what they make -- is very different. They found, for instance, that depending on whether rats were just fed or not, the gene expression would vary by a factor of ten, essentially annihilating the results of scores of research. So any circumstance matters.
They never use female rats because they have menstrual cycles, which, as every mammal knows, can seriously influence your behavior and everything else. So there's a lot going on there. Then there's the whole microbiome theory, which you probably all know about: there is ten times more RNA in your body from nonhuman sources than from human sources. Every part of your body is a little ecosystem, living together with a whole bunch of germs, essentially, that work for us and we work for them, and it's all very symbiotic -- but you're not looking at one entity when you're looking at any creature. And in systems biology, the whole is more than the sum of its parts: if you take a cell out of its context and put it back, it's a totally different thing, right? And I think the other thing is, if you look at models and experiments, in every area of biology there are huge doubts about whether, if you're modeling a system, you are really modeling the thing that you're measuring. So if you put all this together, any representation that you make of any of these systems is going to come up short. And -- oh right, then there's dynamics. Life is not in equilibrium. Life is constantly -- thank you -- if you let everything degrade, if you let everything happen, it would all be over very quickly. There are constantly these systems that change; individuals evolve, species evolve, and you as an entity keep changing all the time. So any snapshot you take is incorrect anyway. In short, life is really complicated. And this is a picture that, I think it was Descartes who used it? Somebody tried to make a robot of a duck: essentially, you put food in one end and it produced poop at the other end. It was a hoax, of course.
And we still can't make a model of a duck, right? We can't even make something that eats and poops, let alone have it run on that. We really don't know how all the components work, and if we know a little part of one component, it's not going to work with the other ones. So the understanding that I finally had about biology was that reductionist science really doesn't work for living systems. You can't take them apart into pieces and expect to know anything about the whole. So one way this is being addressed in biology is with statistics. There are projects like the Human Microbiome Project Consortium, where they simply collect a whole bunch of measurements. They get a lot of money, they get 242 healthy adults sampled at 15 to 18 body sites up to three times -- you don't want to think about how invasive that is -- and then you have some data, right? And then again, large sample sizes. So this is another study where people just take a whole bunch of people, and maybe if you have enough data you can see something remotely sensible. So what you can think about, to help bring biology forward and to help improve the way reasoning occurs in science, is to enable what I call incidental collaboratories. Everybody knows the concept of a collaboratory; that's pretty familiar? In many areas of big science that has happened: one place, for instance, collects the data, another analyzes the data, a third works on the sample and changes the experiment and sends it back to lab one, etc. Or atmospheric observatories -- balloons over Greenland that you can access online. A lot of citizen science. There's data collected in one place and analyzed in a completely different place, and then somebody else altogether says something about the analysis. So you distribute the process of doing science.
To do that, you want to store data at the level of the experiment, because you want to combine experiments. You want to connect them; you want to allow analyses over similar experimental types and experiments done on similar things. It doesn't have to be in biology, of course, but you want to have something that you can actually compare, and put all these measurements together. And you want to keep it for a long time, because old records are in fact very important in a lot of this. In earth science, for instance, it's essential that you have old records and keep them, but this holds for biology as well: it's possible that evolutionary changes or other changes are really going to affect this. Plus, it gets you more data if you just store everything. Then there's the policy end of this: data management plans are becoming more and more important. So you need to have systems with gated access, where the researcher can say, this is my data, but I'm sharing it with these people, and I get to decide who gets to see it when. Now, if you look at a typical lab, essentially what happens is they have a fridge with unlabeled tubes in it -- this is, sorry, that's a mouse, and a little mouse's brain, and there are neurons being taken from this mouse's brain and put into a system. It's messy, right? There's blood and guts, literally, in these biology labs. And then how do they record this? I took this picture in December of last year; this is less than a month old, I believe. So you see here a printed-out page with some drawings next to it; mostly it's just written in the book. So every researcher -- and this is a very good lab, in America, a very highly funded, good lab -- stores it on paper. The metadata is simply stored on paper.
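The experiment-level storage with gated access described above can be sketched as a minimal record type. Every name here is my invention for illustration; a real system would need far richer metadata and real authentication:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ExperimentRecord:
    """One experiment, stored with owner-controlled access."""
    owner: str
    experiment_type: str          # lets you compare across similar experiments
    recorded_on: date             # keep old records: they stay valuable
    data: dict
    shared_with: set[str] = field(default_factory=set)

    def grant_access(self, researcher):
        """The owner decides who gets to see the data, and when."""
        self.shared_with.add(researcher)

    def readable_by(self, researcher):
        return researcher == self.owner or researcher in self.shared_with

rec = ExperimentRecord("alice", "patch-clamp", date(2013, 12, 1),
                       {"trace": [0.1, 0.2]})
rec.grant_access("bob")
print(rec.readable_by("bob"), rec.readable_by("carol"))  # True False
```

The design choice is the one argued for in the talk: sharing is opt-in per record, so the default answers the researchers' privacy objection while still making cross-experiment analysis possible for whoever is granted access.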
And often you'll see people get rid of all their iPhones and iPads and laptops -- get rid of all that -- and then they take their paper book and they start doing science. That's essentially what really happens now. And then there's a PI who wants to know what happened today, and the only thing he can actually access is the direct measurements, so he doesn't know any of the metadata, any of the surroundings, and can't do any analysis over that. And then you say, well, wouldn't it be better to store it electronically? Maybe then you could access other people's data as well, and advance science more quickly, because you wouldn't have to rely only on your own research and you wouldn't have to go into people's paper notebooks. But they say, well, our lab notebooks are all on paper. So something that we're working on, as a direction to go in, is this: all of these people have smartphones. They all have tablets. Just like in the grocery store, where they can scan a bar code to check whether it's the right brand of bread they were supposed to bring home -- because you can do that, there are tools for that -- there's not a lot like that right now in the lab. So having simple data-input apps that let you work in the lab I think is a good idea. Then they say, well, I'm really, really busy; I need to see a direct benefit from something I'm going to spend my time on. So what we're thinking is that it would be very good for these PIs to have some form of data manipulation where they can actually look at all the different experiments going on inside their labs, compare them, and draw conclusions over collections of them. One rebuttal -- and these are rebuttals against making this data available electronically -- is: I want things to be peer reviewed before I expose them.
I was actually talking to some people working at medical labs, and one of them said -- not just the privacy issues, those are there as well -- but he said, I'm uncomfortable with nobody looking at my data before I share it in some way. So one way this system might work is that if you enable researchers to allow access to their experimental data, you would perhaps first have it reviewed before it's exposed to the outside world. The other thing is, we asked these researchers, would you use somebody else's data? And they said, no, I really don't trust somebody else's data. Well, except maybe the guys I went to grad school with, or maybe their students, if they say they're good, right? So it's a very, very personal thing. The famous claim in this area is that you'd rather share your toothbrush than your data, right? So you want to get over this fear of cooties, or whatever. I firmly believe that if biology wants to get anywhere, if science wants to get anywhere, at some point you have to start putting together all of these experiments. You have to start making communal sense of all of it, and the only way to do that is to access each other's data and start, maybe, trusting it. But if you want to build a system for this, it's very important that you know who you're talking to: whose data is this -- and not just which lab, but which person. Is that the newbie who doesn't know anything, or is it the woman who's been doing it for twenty years already? And within every lab: was that the good batch of antibodies, or was that the week that nothing worked, that kind of thing? So that's also important.
And the big rebuttal, however, is: other people might scoop my discoveries if I make my data available. I think this is a societal, funding-body challenge: it's essential, if we want to get anywhere, that the reward system moves from a competitive system -- where one of you gets the prize for doing just a little bit better at creating this data -- to some sense of a shared mission. I was at a very interesting workshop in San Diego at the end of 2012, which Phil Bourne put together, on the virtual cell, and they were saying, well, you know where this works? When you want to go to Mars. We all want to go to Mars, so everybody throws in all their knowledge, because this is where we want to go. It doesn't matter who gets there first. It doesn't matter who gets credit for this nut or that bolt. We all have to get there together. That sense of a shared mission -- if we could create that, I think that's one of the essential things. >>: Phil Bourne is a good example, because with the Protein Data Bank, they changed the reward system: you get your structure into the shared database and you get credit for that. >> Anita de Waard: Right, exactly, exactly. So that's an example where he managed to really turn around the whole culture. And it takes things like that. So, just a moment more about why it's so hard to do this in biology. I was thinking about why it can be done in astronomy and big physics and not in biology, and these are just some of my ponderings. It's a small field, right? The size of things is human-size; it's all macro scale, compared to physics. And a scientist can work alone on an experiment, which is unthinkable in physics -- even twenty years ago, and certainly today.
And then you have the "king and subjects" culture, as it's called: you have the PI, and all the people at the court. So you have this very centralized, hierarchical structure. It's messy; it doesn't happen behind a terminal. And it's very competitive: you have people with similar skill sets vying for the same grants, as opposed to, say, at CERN, where there are only so many people who can program this or that, only so many who can actually run the algorithms, and other people who build the -- whatever it is, the magnetic detectors. So there you have very complementary skill sets, but here you have a lot of people who can all do the same thing, so it doesn't promote collaboration. But I think what would be really interesting, in the project that we're working on, is when you have these entities, chemical entities or biological entities -- the example here is antibodies; an antibody apparently looks like that. These antibodies are created by manufacturers, often drug companies or specific antibody manufacturers. They are then shipped by vendors all over the place, to many, many different labs, and the labs do their own analysis and write papers in which you can't even detect what the antibody was. Maryann Martone had a very interesting talk about how you can't find what the antibody was just from looking at the paper. So if you could track reagents, if you could track them starting at the source, you could figure out which antibody it was, which strain, even which batch. Then you could collect all that back up into an observational database. And then you could build what I'm thinking of as a virtual reagent spectrogram: this particular batch of antibodies, how was it used? What was the outcome? Can you concatenate that back? So it's like you're doing one huge experiment. The world is your lab, so to speak.
You send a handful of stuff into the world, and different people do very different things with it. But as long as they keep track of certain aspects of what they did, you can go back and connect it. So that's an idea we're working on. It's not enough just to upload the data. This is a data sheet, uploaded like that -- I believe this was from figshare. But to use it, obviously, you need to know what the data means, and all of you will be familiar with this kind of list. You need to know who made this, when, and why; how it was made exactly; what the tools, the antibodies, the strains were. What are the units? How precise is this? I thought this was a hilarious data sheet because it has -- one, two, three, four, five, six, seven, eight, nine, ten, 11, 12, 13, 14 -- okay, it has 14 digits. Does it really have 14 digits? And then there's a factor of ten to the minus 21st, to the minus 15th, to the what? So it can't be that they actually know this to 14 digits; it just happens to be how their spreadsheet was set up, right? But if you want to do anything with this data, you need to know how precise it is, and to what digit it's actually valid. That's pretty important. How many trials did you do? Was this the one trial that I happen to be looking at, or did you do 400? Can you give me the average? Generally, when you look at data, you can't tell this kind of thing. So it's important to open up the spreadsheet -- and I'm hoping, we'll be talking with Kirsten, and DataUp is a great tool -- but you need to get much more specific, and that's something we want to be working on. So, the new department that David started, which I've joined him on: this is a new part of Elsevier where we're looking at to what extent we as a publisher play a role in this whole space, the research data space.
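The 14-digit spreadsheet value above almost certainly overstates the real precision. As a small sketch of the kind of cleanup that matters here, this helper rounds a measurement to the number of significant figures the method actually supports (the function and its use are illustrative, not part of any tool mentioned in the talk):

```python
from math import floor, log10

def round_sig(x, sig):
    """Round x to `sig` significant figures."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))

# A 14-digit spreadsheet value, cut back to the three digits
# the measurement might actually support:
print(round_sig(1.2345678901234e-4, 3))  # 0.000123
```

Storing the stated precision alongside the value is what lets a reuser know to which digit the number is actually valid.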
Particularly, we're interested in helping increase the amount of data shared from the lab, to enable the incidental collaboratories we were talking about. We want to increase the value of the data; we want to increase the quality of the annotation and normalization and the amount of provenance that is stored with each datum, however you define that, to enable enhanced interoperability. And then, as Alex was mentioning, help measure and deliver credit for shared data -- for the researcher, the institute, the funding body -- and eventually help figure out how to make this sustainable. A lot of research databases are currently at the end of their grant life cycle and it's unclear where they're heading. So is there a business in there somewhere? Who are the players in that business, and how does it all work? Obviously the data needs to be available to academics, but is there a business model in there that might work -- either an open access business model, or talking to grant agencies, or commercial spin-offs that you could do? All of that we're looking at. Anyway, these are just the three parts that I talked about. I want to make clear that the first bit is mostly my thesis research, the second bit is what I've been largely working on with the labs, and the third part is what David and I are really now heading towards. So I'd be very interested in questions and thoughts and, you know, possible [inaudible]. All right. Thank you. Oh, here are the references. I'll go back. >>: [applause] >> Anita de Waard: Thanks. Okay. >>: Fabulous talk. >> Anita de Waard: [laughter] Okay.
>>: So, um, [inaudible] have also organized these big consortia for collecting data -- like there was ENCODE, the Encyclopedia of DNA Elements, where they measured a lot of functional elements of the genome and pooled the data together. And they also tried to address this competitiveness by imposing some sort of embargo period: you publish the data, and for the first six months you have exclusive use, or something. >> Anita de Waard: Ah, yes. >>: [inaudible] the data. >> Anita de Waard: This is ENCODE? >>: Yeah, ENCODE is one. >> Anita de Waard: I've heard about it; I haven't looked at it. It's really interesting how different things are in different fields, and how culturally driven it is. People point to genomics as having it all figured out, and then they also say, yeah, but their data sets are really simple. These are just snippets I've heard, but it'd be interesting to look at that. >>: Genomics is particularly a field that is so big that no one lab can -- it is impossible to do it by themselves. >> Anita de Waard: Yes. >>: And cancer is another example of this, kind of, cancer [inaudible]. So different centers basically sequence samples and then publish the data they gather. >> Anita de Waard: Yes. >>: [inaudible] >> Anita de Waard: So it'd be really interesting to look at exactly how that process happened, from not sharing data to sharing data, what the drivers were at the time, and whether that can be generalized and used in other fields. >>: ENCODE, I think, had some funding process where they give out huge amounts of money to different institutions -- so they give you money, but then you have to publish data in this consortium. >> Anita de Waard: Oh, I see. I see. So I think the funding agencies are absolutely an essential partner in this whole process.
You know, earlier this week I was at an earth science budget meeting where people are trying to connect some of the earth science data that's being generated, and it's very largely driven by the NSF, who are saying, look, enough already, just get all of the data connected. So I think funding agencies have a really, really big role in this, absolutely, but it's often a single program officer who can really make a difference there. It's very interesting to see the social processes that happen. But it'd be interesting to get some references. Yeah? >>: [inaudible] task, one slide experiment there. Do you have a sense of how frequent it is in the literature that you get the sort of distribution that you showed, where you actually have a citation context for the methodology as well as for the experiment as well as for the conclusions? Versus these citations all going in on one specific claim? >> Lucy Vanderwende: So I think that's something that will come out of more extensive data analysis, because we just did a few papers, just as a prototype. But as you said, we now have -- supposedly close to having -- funding to produce annotation for about 20 -- >> Anita de Waard: Yeah. >> Lucy Vanderwende: -- or 30 papers, with all of their citations. And I personally expect to see that a paper might first be cited for its results, and then, over time, be cited for different facets. That's what I expect to see. >> Anita de Waard: Yeah, I think there's a reason a paper is cited overall, and I agree with Lucy that this changes over time. And of course it depends on the paper, too; some papers are famous because of the method.
You had this machine learning paper where citing it essentially became a synonym for -- it was the first time that a particular technique was described in a paper, so the paper then comes to stand for that technique, even though the topic of the paper was maybe completely different. Over time you see it converging on this; it's shorthand for saying you're referring to that technique. In the case of Voorhoeve, for instance, it was partly the technique, but it was also their conclusion, so you do see those different citations. There is a sentence within which a paper is cited, so the sentence tells you something in general. The question is going to be whether you can link it back to a specific place in the paper, or just to an aspect; that's trickier, I think. >> Lucy Vanderwende: It's tricky, but I think it will be so much better, because now, if you have a citation to a paper and it looks interesting, where are you going to go? You're going to go to the abstract of that paper. But if the abstract of that paper doesn't cover the content of what the paper's being cited for, you're not going to pursue it -- whereas it might have been crucial. >> Anita de Waard: And I think another important point that you've always been making is that it's very healthy to see how the author actually formulated the claim, next to the way it's represented in the citing paper. And I've heard authors of highly cited papers complain, because they said, well, my paper's always cited as saying that, but I never said that. You often hear that: I never said that. So it would be good to see what the author did say, next to the citation, apart from everything else. >>: [inaudible] >> Anita de Waard: Yes, that's true [laughter]. Even if they did say it, they'll say they didn't say it. This is very true.
>> Lucy Vanderwende: And the other thing that I hope will come out of this work would be where the citing author refers to the work with different terminology than the cited paper, but then subsequently that terminology gets used in the rest of the literature; with a simple keyword search you'll never find the original. So hopefully you'll be able to associate additional terms with the original cited work. >> Anita de Waard: Other questions? >>: [inaudible] because not only do you have no idea where the information comes from or who wrote it -- >> Anita de Waard: Yes. Yes. Or how true it is, and what their agenda was, and all of that. Absolutely, yeah, that's a good point. And you really have to dig deep. I mean, I don't think there's any way that a specific statement in Wikipedia can be tracked back to a particular author. I don't know enough about it. >>: You can [inaudible]. >> Anita de Waard: Yeah. You can, I guess, yeah. >>: So I think I know your answer to this based on a couple of previous conversations, but given the sort of hedging context that you're talking about in claims, how do you feel about the nano publication movement in general, and the whole fact extraction as a way of representing knowledge? >> Anita de Waard: Right, right, right. I actually at one point gave a talk called "Why Triples Are Not Enough." So for one thing, I always balk at the term fact extraction. If you say claim detection, I'm a lot happier. And I've had many conversations [indiscernible] about this. Biology is now moving into "oh, well, let's just point them at the full-text article." This was at last year's CSHALS meeting -- it's a meeting we're having in Boston again in February, by the way, a big little plug: the Semantics in Healthcare and Life Sciences meeting. At that meeting I was presenting this idea of having these triples with an epistemic wrapper around them.
You know, so I was going more in the structured direction. And this is when [indiscernible] said, well, we can't do anything about attribution and figuring out how true something is, so we just point to the full-text paper, which I think is giving up a lot. So I do think it would be interesting to see if we could create that. If you look, for instance, at the headers of the sections in a Cell paper, those are often almost perfect triples. We did a very preliminary analysis of that -- I only looked at Cell, but I'm sure many other biology articles are the same -- of the headers of the experimental results sections in particular. They're little headers that say: what was the key finding of that experiment? So those are interesting places to start for extracting triples, because they will be claimed knowledge updates. These are not things that are presumed to be true; these are not things the author is citing. You don't call your paragraph, you know, "a new mechanism for blah-de-blah," or "H influences Y," or whatever, unless it's a claim you're making. That document structure has the purpose of saying: this is my claim, and then the paragraph argues that claim. So I think taking those as a first step, as a place to get the claimed knowledge updates, means you already know the attribution, and you know something about the epistemic value: it's presented as true, but it hasn't been shown, hasn't been validated, hasn't been hardened by citations, and there's data behind it, because otherwise they wouldn't dare call it a subject header, so to speak -- although the subject header can have hedging in it as well, and all of that.
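The header-as-claim idea just described could be sketched as a pattern filter that keeps only subject-verb-object headers. This is a minimal illustration, not an implementation of her Cell analysis; the verb list and example headers are invented (miR-372/LATS2 is used only as a plausible-looking example):

```python
import re

# Illustrative relation verbs that often appear in claim-like
# result-section headers; a real system would use a curated lexicon.
RELATION_VERBS = ["inhibits", "activates", "suppresses", "induces",
                  "regulates", "requires", "promotes", "binds"]

def header_as_triple(header):
    """Return a (subject, relation, object) triple if the header looks
    like a claim, else None. Purely pattern-based sketch."""
    for verb in RELATION_VERBS:
        m = re.match(r"^(.+?)\s+%s\s+(.+)$" % verb, header.strip())
        if m:
            return (m.group(1), verb, m.group(2))
    return None

# A claim-like header yields a triple; a generic header does not.
headers = ["miR-372 inhibits LATS2 expression", "Experimental Procedures"]
triples = [header_as_triple(h) for h in headers]
```

The point of the sketch is the asymmetry she notes: generic structural headers fall through the filter, while key-finding headers carry an extractable claimed knowledge update.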
So if you were to take a representation like that, and around the triple you would wrap something similar to our ORCA ontology -- there are lots of other ways to represent it -- then I think you could get a nice structural representation of the claims in the paper. So I'm not saying don't extract triples, but don't call them facts, you know. And it doesn't seem to me to be so hard to detect this hedging. There are very clear linguistic markers. The number of reporting verbs is very limited, and anybody with any kind of degree in computational linguistics can probably plug it in, you know. And this first attempt was really pretty good. All the things she pulled out, we validated with a domain scientist as relevant statements. She didn't get all of them; we didn't get all of them -- there might be linguistic things that we missed. But of course the ideal way you'd want to do this -- and I remember your talk, Lucy -- is when you have the author writing the paper. Now you have the triples. Now the author can give the validation: yes indeed, this is my triple. Now the author can have a claim evidence network representation of their own paper, and you can identify the entities. You can say, oh wait a minute, you didn't say which antibody strain -- here's the catalog, is it one of these ten? Oh, by the way, your reference has a misspelling. Which one is it? You know. So you can do all of that, and then I think make a nice network. >>: When people represent these things as facts, is it meant to be human readable or machine readable? >> Lucy Vanderwende: Where? >>: Where we -- so if you do these -- >> Anita de Waard: Oh. Nano publications are machine readable. That's mostly the point, at bottom -- yeah? >>: How do they do negation? >> Anita de Waard: They don't. [laughter] >>: A minimal case. >> Anita de Waard: You'd think so, yeah. Yeah. I think they throw it out if it has a negation, I believe. I don't think they [indiscernible]; they just get rid of it.
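The claim that hedging is detectable from a small closed class of markers can be sketched roughly like this. The marker lexicon and certainty labels below are a toy example of my own, not the Utrecht group's actual lexicon or the ORCA ontology's categories:

```python
# Toy epistemic-marker lexicon mapping surface cues to a rough
# certainty label; a real lexicon would be linguist-curated.
HEDGES = {"may": "possible", "might": "possible", "suggest": "suggested",
          "suggests": "suggested", "indicate": "suggested",
          "demonstrate": "asserted", "show": "asserted"}

def epistemic_value(sentence):
    """Return the weakest certainty label triggered in the sentence,
    defaulting to 'asserted' when no hedge is found."""
    order = ["possible", "suggested", "asserted"]  # weakest first
    found = [HEDGES[w] for w in sentence.lower().split() if w in HEDGES]
    if not found:
        return "asserted"
    return min(found, key=order.index)  # weakest marker wins
```

The design choice worth noting is the "weakest marker wins" rule: a sentence like "These data suggest that X may inhibit Y" stacks two hedges, and the wrapper should record the lower certainty rather than the higher one.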
I think I asked him this question not too long ago, and I think that was his answer. >>: His examples always come up being: I'm interested in X causes Y versus Z causes Y, so I can compare conflicting claims rather than comparing positive and negative claims. >> Anita de Waard: Right. >>: But if it's going to be machine readable, the machine might -- you know, want to be aware [inaudible] -- >> Anita de Waard: I can't hear you. >>: [inaudible] >> Anita de Waard: Because he doesn't know how to handle them, I think. Because he's not sure whether the negative goes with the claim or not. So if he comes across a negative, he's like: go away. What he wants is huge volumes, to make inferences that could not otherwise be made. That's what he wants. He wants to get to genes and proteins and relationships, and find relationships that otherwise couldn't be found. That's his goal. But -- >>: [inaudible] based on claims, and the idea was, if you take a computer science paper [indiscernible] okay, you understand, in many cases you understand [indiscernible] the less you understand [indiscernible] you start with a collection of claims, five, six claims [indiscernible]. This is what I've done, okay, this is what my paper's difference is, okay -- >> Anita de Waard: Yes. >>: -- and if you cannot write about it -- >> Anita de Waard: Yes. >>: -- your paper is about nothing. >> Anita de Waard: Yes. >>: And the idea was not only to ask them to provide [indiscernible] these claims, but also to relate the paper [indiscernible] claims. >> Anita de Waard: Yeah. >>: How the claims come about, how the investigation -- >> Anita de Waard: Yeah. >>: Okay. So they won't have to read the whole paper [indiscernible]; these claims together are the paper, so instead of spending a lot of time trying to understand what it is about, you just look at these claims and know what -- >> Anita de Waard: Right, right.
So I don't know if I ever showed you this ABCDE thing. It's a little LaTeX script; it was called ABCDE. B, C, D stood for background, contribution, discussion. This was meant for computer science conference proceedings: computer science papers at least have a background, they have the contribution that the authors made, and they have some discussion. This was the simplest format of a paper. People essentially just put tags around those parts. And then there was no abstract, but you could highlight certain sentences in those three parts as core sentences, and you could then of course pull out those sentences and they would form a little structured abstract. So if you have proceedings with, you know, a thousand posters, you could look at only the core contribution sentences, where the people said: we built an X, it did Y, right? Instead of "the semantic web was conceived by Tim Berners-Lee" -- like, yawn, you don't want to know. The background is still there, and maybe you want to throw the background together to see what it's about, or you can mine it, but you can also just look at the contribution. And we had all of that in the script. The A was for -- I think it was annotation. So the idea was a little set of these statements that you then wrap with some Dublin Core annotations and the entities; we thought references shouldn't be a section of the paper, they are just hyperlinks to other entities. And then the idea was you would link to other sections in other papers. >>: I think in our idea, the interesting part was that normally nobody [indiscernible] provides any annotations, but in this case they will have to, to get their paper accepted. >> Anita de Waard: Yeah. >>: It's in their utmost interest to write out good claims. >> Anita de Waard: Yeah. >>: [indiscernible] and the proceedings could publish these claims separately. It's a lot of literature. But as I said.
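The ABCDE idea of tagged zones with highlighted core sentences can be mimicked in a few lines. The zone names follow the B/C/D scheme described above, but the data-structure shape and the example sentences are my own sketch, not the actual LaTeX macros:

```python
# Each sentence carries its rhetorical zone and whether it was
# highlighted as a "core" sentence by the author.
paper = [
    ("background",   False, "Summarizing scientific text is an old problem."),
    ("contribution", True,  "We built a claim tagger for result sections."),
    ("contribution", False, "The tagger uses a lexicon of reporting verbs."),
    ("discussion",   True,  "Core claims could serve as structured abstracts."),
]

def structured_abstract(sentences, zone="contribution"):
    """Pull the highlighted core sentences of one rhetorical zone --
    e.g. the 'we built an X, it did Y' sentences of a proceedings."""
    return [text for z, core, text in sentences if z == zone and core]
```

Run over a thousand tagged posters, the default call would yield exactly the "we built an X, it did Y" list she describes, while `zone="background"` would give the mineable background instead.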
You need to [indiscernible]. >> Anita de Waard: Yes. Other questions? >>: So I'm kind of curious. First, you mentioned that the figure contains the data, the metadata. I'm also aware of instances where the figure actually contains much more detail of the conclusion than even the text, and that's in figures describing biological pathways. Often you have this graph with genes and gene products and how they connect to each other, with a lot more detail in the figure, and that's kind of what the paper is about, whereas the text is just some sort of sparse reference to some pieces, or the most interesting part. >> Anita de Waard: Yeah. >>: So, in terms of extracting that knowledge, it seems like it may not be that difficult -- we can probably do some OCR or something on the graph image, and then we can create a pathway [indiscernible] at least. How about all of this, including the other things that you mentioned, like the key-finding headers of the results sections -- >> Anita de Waard: Uh-huh. >>: All of those, from the data mining point of view, require the full text -- >> Anita de Waard: Uh-huh. >>: But a lot of this full text, especially in the biomed field, is [indiscernible]. So I'm just curious, what do you think will eventually -- maybe a change of publishing model or something -- make a lot of this kind of treasure be -- >> Anita de Waard: Right. Right. So the simple answer to that is, I believe all major commercial publishers' papers are now stored in PubMed? After -- help me out here -- after how much time? I believe after six months or something. I think it's six months. >>: [inaudible] six months. >> Anita de Waard: Yeah. Right, right. >>: In PubMed Central? >> Anita de Waard: Right. So you know, there's that.
For Elsevier, there are, we're -- >>: Also, some journals require a deposit of data sets, whether in the paper or in a public database, so that somebody can go back to [inaudible] -- >> Anita de Waard: Right. I think that if a pathway is a useful representation of the paper, it should not be supplied inside a figure, you know. We just need to move to a space where these are the entities, these are the relations, and you have some formal representation. BEL is an example. Ariadne is an example. There are other languages. KEGG, I think, publishes pathways like that. >>: [inaudible] that there were standard formats. >> Anita de Waard: Yeah. >>: The trouble is often that nowadays there is just a PDF with one image, but certainly, if the publisher insists, the author will probably be able to provide the formal style or something. >> Anita de Waard: Yes. Yes, yes. And I think we do need tools to make that very, very simple to do, and then require it. We were just talking about that. My research actually came out of a proposal to have exactly such a graphical representation of papers in Cell, where you have entities and their relationships: A inhibits B, you know, B might excite C, or whatever. To have that be a representation of the paper, and then have the relationships link into the text. The entities are largely known, but the relationships are what's being argued in the text -- whether it's a LATS inhibitor or not, you know. So you'd have that link to the right paragraph. And as I was mentioning, the subheadings of the experiments can often easily be represented in triples, but also in a graphical format. So I started out my project looking at that kind of representation -- I ended up in happy wanderings in linguistics. I do think that's where nano publications are very useful: with a little bit of epistemic wrapping around them, you could certainly do that.
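The graphical paper representation just described -- entities, argued relationships, and a link from each relationship back to the paragraph arguing it -- might be modeled minimally like this. The entity names, paragraph ids, and fields are invented for illustration; this is not BEL, Ariadne, or any KEGG format:

```python
from dataclasses import dataclass

@dataclass
class Relation:
    subject: str       # an entity, e.g. a gene product
    predicate: str     # e.g. "inhibits", "excites"
    obj: str
    paragraph_id: str  # link into the text where this is argued
    hedged: bool       # whether the arguing text hedges the claim

# A tiny claim evidence network for a hypothetical paper:
network = [
    Relation("A", "inhibits", "B", "results-p3", hedged=False),
    Relation("B", "excites",  "C", "results-p5", hedged=True),
]

def argued_in(network, paragraph_id):
    """All relationships argued in a given paragraph of the text."""
    return [r for r in network if r.paragraph_id == paragraph_id]
```

The key point the sketch captures is that entities are stable, while each edge carries both its textual anchor and its epistemic status, so the graph stays honest about which relationships are claims rather than facts.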
So you guys are closer than anybody to making a tool to allow that creation during author submission. Now, retroactively, yeah, you need to go through the PDFs and hire people to do it, or Mechanical Turk it, or whatever, to pull it out. I don't know. I'm not an expert on how to do that. Other questions? >>: So you imagine that they are required to put it in some kind of central -- the freely accessible portion of PubMed Central seems to be much smaller than the whole archive. Last time I checked, PubMed Central claims to have two million papers, but the ones freely available to download online are only about 300K, so -- >> Anita de Waard: I don't know how that works. >>: You don't know how to get access? >> Anita de Waard: I don't. I don't. >>: That may be what David was referring to in terms of the embargo period. Some publishers work differently in terms of the embargo, whether or not they trust PubMed Central to hold back on those embargoed items. Even open access publishers [indiscernible] tend to send all of their publications over to PubMed Central, and PubMed Central holds back on them before they're even available [indiscernible], for reasons I don't know. >> Anita de Waard: Yeah. >>: [inaudible] funding, so -- >>: So if it's funded by an agency [inaudible] -- >> Anita de Waard: Right, right. >>: If it's not funded by the agency, it might not be -- it's tricky, you know, it's tricky. >> Anita de Waard: Okay. I think that's it -- right? Thanks. Okay.