>> Lucy Vanderwende: Okay, so today it's my great pleasure to introduce Anita de Waard
from Elsevier. I know that you have a new title but your previous title was disruptive
technology which appealed to me very much.
>> Anita de Waard: Yes.
>> Lucy Vanderwende: It's just intriguing to me to be able to work on disruptive technologies
at Elsevier. I got to know Anita while we were slowly working on defining a shared task for
summarization of scientific literature. I come from the summarization world, and Anita
comes from a very deep understanding of scientific literature and really brings so much
expertise to this task that we're defining together.
Anita's background is varied. She first worked in experimental physics, and that's all I can say
on the topic; I won't try to say anything more. But you have worked as a publisher of physics and
other literature since 1988, which gives you a lot of background on this topic, and today you're
going to be talking about supporting scientific sense-making, and I'm very excited to introduce
Anita. Thank you.
>> Anita de Waard: Thank you. Wow, beautiful introduction, thank you Lucy. And thank you
so much for having me over. I think this has been in the works for a while and I'm really really
happy to now be here with David.
Okay. So do I need to wait at all for any online --
>>: Nope.
>> Anita de Waard: -- people or just venture onwards? All right.
>>: Would you like questions throughout or do you want?
>> Anita de Waard: That's fine with me. Please do. Because I'm touching on a bunch of
different topics and otherwise people might lose their thread -- the thought thread. So I really
want to talk about three main topics. But hopefully it's all one coherent story, at least in my
head it is so if that doesn't come through please ask me how are these topics related.
I'd like to tell you a little bit about a model of scientific sense-making that I've developed over
the years which I call stories that persuade with data. So essentially what scientists do when
they write papers is they -- they write stories which are intended to persuade and they do so
using data. So I'll explain the different components of that.
I'll talk a little bit about the work that I'm doing with the University of Utrecht linguistics
group to analyze the -- the minutiae of scientific discourse, and in particular looking at verb
tense.
A lot of this work, to me, goes into formulating a way to view collections of scientific
reports -- scientific documents, scientific papers -- and kind of look over them as what I'm
calling claim-evidence networks. It's called that in the literature, although it's also called
different things. But essentially it's about finding ways to make sense of large collections of
scientific papers as a representation in claim-evidence networks.
And I'll talk a bit about an essential component to that, which touches on the project that
Lucy was talking about: the -- the topic of hedging, of wrapping your claims into a certain
linguistic representation of uncertainty. And I'll briefly touch on some projects that we've been
involved in to create claim-evidence networks. And lastly, I'll -- I'll talk for a bit about data.
That is how most scientific arguments are validated, how they're motivated: with data.
I'll say a little bit about the little understanding that I have of biology, which is that it's
really, really, really complicated, and just mention a couple of ways in which I've understood it
to be. So I'm not a biologist, but I'll tell you a little bit about that and some thoughts on
collecting biological experiments into collaboratories.
So these are sort of three headers, but like I said, hopefully they interconnect, and I'll start
with the first one. So if you look at a scientific paper and you start reading it -- and actually
a large part of my motivation to look into scientific papers was that I was working with somebody
called Kierz [phonetic] a long time ago, and he said, well, you know, scientific papers are
essentially just stories. They're fairy tales. Once upon a time there was an experiment, there
was a research question, and the research question needed answering, and so the research
question knocked upon a door.
And I got a grant in 2006 from the Dutch NWO, the foundation for scientific research, to really
investigate the proposition of what is the structure of scientific papers. And I started reading
a lot about narrative theory, and particularly the concept of the story grammar, which came out
of AI in the 70s. I don't know how many of you know about it; maybe Lucy knows about it, but you
know, Rumelhart, Thorndyke, these kinds of folks.
And then the very influential work by Propp in Russia, who has models of fairy tales, and
essentially they all come up with this very same setting for the story of a fairy tale. And --
and this is only the beginning of it, it goes on, there's more on the -- on the page there, but
generally, you start with some sort of setting, then you have a theme, then you have a series
of episodes -- and in stories or fairy tales these are generally between two and six episodes --
and then there's some resolution at the end.
So from -- anybody not know Goldilocks? So Goldilocks: a girl walks into an empty house, the
bears have just left. She tastes the porridge, she eats the little bear's porridge, she tries the
chairs, she breaks the little bear's chair. She goes upstairs and ends up falling asleep in the
little bear's bed. And then the bears come home, and they're like ah, who ate my porridge?
Ah, who broke my chair ah who's sleeping in my bed? And she runs out.
So it's typical fairy tale. And I think what's really an interesting exercise if you think about the
fairy tale and you think about knowledge representation what would it look like if you were to
represent Goldilocks in a database?
So take a moment and imagine the database schema, right? It's not self-evident -- unless somebody immediately can picture that. But it's not self-evident because there is a
point to the sequence. There is a point to the fact that you see this empty house, this little
innocent girl. Then you have three episodes that occur -- the eating, the sitting, the
sleeping -- it kind of all piles on top of each other, and at the end when the bears
come, it's dramatic, you know, if she just came in and ate the porridge and the bears came
home, it wouldn't be such a big deal, but it's the porridge and the chairs and the bed that
makes it kind -- it kind of adds to the suspense and it makes -- it's very hard to say what the
story is really about, but there is something exciting to it that humans at every level grasp.
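One way to picture that database-schema exercise is a minimal story-grammar layout in the Rumelhart/Thorndyke spirit. This is only a sketch; the field names are my own, not from any published story-grammar formalism:

```python
from dataclasses import dataclass

# A minimal story-grammar layout (setting -> theme -> 2..6 episodes ->
# resolution). The key point from the talk: episode ORDER carries meaning,
# so episodes are an ordered list, not an unordered set of facts.

@dataclass
class Episode:
    attempt: str   # what the protagonist tries
    outcome: str   # what happens as a result

@dataclass
class Story:
    setting: str
    theme: str
    episodes: list  # ordered; the suspense piles up in sequence
    resolution: str

goldilocks = Story(
    setting="an empty house; the bears have just left",
    theme="a girl looks for what fits her",
    episodes=[
        Episode("tastes the porridge", "eats the little bear's porridge"),
        Episode("tries the chairs", "breaks the little bear's chair"),
        Episode("tries the beds", "falls asleep in the little bear's bed"),
    ],
    resolution="the bears come home and she runs out",
)
assert 2 <= len(goldilocks.episodes) <= 6  # fairy tales: two to six episodes
```

The flat schema loses exactly what the talk emphasizes: the dramatic effect of the episodes stacking up in order.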
And so a paper is a lot like that. Right? So a research paper, you generally have a
background. The protagonist, I believe, is either the object of study or the research question.
So you can -- you can actually build a story frame around a research paper where you say it's
really the research question looking for an answer. And the research question then tests itself,
like a protagonist in a fairy tale. In scientific papers -- in biology; I'm talking now
specifically about biology, and I can talk a bit more about the difference between biology and
other subjects, but I've focused only on biology -- there are generally between two and six
experiments. And so these are all places where the research question does something with the
world, is somehow tested. It's like the research question tries all the different porridges and
finds which one it likes, right? So you have a
similar buildup, and you have these little mini-stories that are the little episodes that describe
every experiment.
And in a paper, as in a fairy tale, there is a story that's woven throughout. There's the
protagonist: the research question starts in one state and ends in another. And the answer at
the end is there because the protagonist has actually been changed by its interactions inside
these experimental setups. And a good convincing paper says not only did we find -- you know,
whatever -- overexpressed dAtx, but we also found this and we also found that. And it kind of
of whips you around the head with all these different little stories that happen that all add up in
the end. So -- so I think this is sort of an interesting thing to look at.
Another model that I think is fascinating to look at is the model of -- of Aristotelian rhetoric,
or Quintilian's rhetoric. Quintilian was an orator in Rome, and he -- he has a beautiful, very
clear description of what are -- what are the parts of a speech. And if you look at the different
parts of the rhetoric as both Aristotle and Quintilian put them forth, again you really recognize
them, both in scientific talks but also in scientific papers: how there are different parts that,
for instance, appeal to different aspects of the audience's brain, right?
So you have an appeal to ethos in the beginning; you want to establish credibility. Typically in
the beginning of a paper, you will just make it clear that you know what you're talking about by
dropping a lot of references, by using the right type of jargon; you will position yourself, but
in doing so also show that you know -- you know what you're talking about.
Then there is the appeal to logos, where you really make your logical case in the middle, and in
the end you have an appeal to pathos, so you have an appeal about why this truly matters and try
to -- try to appeal to the emotions of your reader as well.
So I think there are a lot of similarities, and at one point I thought, it's really funny, it's
kind of like with viruses. The goal of the paper is to be published, and it essentially uses the
author and the journal as the host. And essentially the format of the paper has co-evolved, and
it's in a predator-prey relationship with its reviewers, right? So in areas where there's like
heavy hunting, and there's a lot of pressure on the -- on the prey, there are very, very tight
guidelines -- hi -- for what a paper looks like.
If it's a very sparsely populated area, not such a busily occupied niche, like in paleontology,
papers have very, very different formats; but if you're in a high-throughput area like
bioinformatics or microbiology, you have a very clear way in which the paper must be. And like
Nature, you know -- only the speediest gazelles get through all the hoops that they need to jump
through to get into Nature. So the format is much stricter in these kinds of very aggressive
fields, which is kind of interesting.
And I think essentially the formats of these stories and the formats of these narratives -- I
think there's something innately human about them. We have always been telling stories in this
format, and I'll get back to one -- one component of that a little later on.
So, data. I'll talk more about data later on, but I just think it's fascinating that in biology
generally -- I tried for a while to make argumentative representations of biological text, but it
always ended up that in the end, what was the way to convince somebody? It was figure 2A. Right?
It's always figure 2A.
So there's some statement, and the statement -- like here, dAtx-1 contains the AXH domain but
not a polyQ tract -- that's the main claim. What's your evidence? Figure 1A. Right? Or as
figures 2A and 2C show -- again, claim. This is necessary. This shows that this is expressed --
figure 3A, right? So there's -- essentially you jump to a non-textual representation. The
argumentation turns out to be very simple. It's always figure 2A, or data not shown, which is
also a great one. It's really great. There are many, many claims in biology that are motivated
by data not shown, if you look through the literature.
But the point is -- I suppose the idea is, if I could show you what it actually looks like. So
you're jumping to a representation of the knowledge that's outside the text itself. I guess
that's really the point I wanted to make, and that's really, really key in biological
argumentation.
So, I'll talk a little bit about my more fine-grained work. I was interested in this story
grammar, but then I thought, yeah, but there's so much happening inside a sentence; and coming
from physics, and not knowing anything about linguistics, I was amazed at what happens inside
even a single sentence.
I mean -- I should have said this beforehand, but my goal was to represent scientific papers in
some way that would be more easily digestible by computers, that would be more easily
summarized, that could be more easily concatenated, exchanged, etc. And I
foolishly thought, well, you know you just get the nouns and the verbs like many people do.
And then I started to realize that it's not that simple.
And in fact, as I went on, my -- my own units of study, the units of discourse I was looking at,
became smaller and smaller and ended up at the clause. So they didn't become as small as a word,
or a phrase, but ended up at the clause. And here's my tiny little "in defense of the clause."
I think when you look at -- oh, let me go back for a second. These are a couple of sentences
from a paper that I've analyzed, from Voorhoeve in 2006. These are four sentences from that
text -- contiguous sentences. But if you look, you can see that there's more than one thought
unit per sentence. You see that the verb tense changes, and I'll get to that in more detail in a
moment.
And there are all kinds of things happening. There are attributions, there are actions, there
are propositions, all within the sentence. But if you look at the clause level, I think it
already starts to make a lot more sense.
So this is from -- I don't even know what linguistics textbook I got this from, but generally
the theme, the main idea -- you can probably name the author, probably Halliday [phonetic] and
that whole area, right? SFG, probably, I think?
>>: Yeah, it's Halliday --
>> Anita de Waard: Right. Right. I should have put the reference there, sorry. But for those of
you -- I can dig it up, or Lucy could probably tell you more about this. But anyway -- who said
that essentially in a sentence the head is the premise, the motivation, and the attribution, and
in scientific writing it's often a matrix clause. So this is a clause of the form these results
show that, and then you get the rest of the sentence. Or, you know, Vanderwende et al. have
demonstrated that, and then you get whatever follows the that, right?
So you see that here, typically, you have the premise or the motivation, like in sentence two:
to shed more light on this aspect -- so this is a goal clause. Why did we do this thing? And
then at the end of the sentence, you have the interpretation and the elaboration and the
attribution. One of the reasons that references come at the end of the sentence is that that's
often where we give attribution -- "said Lucy," you know, that kind of thing.
And in the middle, what -- what you generally find is your main biological statement; your main
assertion happens in the middle of this. And these are just four examples, but you do find it a
lot.
What I've done specifically is to look at these clauses and assign specific clause types to
them. I made a little taxonomy of different clause types and looked at how these clauses are
linguistically similar. And I'll show you a little bit about verb tense in a moment, but just to
show you: you have these two to-infinitive clauses, which are goal clauses that you see a lot.
Generally they're at the beginning of the sentence. They're often in the beginning of the paper.
They indicate the -- the motivation behind the experiment.
There are typical regulatory clauses, and again, I'll get back to those in a moment because
they're incredibly important for sense-making: our results indicate that, or figure 4A shows
that. So these have a lot to do with attribution -- how do I know what I'm about to tell you?
And then there are -- oh, right, then there are methods, of course, which is what we did. This
is where you see the first-person pronouns, always in the plural in biology -- you can never say
I, it's always we -- but it's what actually took place in the lab. And they're -- they're
generally in the past tense or in a gerund form, like you see in sentence four.
And then I'm differentiating between results and implications, and for those of you who are
aware of Gully Burns' work -- anybody? -- so his KEfED model, I think he makes a good
distinction between observational assertions and interpretational assertions. The observational
assertions are what you directly see, and the interpretational assertions are what you think
that means, and I think you can find those quite clearly.
So, what I essentially did was look at sentences -- these are different sentences from the same
paper -- parse them into clauses, and look at the verb tense. Again, these are contiguous
sentences from the Voorhoeve paper, and I gave them clause types from this little taxonomy I
came up with. And essentially what you can see -- or the way that I see it -- is that there are
two types of knowledge that are conveyed, and in the text you're constantly going back and forth
between these two types of knowledge.
One part is the conceptual knowledge. So these are statements about how the universe is
perceived to be, like seminomas and the EC component of nonseminomas share features with ES
cells. These are things that are presumed to be true and presumed to be known; they don't need
any attribution. And like, at the bottom, miR-371 expression is a selective tumorigenesis -- if
you ignore the rest of the sentence, that's again a statement about the conceptual realm. Right.
And then there is the other type of knowledge, which is the experimental knowledge. So this is
we did, we tested, was undetectable -- things that happened or were directly perceived by the
authors.
And then there are specific -- oops -- there are specific sentences to go from one realm to the
other. So if we take the second sentence but we turn the picture on its head -- and I'll go
through this quickly -- essentially you can see how these two realms differ: the conceptual
sentences are all in the present tense and the experimental sentences are all in the past tense.
And that is, I believe, cognitively also how we keep them apart.
And we did a small user experiment where we gave people result sentences and fact sentences, and
we changed the tense, and they would interpret a result as a fact and a fact as a result if we
changed the tenses. Quite obvious.
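The tense cue described here can be caricatured in a few lines of code. This is only a heuristic sketch of the idea (the cue word lists are my own guesses, not the Utrecht group's actual analysis):

```python
import re

# Heuristic: experimental clauses report what the authors did or saw,
# marked by first-person plural + past tense ("we tested", "was undetectable");
# conceptual clauses state presumed truths in the present tense
# ("small RNAs regulate ...", "seminomas share features with ...").

PAST_CUES = re.compile(r"\b(we\s+\w+ed|was|were|did|tested|generated|found)\b", re.I)
PRESENT_CUES = re.compile(r"\b(is|are|regulate[s]?|share[s]?|contain[s]?|suggest[s]?)\b", re.I)

def classify_clause(clause: str) -> str:
    """Label a clause 'experimental' (past, directly perceived) or
    'conceptual' (present, a statement about the field)."""
    if PAST_CUES.search(clause):
        return "experimental"
    if PRESENT_CUES.search(clause):
        return "conceptual"
    return "unknown"

print(classify_clause("We tested miR-372 expression in these cells"))
print(classify_clause("Small RNAs regulate gene expression"))
```

A real system would of course use a parser and proper tense tagging rather than word lists, but even this crude cue separates the two realms on many biology clauses.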
So the tense indicates whether this was something that was experienced directly by the
researcher -- so it was an observational assertion, something they either did or perceived -- or
whether it's something that pertains to this conceptual representation of the field. And then
there are specific verb forms, like the to-infinitive, which generally gets you from the
conceptual to the experimental.
So this is the case: to investigate this matter, we did this. Past tense, right? And then you
can go back, and then generally you have a matrix clause, which can be with a gerund or -- or a
present tense: these results suggest that. And then you go back up into the conceptual realm.
And essentially, these little cycles happen over and over and over again inside biological
texts. They also happen at a macro scale: the introduction is generally more conceptual, the
methods and results obviously are more experimental, and the discussion is more conceptual
again. So they happen at a macro level, but they also really happen locally. Here we have, I
believe -- yeah, three sentences, and within the span of those sentences -- you can even have
one sentence in which this occurs. So these little cycles -- taking your conceptual statement,
dipping it into experimental experience, and going back up to make your conceptual claim --
happen over and over again in biological text.
And what's kind of funny -- I talked about this with Ed Hovy, in fact, and I said, you know, is
there any way we can figure out whether this is, you know, deeply, innately true? Is there
something that says that conceptual things are always in the present and experiential things are
always in the past? And he said, if you wanted to make a real statement about that, you'd have
to find some other type of corpus that would also have different tense uses, and then maybe you
could say that yes, this is how people think, or something.
So, I've only dipped my toe into this a very, very small amount, but in fact, if you look at
mythology, it's very similar. So you have facts in the eternal present -- it's called the gnomic
present, in fact; in classical Greek there's a specific present tense that has to do with the
qualities of the gods. So, Hera, queen of the immortals is she, she is sister of loud-thundering
Zeus. This is a statement about how concepts are arranged.
Concepts in these cases are of course the gods and their properties, similar to small RNAs
regulate gene expression by mechanisms conserved across metazoans. Everybody knows this, right?
This is the representation of truth that we live with.
Then you have events that you've experienced, and they are in the simple past. Vehicle-treated
animals spent equivalent time investigating the juvenile; or, from the Odyssey, now the wooers
turned to dance and to gladsome song and made merry. These are experiences that happen on earth,
as it were. This is direct human experience. This is in our -- our event past.
You can have events with embedded facts. So, we generated cells expressing the -- maybe I should
point to this, I don't know if that's necessary. Let me see how that works. We generated BP
[phonetic] cells expressing the gene, which is only active when tamoxifen is added. So you have
a little embedded clause that refers to the realities that we all share.
And similarly, here's something about Hera, I think: she took her mighty spear, wherewith she
vanquishes the ranks of men, of warriors. So again, you have this knowledge about Hera, and
about Hera's characteristics, embedded within an event -- just like you have the knowledge about
the chimera gene, which is embedded within the event, the -- the actual experimental event that
you had.
Similarly, attribution -- it's kind of interesting -- is in the present perfect. If you look at
the verb tense for attribution sentences, like has been shown, that is typically in the present
perfect; and interestingly enough, this is Snorri Sturluson, the Norse sagas: I have had old
stories written down as I have heard them told. And then implications, generally, are again in
the present tense: it is said that whenever the camel sees a place where ashes have been
scattered, he wants to get revenge. This is from Mongolian mythology. And similarly here: these
results indicate that, this confers complete protection.
So you see also in the verb tense that after telling of the stories that happen sort of on the
ground we know more about what rules govern, you know, life on earth, so to speak.
So I won't go so far as to say that concepts are the mythology of scientists -- you could offend
a lot of scientists that way -- but I think cognitively, we have similar systems, where we talk
about conceptual realms that we presume to be true, and we're trying to make sense of how the
laws that govern our lives work by interpreting the experiences we have on earth. And you see
that in tense use quite, quite clearly.
All right. Anyway, how does all this relate to science publishing? So I think one small lesson
for those of you that are text miners: please, please, please do not throw away tense. Do not
ignore tense. And if it's at all possible to look at the clause level -- I think it is very
important -- do so. And I guess that's my take-home from this part. Tense matters a great deal.
So one thing that's very interesting is if you look across collections of documents. So now you
have this claim that has been made within a document: after you've taken your factual -- your
knowledge of facts -- and your questions, and then you described your experiments, now you have
a claim. Some sort of claim.
Like Voorhoeve et al. -- this paper I was just showing you -- they have a claim about microRNAs
373 and 372, and they say they neutralize the p53-mediated CDK inhibition, whatever that means,
but they don't know exactly what the process is. But they're saying possibly this happens
through direct inhibition of LATS2, some entity.
And possibly is a hedge. So, they cannot say we have shown unequivocally, because -- and there
are different reasons for that; I'll get to the dynamics of hedging in a moment. This is just to
explain what a hedge is, a typical hedge. What you see if you look through the literature when
this paper gets cited: Kloosterman and Plasterk -- for those of you who know Holland, Plasterk
later became our minister of education, I believe, so it's quite interesting; it's actually one
of his papers that he wrote on the topic before he was the minister of science and education.
So he refers to the Voorhoeve paper and he copies this possibly, right? This is in the same
year, 2006. Then there are a number of other papers. But in 2011, for instance, you see now this
has become a factual statement: two oncogenic microRNAs directly inhibit the expression of
LATS2, period. There's no possibly there anymore. And I happen to have looked into this case,
and there is no other paper investigating this relationship between these microRNAs and LATS2;
and in fact, I think we did a little study on this.
So, simply by being cited -- and hence the quote from Bruno Latour, from Laboratory Life: you
can transform fact into fiction by adding or subtracting references. If a claim gets cited
enough -- the claim is hedged at the beginning, but as it gets cited, this hedging erodes, and
simply by being cited, therefore, the claim turns into a fact. And Latour has this other
beautiful, beautiful quote, which is: a fact is a claim that has been agreed upon by a
committee. And so many, many scientific facts simply are based on being cited, and not on other
experiments validating a certain statement. All right. So that's just kind of the point.
Just a little excursion here about hedging -- and I'll zip over this; I have a whole bunch of
references for those of you who are interested in this. Why do authors hedge? This is from the
area of genre studies: authors hedge to make a claim that is pending acceptance in the
community. Creating a research space -- this is Swales -- essentially you need to show that
there is a problem, and insert yourself into the discourse. And you need hedging to do this. You
can't just go and say everybody else is crazy. You need some hedges; there's politeness, but
there's also kind of a knock on the door saying perhaps there might be an issue here.
And then Myers says it's the strongest claim a careful researcher can make: you cannot actually
really know that your conceptual interpretations of experimental findings are true. You simply
do not know this, so you need to hedge for it.
There's a lot of work in computational linguistics -- and you might know a lot about this -- on
finding hedging cues and speculative language and modality and negation; there are a number of
workshops. I think Light et al. is the first reference that's generally mentioned, on finding
speculative language; Wilbur et al. have five forms of hedging that they identify; and then
Thompson and Ananiadou also looked at levels of speculation and types of source of the evidence
and level of certainty.
And again, it was Ed Hovy who pointed me towards the similarity to sentiment detection, right?
So in product reviews and such, there's often a lot of sentiment detection: there's the holder
of the opinion, the strength, and the polarity. And what I really got from Ed -- and it was very
insightful to me -- was this concept of the hedge as a mathematical function that's acting on
the key assertion.
So essentially there is a claim, but it's wrapped around with the hedge. And so we looked --
well, I looked a little bit at how you could actually model this. And this is more or less
borrowed from sentiment detection, but using some of the taxonomies in the other papers. If you
have some sort of proposition, some sort of statement in biology, it's epistemically marked as
an evaluation where you have essentially three dimensions of -- of the epistemic modality, so
the truth value of it. One is the value, which can be assumed true, possible, probable, or
unknown. This is taken largely from Ananiadou's work; they have this four-part tiered scale, I
suppose. Then you could also have the negative ones, but the interesting thing is, in biology it
is largely unstated if something is not true. So you find very few instances of this.
Then there's the basis: you can say something is based on reasoning, or it's based on data, or
-- there should be a zero category with that as well -- it's just not indicated. And then
there's the source: you can say the speaker is the author, and it can be explicit, we found
that, or we believe that, that kind of thing; or it can be implicit, these results show. There
can be an explicit or implicit author, and again there should be a zero category added to that.
Oh, clicker. And actually, with Jodie Schneider from DERI -- Jodie made a very nice little
ontology of this, and the purpose is exactly what I showed you, but then, you know -- let me go
ahead one, nope, sorry, sorry. I wanted to show you this. We'll get to the others in a moment.
The purpose of doing that is what you can now do if you have this ontology: you can take
statements that are biological statements -- and this is with a BEL markup, for those of you
that know BEL, with Selventa; it's a project out of -- out of Boston -- and they have a formal
representation of biological statements. Let me see, here. So this is the BEL representation of
this statement.
But what BEL and other formal representations of biological knowledge never model is: what is
the modality of this? Right? And here it said that this happens, and this is possibly how it is
being done. Right?
So, so the proposal of the ontology is that you could wrap these types of statements, which are
generally what the output of a text-mining program is, with these types of epistemic
evaluations. And I have another example; this is using a tool called MedScan, that's used by a
company called Ariadne, which is now part of Elsevier as well. Similarly, they have a formal
representation of the biology in this statement. And you can -- you can throw the -- so it says
we present evidence that. So, a value that is probable: if they're presenting evidence, they say
it's probable. The source is the author, who is explicitly mentioned, and the basis is data,
because they say that they have evidence.
So you can imagine layering biological statements with this epistemic modality shell, so to
speak. And this would give you a much better representation of -- of biological texts than
simply showing the triples, is my argument.
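A possible shape for that epistemic shell, wrapping a mined statement in the three dimensions just described (value, basis, source); the field names and allowed values here are my own paraphrase of the scheme, not the actual ontology:

```python
from dataclasses import dataclass

# Wrap a mined biological statement in an epistemic "shell": the three
# dimensions from the talk. The value sets below are a paraphrase of the
# described scheme, not the published ontology.
VALUES = {"assumed_true", "probable", "possible", "unknown"}
BASES = {"reasoning", "data", "unstated"}
SOURCES = {"explicit_author", "implicit_author", "unstated"}

@dataclass
class EpistemicEvaluation:
    value: str   # truth value of the proposition
    basis: str   # what the claim rests on
    source: str  # who is asserting it

    def __post_init__(self):
        assert self.value in VALUES and self.basis in BASES and self.source in SOURCES

@dataclass
class AnnotatedStatement:
    proposition: str  # e.g. a mined triple, kept as plain text here
    evaluation: EpistemicEvaluation

# "We present evidence that ..." -> probable, data-based, explicit author.
s = AnnotatedStatement(
    proposition="miR-372/373 inhibit LATS2 expression",
    evaluation=EpistemicEvaluation("probable", "data", "explicit_author"),
)
print(s.evaluation.value)
```

The point of the shell is that downstream consumers can filter, say, only data-based, author-asserted claims instead of treating every extracted triple as equally factual.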
Just to get back to where we were -- right. So if you look -- and this was a manual corpus study
that I did -- at how this hedging occurs, by far the most prevalent form is a clause of the form
these results suggest that. And this is from one paper: I think there were 42 instances where
the hedging happened in this form, and this is all the content in those instances.
So you have a matrix clause, you have a clause of this form and, like these are all the values
there are no more in a particular paper. So it's, it's very very limited, so these results suggest
that, and instead of suggest, there are different verbs that are used there are specific verbs
for indicating a lack of knowledge.
And this is based on a manual corpus analysis that I did. By now I've done ten full-text
papers, about three thousand clauses, and gone through them, and by far the most prevalent --
you get 75 percent, pretty much. If you look simply at these clauses of this form, there are
certain verbs used specifically for hypotheses, for probability, and for presumed truth, and
"show" and "demonstrate" are very strong among those.
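Because the pattern is so limited, a detector for this single matrix-clause form can be sketched in a few lines. The verb lists below are invented stand-ins for the ones found in the corpus study:

```python
import re
from typing import Optional

# Invented stand-ins for the verb classes found in the corpus study.
HEDGE_VERBS = {
    "hypothesized": {"suggest", "imply"},
    "probable": {"indicate", "predict"},
    "presumed_true": {"show", "demonstrate", "reveal"},
}

# Matrix clause of the form "these results suggest that ..."
PATTERN = re.compile(
    r"\b(these|our|the)\s+(results?|data|findings?|observations?)\s+"
    r"(\w+)\s+that\b",
    re.IGNORECASE,
)

def classify_hedge(sentence: str) -> Optional[str]:
    """Return the epistemic value signalled by the matrix clause, if any."""
    match = PATTERN.search(sentence)
    if not match:
        return None
    verb = match.group(3).lower().rstrip("s")  # crude: "suggests" -> "suggest"
    for value, verbs in HEDGE_VERBS.items():
        if verb in verbs:
            return value
    return None
```

A real system would of course need a parser rather than a regex, but the sketch shows why 75 percent coverage from one clause type is plausible: the form is extremely stereotyped.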
With [indiscernible] of Xerox, I did a small project to look at claimed knowledge updates,
and this was a very small, tentative proof of concept. She looked at using the Xerox parser
to detect these claimed knowledge updates where the authors present the CKU as factual: the
strength is certainty, it's derived from experimental work, the basis is data, and the
ownership is explicitly attributed to the authors of the article. So if you do this, you can
pull out the statements, the propositions inside the paper, that the authors attribute to
their data, which gives you, I think, a good summary of what the authors are saying, and
surely a lot better than just finding all the triples in the paper.
All right. So the idea is that this will all lead to the creation of networks of claims and
evidence. And I want to quickly explain two examples of claim-evidence networks that we've
been working on.
This is a project called Data2Semantics, funded by the Dutch government, that we're currently
working on with the Free University in Holland. The goal is to improve the speed of
integration of medical research into medical practice. There, what we're looking at is to go
from patient records to clinical guidelines, and then from guidelines into how the guideline
is motivated.
So here we're going the other way around, as it were: this is really the claim, and here it
has become fact. But the point is that the facts are often used downstream in a medical
process, in patient records and treatment decision trees and such. There's a great time lag
between these two; it can take up to ten years, or seven years, terrible time periods during
which patients could have been treated, because the knowledge was there -- it just hadn't
gone through the system quickly enough.
So this project is to link the patient data and diagnosis to the guideline recommendation,
and then to base the guideline recommendation in evidence, but then also to go the other way
around: to try to find such claims using some of these linguistic analysis tools, to see if
we can pre-populate guidelines and then connect them to the patient records more quickly.
There's another project, with the University of Pittsburgh, where we look at drug-drug
interactions and define a model of drug-drug interactions. There's an underlying conceptual
model that we're trying to detect, so we can automate the process and store it as linked
data. Again, at the step where you're going from here to there, if you can distinguish
between known drug-drug interactions and newly argued drug-drug interactions, you can really
build improved systems.
Okay. Now this is actually our work, Lucy's and mine. I didn't want to go into it too long
because I wasn't sure to what extent people were aware of it, but I'll briefly discuss this
project.
>> Lucy Vanderwende: Please do, because --
>> Anita de Waard: Okay. Well, we can talk about this more later on. I just had this one
little slide. So this is together with Lucy and Kevin Cohen at Colorado, and I believe we're
now in a stage where we're --
>> Lucy Vanderwende: Getting close.
>> Anita de Waard: Getting close [laughter].
>> Lucy Vanderwende: Funding for the data, yeah.
>> Anita de Waard: Yes, yes, yes. The goal -- I'll preface this by something that's not on
the slide. So if you go back to the fact coming from the claims through the documents:
conversely, you can say that a citation of a document presents a certain type of summary of
the cited document. This is well known in the computational linguistics community; there's a
lot of work done on that by [indiscernible] and others, and these are called citances.
So the citing sentence of a paper says something about the cited paper, and you can consider
the collection of sentences that cite a specific paper as a possible summary of that paper.
And the project that Lucy and I are engaged in is to build a corpus and run a summarization
challenge where you're looking at the citing text.
So here are two citations, the top ones there, by other people, of this Voorhoeve paper, and
we're trying to find what part of the original text is being cited, and specifically what
aspect of the original text is being cited. So here you have, for instance, "Voorhoeve
employed a novel strategy by combining miRNA vector library", etc., or "Agami and coworkers
performed a cell-based screen". These pertain to the method of the paper, and so you can try
to look for the most relevant sentence, in this case, or section of the paper.
Similarly, other citations have to do with the results: "Voorhoeve et al. identified these
microRNAs", "it was described as an oncogene", "they were found to permit proliferation",
etc. So these have to do with the experimental results, and this is a representation of
that. Or you can look at the interpretation -- this is the clause I was talking about
before: "through direct inhibition of LATS2, [this] might reduce the selective pressure for
p53 inactivation". So this has to do with the implication.
And our premise is that you can build enhanced, enriched summaries of scientific publications
by identifying these citations and then grouping them according to what we're calling the
facet of the text: the method, the interpretation, the results, these types of things. And
that's what the project with Lucy is about, so if you want to hear more about it, I'm happy
to talk more about it.
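The grouping step itself is simple once citances carry facet labels. In the sketch below the sentences are paraphrases from the slide, and the labels are hand-assigned; in the actual task the facet would have to be predicted, not given:

```python
from collections import defaultdict

# Paraphrased citing sentences for one cited paper, with hand-assigned facets.
citances = [
    ("Voorhoeve employed a novel strategy combining a miRNA vector "
     "library with a genetic screen.", "method"),
    ("Agami and coworkers performed a cell-based screen.", "method"),
    ("Voorhoeve et al. identified these microRNAs as oncogenes.", "result"),
    ("Inhibition of LATS2 might reduce the selective pressure for "
     "p53 inactivation.", "implication"),
]

def faceted_summary(citances):
    """Group citing sentences by the facet of the cited paper they address."""
    grouped = defaultdict(list)
    for sentence, facet in citances:
        grouped[facet].append(sentence)
    return dict(grouped)
```

The output, read facet by facet, is the enriched summary: what the community cites this paper for, split into method, result, and implication.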
All right. So lastly, let's get to data. I still have fifteen minutes, right? Is that okay?
Okay, good, good. Great. So all of this in the end -- when you speak to scientists, they
say, yeah, well, you're talking about linguistics, but really we have nothing to do with
language. It's very interesting, and I highly recommend Charles Bazerman's Shaping Written
Knowledge. It's a beautiful book, and one of the things he says at the beginning is: I
started all these interviews with scientists, and I said, you as a writer -- and the
scientist said, I'm not a writer, I just observe. And he said, but you've written ten books
and two hundred papers. Yeah, but I'm not a writer; I just put on paper what I see. So the
self-perception of scientists is that they're not committing rhetoric and they're not
persuading anybody. They just say what they observed, right?
So this is very interesting. And then you show this to biologists and they kind of chuckle
and say, yeah, ha ha, but really I only look at the figures. Actually, if you do a user
experiment and you take away all the text and only show them the figures, there's no way they
know what the paper is about, right?
In fact -- and there's been psychology research on this -- they simply parse the structure so
quickly that they don't become aware of the fact that they're reading or looking through it,
and it allows them to focus on the parts that are new to them. They have a schema in their
mind of what the paper will look like; they compare it with the known schema and only pick up
the differentials. That's why any scientist, given a stack of papers, will sift through them
in no time flat and pick out only the ones that are interesting or novel to them. The schema
is that deeply embedded, and that's how we think anyway.
Anyway, that's a little digression. So biologists will all say, well, none of that matters;
all this talk is totally irrelevant; what I care about is the data. And so the current
project that I'm working on with David is about data sharing in biology and why that is so
important. I only started to realize this when I read a little bit about biology -- and I
have a longer talk about this, but I'll zip through it right now because I'm also not really
the one to talk about this.
But essentially, what biologists are finding all over the place is that biology is really,
really complicated. It's not like physics. In physics you can take something apart, study
the parts, put it back together, and you have an understanding of the whole. But biology
doesn't work that way, right? If you take a mouse apart, you can't put it back together. It
just doesn't work.
So for one thing there's inter-species variability. And within a species, one human is
drastically different from another human, which we're finding out now, of course, with
genetics and personalized medicine: the drug that works for me will not work for you, and how
it all works is very different.
Gene expression -- what genes do and what they make -- is very different. They found this at
a huge level: for instance, if rats were just fed or not, the gene expression would vary by a
factor of ten, essentially annihilating the results of scores of research.
So any circumstance matters. They never use female rats, because they have menstrual cycles,
which, as every mammal knows, can seriously influence your behavior and everything else. So
there's a lot going on there.
Then there's the whole microbiome theory, which you probably all know about: there is ten
times more RNA in your body from nonhuman sources than from human sources. Every part of
your body is a little ecosystem, living together with a whole bunch of germs, essentially,
that work for us and we work for them. It's all very symbiotic, but you're not looking at
one entity when you're looking at any creature. And in systems biology, the whole is more
than the sum of its parts: if you take a cell out of its context and put it back, it's a
totally different thing, right?
And I think the other thing is, if you look at models and experiments, in every area of
biology there are huge doubts about whether, if you're modeling a system, you're really
modeling the thing that you're measuring. So if you put all this together, any
representation that you make of any of these systems is going to come up short.
Oh right, and then there's dynamics. Life is not in equilibrium. If you let everything
degrade, if you let everything happen, it would all be over very quickly. These systems
constantly change: individuals evolve, species evolve, and you as an entity keep changing all
the time. So any snapshot you take is incorrect anyway.
In short, life is really complicated. And this is a picture -- I think it was Descartes who
used it? -- of somebody who tried to make a robot of a duck: you put food in one end and it
produced poop on the other end. It was a hoax, of course. And we still can't make a model of
a duck; we can't even have something that eats and poops, let alone have it run on that. We
really don't know how all the components work, and if we know a little part, that component
is not going to work with the other ones.
So the understanding that I finally had about biology was that reductionist science doesn't
work for living systems. You can't take it apart into pieces and expect to know anything
about the whole.
So one way this is being addressed in biology is using statistics. There are some projects,
like the Human Microbiome Project Consortium, where they simply get a whole bunch of
measurements. They get a lot of money; they get 242 healthy adults, sampled at 15 to 18 body
sites up to three times -- you don't want to think about how invasive that is -- and then you
have some data, right? And then again, the large sample size: this is another study where
people just take a whole bunch of people, and maybe if you have enough data, you can see
something remotely sensible.
So what you can think about, to help bring biology forward and to help improve the way
reasoning occurs in science, is to enable what I call incidental collaboratories.
So everybody knows the concept of a collaboratory -- that's pretty familiar? In many areas
of big science that has happened: one place, for instance, collects the data, another
analyzes the data, a third works on the sample, changes the experiment, and sends it back to
lab one, etc. Or atmospheric observatories, for instance: balloons over Greenland that you
can access online. A lot of citizen science: data is collected in one place, analyzed in a
completely different place, and then somebody else altogether says something about the
analysis. So you distribute the process of doing science.
To do that, you want to store data at the level of the experiment, because you want to
combine experiments. You want to connect it; you want to allow analyses over certain similar
experimental types and experiments done on similar things. It doesn't have to be in biology,
of course, but you want to have something that you can actually compare and put all these
measurements together with.
And you want to keep it for a long time, because in fact old records are very important in a
lot of this. In earth science, for instance, it's essential that you have old records and
keep them, but the same holds for biology: it's possible that evolutionary changes or other
changes are really going to affect this. Plus, it gets you more data if you just store
everything.
Also, there's the policy end of this. Data management plans are becoming more and more
important. So you need to have systems with gated access, where the researcher can say: this
is my data, but I'm sharing it with these people, and I get to decide who gets to see it
when.
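A bare-bones sketch of what such gated access could look like; the class and its fields are hypothetical, not any existing system:

```python
from datetime import date

class Dataset:
    """A dataset whose owner decides who sees it, and when."""

    def __init__(self, owner, embargo_until=None):
        self.owner = owner
        self.embargo_until = embargo_until  # no outside access before this date
        self.shared_with = set()            # researchers granted access

    def grant(self, researcher):
        """The owner explicitly shares the data with a named researcher."""
        self.shared_with.add(researcher)

    def can_view(self, researcher, today=None):
        today = today or date.today()
        if researcher == self.owner:
            return True                     # owners always see their own data
        if self.embargo_until and today < self.embargo_until:
            return False                    # gated until the owner's chosen date
        return researcher in self.shared_with
```

The design choice is that sharing is an explicit act by the owner, with an optional embargo, which matches the "I get to decide who sees it when" requirement.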
Now, if you look at a typical lab, essentially what happens is they have a fridge with
unlabeled tubes in it. This -- sorry -- that's a mouse, and a little mouse's brain, and
there are neurons being taken from this mouse's brain and put into a system. It's messy,
right? There's blood and guts, like literally, in these biology labs.
And then, how do they record this? I took this picture in December of last year; it's less
than a month old, I believe. So you see here a printed-out page with some drawings next to
it; mostly it's just written in the book.
So every researcher -- and this is a very good lab, in America, a very highly funded, good
lab -- stores it on paper. The metadata is simply stored on paper. And often you'll see
people get rid of all their iPhones and iPads and laptops, and then they take their paper
book and start doing science. That's essentially what really happens now.
And then there is a PI who wants to know what all happened today, and the only thing he can
actually access is the direct measurements. So he doesn't know any of the metadata, any of
the surroundings, and can't do any analysis over that.
And then you say, well, wouldn't it be better if you just stored it electronically? Maybe
then you could access other people's data as well, right, and advance science more quickly,
because you wouldn't have to rely only on your own research and you wouldn't have to go into
people's paper notebooks.
But they say, well, our lab notebooks are all on paper. So something that we're working on,
as a direction to go in, is this: all of these people have smartphones; they all have
tablets. Just like in the grocery store, where they can scan a bar code to check whether
it's the right brand of bread they have to bring home -- because you can do that, there are
tools for that -- there's not a lot like that right now in the lab. So having data input
apps that allow you to work in the lab is, I think, a good idea.
Then they say, well, I'm really, really busy; I need to see a direct benefit from something
I'm going to spend my time on. So what we're thinking is that it would be very good for
these PIs to have some form of data manipulation where they can actually look at all the
different experiments going on inside their labs, compare them, and draw conclusions over
collections of them.
One rebuttal -- these are rebuttals against making this data available electronically -- is:
I want things to be peer reviewed before I expose them. I was actually talking to some
people working at medical labs, and one said -- not just the privacy issues, those are there
as well -- I'm uncomfortable with nobody looking at my data before I share it in some way.
So one way this system might work is that if you enable researchers to allow access to their
experimental data, you would perhaps first have the data be reviewed before it's exposed to
the outside world.
And the other thing is, we asked these researchers, would you use somebody else's data? And
they said, no, I really don't trust somebody else's data. Well, except maybe the guys I went
to grad school with, or maybe their students if they say they're good, right? So it's a
very, very personal thing. The famous claim in this area is that you'd rather share your
toothbrush than your data, right?
So you want to get over this fear of cooties, or whatever. I firmly believe that if biology
wants to get anywhere, if science wants to get anywhere, at some point you have to start
putting together all of these experiments. You have to start making communal sense of all of
it. And the only way to do that is to access each other's data and to start, maybe, trusting
it.
So if you want to build a system for this, it's very important that you know who you're
talking to -- whose data is this? Not just which lab, but which person within the lab: is it
the newbie who doesn't know anything, or the woman who's been doing it for twenty years
already? And was that the good batch of antibodies, or was that the week that nothing
worked? So that's also important.
The big rebuttal, however, is: other people might scoop my discoveries if I make my data
available. And I think this is a societal, funding-body challenge: it's essential, if we
want to get anywhere, that the reward system moves from a competitive system, where one of
you gets the prize for doing just a little bit better at creating this data, to some sense of
a shared mission.
And I was at a very interesting workshop in San Diego at the end of 2012, which Phil Bourne
put together, on the virtual cell, and they were saying, well, you know where this works?
When you want to go to Mars. We all want to go to Mars, so everybody throws in all their
knowledge, because this is where we want to go. It doesn't matter who gets there first; it
doesn't matter who gets credit for this nut or that bolt; we all have to get there together.
So that sense of a shared mission -- if we could create that, I think that's one of the
essential things.
>>: Phil Bourne's a good example, because with the Protein Data Bank they changed the reward
system: you've got data into the shared database, and you get credit for that.
>> Anita de Waard: Right, exactly, exactly. So that's an example where he managed to really
turn around the whole culture around that. And it takes things like that.
So just a moment more about why it's so hard to do this in biology. I was thinking about why
it can be done in astronomy and big physics and not in biology, and these are just some of my
ponderings. It's a small field, right? The size of things is human size; it's all macro
scale, compared to physics. And a scientist can work alone on an experiment, which is
unthinkable in physics -- it was even twenty years ago, and certainly is today. And then you
have the "king and the subjects" culture, as it's called: you have the PI and all the people
at the court. So you have this very centralized, hierarchical structure.
It's messy; it doesn't happen behind a terminal. And it's very competitive: you have people
with similar skill sets who are vying for the same grants, as opposed to, say, at CERN, where
only so many people can program this or that, only so many can actually run the algorithms,
and other people build the magnetic detectors or whatever it is. There you have very
complementary skill sets, but here you have a lot of people who can all learn the same
things. So it doesn't promote collaboration.
But I think what would be really interesting, in the project that we're working on, is when
you have these entities, chemical entities or biological entities -- the example here is
antibodies, and an antibody apparently looks like that. These antibodies are actually
created by manufacturers, often drug companies or specific antibody manufacturers.
They are then shipped by vendors all over the place, and what happens right now is they're
shipped to many, many different labs; the labs do their own analysis and write papers, and
you can't even detect what the antibody was. Maryann Martone had a very interesting talk
about how you can't find what the antibody was from looking at the paper.
So if you could track reagents, starting at the source, you could figure out which antibody
it was -- which strain, which batch even. Then you could collect all that back up into an
observational database. And then you could build what I'm thinking of as a virtual reagent
spectrogram: this particular batch of antibodies, how was it used? What was the outcome?
Can you concatenate that back?
So it's like you're doing one huge experiment: the world is your lab, so to speak. You send
a handful of stuff into it, and different people do very different things with it. But as
long as they keep track of certain aspects of what they did with it, you can go back and
connect it. So that's kind of an idea that we're working on.
It's not enough just to upload the data. This is a data sheet, uploaded like that -- I
believe this one was from figshare. But to use it, obviously, you need to know what the data
means, and all of you will be familiar with this kind of list. You need to know who made
this, when, and why; how exactly it was made; what the tools, the antibodies, the strains
were; what the units are; how precise it is.
I thought this was a hilarious data sheet, because it has -- one, two, three, four, five,
six, seven, eight, nine, ten, 11, 12, 13, 14 -- okay, it has 14 digits. Does it really have
14 digits? And then there's a factor of ten to the minus 21st, to the minus 15th, to the
what? It can't be that they actually know this to 14 digits; it just happens to be how their
spreadsheet was set up, right? But if you want to do anything with this data, you need to
know how precise it is, and you need to know to what digit it's actually valid. So that's
pretty important. How many trials did you do? Was this the one trial that I happen to be
looking at, or did you do 400? Can you give me the average?
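The spreadsheet point is easy to demonstrate: trimming a 14-digit value to the precision the instrument actually justifies is one line. The numbers below are invented:

```python
def to_sig_figs(x: float, n: int) -> float:
    """Round x to n significant figures using the 'g' format type."""
    return float(f"{x:.{n}g}")

raw = 3.1415926535897e-18      # 14 digits straight from the spreadsheet
trimmed = to_sig_figs(raw, 3)  # what a 3-significant-figure measurement justifies
```

Without the metadata saying "valid to 3 significant figures", a reader cannot tell the two numbers apart in meaning, which is exactly the problem with the data sheet as uploaded.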
Generally, when you look at data, you can't tell this kind of thing. So it's important to
open up the spreadsheet -- and I'm hoping we'll be talking with Kirsten about DataUp, it's a
great tool -- but you need to get much more specific, and that's something we want to be
working on.
So the new department that David started, and that I've joined him on: this is a new part of
Elsevier where we're looking at to what extent we as a publisher play a role in this whole
space, the research data space. Particularly, we're interested in helping increase the
amount of data shared from the lab, and in enabling the kind of incidental collaboratories we
were talking about.
We want to increase the value of the data: increase the quality of the annotation and
normalization, and the amount of provenance that is stored with each datum, however you
define that, to enable enhanced interoperability.
And then, as Alex was mentioning, help measure and deliver credit for shared data -- for the
researcher, the institute, the funding body -- and eventually help figure out how to make
this sustainable. A lot of research databases are currently at the end of their grant life
cycle, and it's unclear where they're heading.
So is there a business in there somewhere? Who are the players in that business, and how
does it all work? It's obvious the data needs to be available to academics, but is there a
business model in there somewhere that might work -- an open access model, or talking to
grant agencies, or commercial spin-offs you could do? All of that we're looking at.
Anyway, these are just the three parts that I talked about. I want to make clear that the
first bit is mostly my thesis research, the second bit is what I've been largely working on
with labs, and the third part is what David and I are really now heading towards.
So I'd be very interested in questions and thoughts and, you know, possible [inaudible]. All
right. Thank you. Oh, here are the references. I'll go back.
>>: [applause]
>> Anita de Waard: Thanks. Okay.
>>: Fabulous talk.
>> Anita de Waard: [laughter] Okay.
>>: So, um, [inaudible] also organized this big consortium for collecting data -- there was
this ENCODE, the encyclopedia of DNA elements, where they measured a lot of functional
elements of the genome and pooled the data together. And they also tried to address this
competitiveness by imposing some sort of embargo period: you publish data, and for the first
six months you have exclusive use, or something.
>> Anita de Waard: Ah, yes.
>>: [inaudible] the data.
>> Anita de Waard: This is ENCODE?
>>: Yeah, ENCODE is one.
>> Anita de Waard: I've heard about it; I haven't looked at it. It's really interesting how
different things are in different fields, and how very cultural it is. People point to
genomics as having it all figured out, and then they also say, yeah, but their data sets are
really simple. These are just snippets I've heard, but it'd be interesting to look at that.
>>: Genomics is particularly a field that's so big that no one lab can -- it is impossible
to do it by themselves.
>> Anita de Waard: Yes.
>>: And cancer is another example of this, the cancer [inaudible]. So different centers
basically sequence samples and then publish the data they gather.
>> Anita de Waard: Yes.
>>: [inaudible]
>> Anita de Waard: So it'd be really interesting to look at exactly how that process
happened, from not sharing data to sharing data, and what the drivers were at the time, and
whether that can be generalized and used in other fields.
>>: With ENCODE, I think there was some funding process -- they give out huge amounts of
money to different institutions, so they give you money, but then you have to publish data in
this consortium.
>> Anita de Waard: Oh, I see, I see. So I think the funding agencies are absolutely an
essential partner in this whole process. Earlier this week I was at an earth science budget
meeting where people are trying to connect some of the earth science data that's being
generated, and it's very largely driven by the NSF, who are saying: look, enough already,
just get all of the data connected.
So I think funding agencies have a really, really big role in this, absolutely, but it's
often a single officer who can really make a difference there. It's very interesting to see
the social processes that happen. But it'd be interesting to get some references. Yeah?
>>: [inaudible] task -- the one-slide experiment there. Do you have a sense of how frequent
it is in the literature that you get the sort of distribution you showed, where you actually
have a citation context for the methodology as well as for the experiment as well as for the
implications? Or are these citations all going in on one specific claim?
>> Lucy Vanderwende: So I think that's something that will come out of more extensive data
analysis, because we just did a few papers, just as a prototype. But as you said, we now
have funding -- supposedly we're close to having funding -- to produce annotation for about
20 --
>> Anita de Waard: Yeah.
>> Lucy Vanderwende: -- or 30 papers, with all of their citations. And I personally expect
to see that a paper might first be cited for its results, and then over time it is cited for
different facets. That is what I expect to see.
>> Anita de Waard: Yeah, I think there's a reason a paper is cited overall, and I agree with
Lucy that this changes over time. And of course it depends on the paper, too; some papers
are famous because of the method. You had this machine learning paper where the citation
ended up being essentially a synonym -- it was the first time that a particular technique was
described in a paper, so the paper then comes to stand for that technique, even though the
topic of the paper was maybe completely different. So over time you see it converging on
this; it's a shorthand for saying that you're now referring to that technique.
In the case of Voorhoeve, for instance, it was partly the technique, but it was also their
conclusion, and so you do see those different citations. There is a sentence within which a
paper is cited, so the sentence tells you something in general. The question is going to be
whether you can link it back to a specific place in the paper, or just an aspect. That's
trickier, I think.
>> Lucy Vanderwende: It's tricky, but I think it will be worth so much, because now, if you
have a citation to a paper and it looks interesting, where are you going to go? You're going
to go to the abstract of that paper. But if the abstract of that paper doesn't cover the
content that the paper's being cited for, you're not going to pursue it, whereas it might
have been crucial.
>> Anita de Waard: And I think another important point that you've always been making is
that it's very healthy to see how the author actually formulated the claim next to the way it's
represented in the citing paper. And I've heard authors of highly cited papers complain,
saying, well, my paper's always cited as saying that, but I never said that. You hear that
often: I never said that. So it would be good to see what the author did say, next to the
citation, apart from everything else.
>>: [inaudible]
>> Anita de Waard: Yes that's true [laughter]. Even if they did say it, they said they didn't say
it. This is very true.
>> Lucy Vanderwende: And the other thing that I hope will come out of this work is the case
where the citing author refers to the work with different terminology than the cited paper, and
that terminology then gets used in the rest of the literature. With a simple keyword search
you'll never find the original. So hopefully you'll be able to associate additional terms with the
original cited work.
>> Anita de Waard: Other questions?
>>: [inaudible] because not only do you have no idea where the information comes from or
who wrote it --
>> Anita de Waard: Yes. Yes. Or how true it is, and what their agenda was, and all of that.
Absolutely, yeah, that's a good point. And you really have to dig deep. I mean, I don't think
there's any way that a specific statement in Wikipedia can be tracked back to a particular
author. I don't know enough about it.
>>: You can [inaudible].
>> Anita de Waard: Yeah. You can, I guess yeah.
>>: So I think I know your answer to this based on a couple of previous conversations, but
given the hedging context that you're talking about in claims, how do you feel about the
nanopublication movement in general, and the whole fact extraction approach as a way of
representing knowledge?
>> Anita de Waard: Right, right, right. I actually at one point gave a talk called "Why triples
are not enough." So for one thing, I always balk at the term fact extraction. If you say claim
detection, I'm a lot happier. But -- and I've had many conversations [indiscernible] about this
-- biology is now moving toward, oh well, let's just point them at the full-text article.
I actually -- and this was at last year's CSHALS meeting, the Conference on Semantics in
Healthcare and Life Sciences. It's a meeting we're having in Boston again in February, by the
way -- little plug. At that meeting I was presenting this idea of having these triples with an
epistemic wrapper around them, so I was going more in the structured direction. And this is
when [indiscernible] said, well, we can't do anything about attribution and figuring out how
true something is, so we just point to the full-text paper, which I think is giving up a lot.
So I do think it would be interesting to see if we could create that. I think if you look, for
instance, at the section headers in a Cell paper, those are often almost perfect triples. We did
a very preliminary analysis of that -- I only looked at Cell, but I'm sure many other biology
articles are the same -- and the headers of the experimental results sections in particular are
little headers that say: what was the key finding of that experiment?
So those are interesting places to start for extracting triples, because they will be claimed
knowledge updates. These are not things that are presumed to be true; these are not things
the author is citing.
You don't call your paragraph, you know, "a new mechanism for blah-de-blah," or "X
influences Y," or whatever, unless it's a claim you're making. That document structure has
the purpose of saying: this is my claim, and the paragraph is that claim.
So I think taking those as a first place to go for claimed knowledge updates means you
already know the attribution, and you know something about the epistemic value: the author
is presenting it as true, but it hasn't been shown, hasn't been validated, hasn't been hardened
by citations; and there's data behind it, because otherwise they wouldn't dare make it a
subject header, so to speak -- unless the subject header itself carries hedging and modality
and all of that.
So if you were to take a representation like that, and around the triple you would wrap
something similar to our ORCA ontology -- there are lots of other ways to represent it -- then I
think you could get a nice structured representation of the claims in the paper. So I'm not
saying don't extract triples; just don't call them facts, you know.
And it doesn't seem to me to be so hard to detect this hedging. There are very clear linguistic
markers. The number of reporting verbs is very limited, and anybody with any kind of degree
in computational linguistics can probably plug it in, you know.
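Because the marker vocabulary is so limited, a first pass really can be very simple. A toy sketch of such a detector follows; the word list is an illustrative sample, not a validated hedging lexicon.

```python
import re

# Toy hedge/reporting-verb detector. The marker list is only an
# illustrative sample, not a validated lexicon of hedging cues.
HEDGES = re.compile(
    r"\b(suggest|indicate|imply|may|might|could|appear|seem|propose|"
    r"likely|possibly|presumably)\w*\b",
    re.IGNORECASE,
)

def is_hedged(sentence: str) -> bool:
    """Return True if the sentence contains a hedging or reporting marker."""
    return bool(HEDGES.search(sentence))

print(is_hedged("These results suggest that miR-372 may suppress LATS2."))  # True
print(is_hedged("We measured luciferase activity after transfection."))     # False
```

A real system would add negation scope and syntactic context, but even a marker list like this separates hedged claims from flat reports of procedure surprisingly well.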
And this first attempt was really pretty good. All the things she pulled out, we validated with a
domain scientist as relevant statements. She didn't get all of them; we didn't get all of them.
There might be linguistic things that we missed. But of course the ideal way you'd want to do
this -- and I remember your talk, Lucy -- is when you have the author writing the paper.
Now you have the triples. Now the author can give the validation: yes, indeed, this is my
triple. Now the author can have a claim-evidence network representation of their own paper,
and you can identify the entities. You can say, oh, wait a minute, you didn't say which
antibody strain -- here's the catalog, is it one of these ten? Oh, by the way, your reference has
a misspelling -- which one is it? So you can do all of that, and then I think make a nice
network.
>>: When people represent these things as facts, is it meant to be human readable or
machine readable?
>> Lucy Vanderwende: Where?
>>: Where we -- so if you do these --
>> Anita de Waard: Oh. Nanopublications are machine readable; that's mostly the point --
yeah?
>>: How do they do negation?
>> Anita de Waard: They don't. [laughter]
>>: A minimal case.
>> Anita de Waard: You'd think so, yeah. Yeah. I think they throw it out if it has a negation, I
believe. I don't think they [indiscernible]; they just get rid of it. I think I asked him this
question not too long ago, and I think that was his answer.
>>: His examples always end up being: I'm interested in X causes Y versus Z causes Y, so I
can compare conflicting claims rather than compare positive and negative claims.
>> Anita de Waard: Right.
>>: But if it's going to be machine readable, the machine might, you know, want to be aware
[inaudible] --
>> Anita de Waard: I can't hear you.
>>: [inaudible]
>> Anita de Waard: Because he doesn't know how to handle them, I think. Because he's not
sure whether the negation goes with the claim or not. So if he comes across a negative, he's
like, go away. What he wants is huge volumes, to make inferences that could not otherwise
be made. He wants to get to genes and proteins and relationships, and find relationships that
otherwise couldn't be found. That's his goal. But --
>>: [inaudible] based on claims, and the idea was, if you take a computer science paper
[indiscernible] -- okay, in many cases you understand [indiscernible], the less you understand
[indiscernible] -- the start is a collection of claims, five, six claims [indiscernible]. This is what
I've done, okay; this is what my paper's difference is, okay --
>> Anita de Waard: Yes.
>>: -- and if you cannot write about it --
>> Anita de Waard: Yes.
>>: -- it's about -- your paper is about nothing.
>> Anita de Waard: Yes.
>>: And the idea was not only to ask authors to provide [indiscernible] these claims but also
[indiscernible] the paper [indiscernible] claims.
>> Anita de Waard: Yeah.
>>: How the claims come about, how the investigation --
>> Anita de Waard: Yeah.
>>: Okay. The thought was they won't have to give you the paper [indiscernible]; they also
want you to tie these claims together with the paper, so instead of spending a lot of time
trying to understand what it is about, you just look at these claims and know what --
>> Anita de Waard: Right, right. So I don't know if I ever showed you this ABCDE thing. It
was a little LaTeX style called ABCDE. The B, C, D were background, contribution,
discussion. We said -- this was meant for computer science conference proceedings --
computer science papers at least have a background, they have the contribution the authors
made, and they have some discussion. This was like the simplest format of a paper. People
essentially just put tags around those parts.
And then there was no abstract; instead, authors would highlight certain sentences in those
three parts as their core sentences, and you could then of course pull out those sentences
and they would form a little structured abstract.
So you could say, for instance, if you have proceedings with a thousand posters, you could
look at only the core contribution sentences, where people said: we built an X, and it did Y.
Instead of "the semantic web was conceived by Tim Berners-Lee" -- yawn, you don't want to
know. So there is still background, and maybe you want to pull the background together to
see what it's about, or you can mine it, but you can also look at just the contribution. And we
had a script for all of that.
The A was for -- I think it was annotation. So we thought this is just a little set of these
statements that you then wrap with some Dublin Core annotations and the entities. We
thought references shouldn't be listed at the end of the paper; they're just hyperlinks to other
entities. And then the idea was that you would link to sections in other papers.
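The ABCDE markup described above can be sketched roughly like this; the environment and command names are guesses at the flavor of such a style file, not its actual macros.

```latex
% Rough sketch of ABCDE-style markup; environment and command names
% are illustrative, not the actual macros of the original style file.
\begin{background}
  Prior work treats citation sentences as flat text.
\end{background}
\begin{contribution}
  \core{We built a claim extractor that links each citation to the
        claimed statement in the cited paper.}
\end{contribution}
\begin{discussion}
  \core{Linking citations to claims makes cited content searchable.}
\end{discussion}
% Pulling out the \core{...} sentences yields a small structured abstract;
% Dublin Core annotations (the A) and entity links (the E) wrap the whole.
```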
>>: I think in our idea, the interesting part was that normally nobody [indiscernible] provides
any annotations, but in this case they would have to, to get their paper accepted.
>> Anita de Waard: Yeah.
>>: It's in their utmost interest to write out good claims.
>> Anita de Waard: Yeah.
>>: [indiscernible] and proceedings, so publish these claims separately. It's a lot of
literature. But as I said, you need to [indiscernible].
>> Anita de Waard: Yes. Other questions?
>>: So I'm kind of curious. First, to begin with recognition: you mentioned that the figure
contains the data, the metadata. I'm actually aware of instances where the figure contains
much more detail of the conclusion than even the text, for instance figures describing a
biological pathway. Often you have this graph with gene products and how they connect to
each other, with a lot more detail in the figure, and that's kind of what the paper is about,
whereas the text is just some sparse reference to some pieces, or the most interesting part.
>> Anita de Waard: Yeah.
>>: So now it seems like, first of all, in terms of extracting that knowledge, the image
processing may not be that difficult -- we can probably do some OCR or something on the
graph image and then reconstruct the pathway [indiscernible], at least. How about all the
other things that you mentioned, like the key finding for each data section --
>> Anita de Waard: Uh-huh.
>>: So all of those, from the data mining point of view, require the full text --
>> Anita de Waard: Uh-huh.
>>: But a lot of this full text, especially in the biomed field, is [indiscernible]. So I'm just
curious what you think will eventually -- maybe a change of publishing model or something --
make a lot of this kind of treasure be --
>> Anita de Waard: Right, right. So the simple answer to that is, I believe all major
commercial publishers' papers are now stored in PubMed Central? After -- help me out here
-- after how much time? I believe after six months or something. I think it's six months.
>>: [inaudible] six months.
>> Anita de Waard: Yeah. Right, right.
>>: In PubMed Central?
>> Anita de Waard: Right. So you know, there's that. For Elsevier, we're --
>>: Also, some journals require a deposit of data sets, whether in the paper or in a public
database, so that somebody can go back to [inaudible] --
>> Anita de Waard: Right. I think that if a figure is a useful representation of a pathway in the
paper, the pathway should not be supplied only inside a figure, you know. We just need to
move to a space where you say: these are the entities, these are the relations, and you have
some formal representation. BEL is an example; Ariadne is an example. There are other
languages. KEGG, I think, publishes pathways like that.
>>: [inaudible] that there were standard formats.
>> Anita de Waard: Yeah.
>>: The trouble is that nowadays there is often just a PDF of one image, but certainly, I
mean, if the publisher insists, the author will probably be able to provide a structured format
or something.
>> Anita de Waard: Yes. Yes, yes. And I think we need tools to make that very, very simple
to do, and then require it. We were just talking about that. My research actually came out of
a proposal to have exactly such a graphical representation of papers in Cell, where you have
entities and their relationships -- A inhibits B, B might excite C, or whatever -- to have that be
a representation of the paper, and then have the relationships link into the text.
The entities are largely known, but the relationships are what's being argued in the text --
whether it's the LATS inhibitor or not, you know. So you'd have those link to the right
paragraph -- and, as I was mentioning, the subheadings of the experiments can often easily
be represented as triples, but also in a graphical format.
So I started out my project looking at that kind of representation -- I ended up in some happy
wanderings in linguistics, but that's where I started. I do think that's where nanopublications
are very useful: with a little bit of epistemic wrapping around them, you could certainly do
that.
And you guys are closer than anybody to making a tool that allows that creation during
author submission. Retroactively, yeah, you'd need to go through the PDFs and hire people
to do it, or Mechanical Turk it or whatever, to pull it out. I don't know; I'm not an expert in how
to do that. Other questions?
>>: So you imagine they are required to put it in PubMed Central, but the freely accessible
portion could be much smaller than the full archive. Last time I checked, PubMed Central
claims to have two million papers, but the ones freely available to download online are only
about 300K, so --
>> Anita de Waard: I don't know how that works.
>>: You don't know how to get access?
>> Anita de Waard: I don't. I don't.
>>: That may be what David was referring to in terms of the embargo period. Publishers
work differently in terms of the embargo, and in whether or not they trust PubMed Central to
hold back on those embargoed items. Even open access publishers [indiscernible], which
tend to send all of their publications over to PubMed Central -- PubMed Central holds back
on them before they're even available [indiscernible], for reasons I don't know.
>> Anita de Waard: Yeah.
>>: [inaudible] funding, so --
>>: So it's funded by an agency [inaudible] --
>> Anita de Waard: Right, right.
>>: If it's not funded by the agency, it might not be -- it's tricky, you know, it's tricky.
>> Anita de Waard: Okay. I think that's it -- right? Thanks. Okay.