>> Lucy Vanderwende: So good morning, everyone. Thank you for coming. It's my pleasure to
introduce Sophia Ananiadou. She'll be giving a talk that's titled Extracting Events of
Biomedical Relevance from Text.
So currently Sophia is the director of the National Centre for Text Mining, University of
Manchester, as well as a professor in computer science there.
I've had a long -- the pleasure of knowing Sophia for quite a long time, and the first BioNLP
workshop -- which was it in 2001 or 2000? I remember attending --
>> Sophia Ananiadou: '2.
>> Lucy Vanderwende: '2. I remember attending that workshop because I was really interested
in text mining for biomedical -- from biomedical text. And Sophia and Professor Tsujii both
cautioned me. They said: You shouldn't go into this unless you have a partner in biology. It is
just too -- too much otherwise.
And so I heeded their warning, but now I have partners, both at Microsoft and at University of
Washington. But it's really been a very interesting field, and I'm excited that Sophia will tell us
more about it. Thank you.
>> Sophia Ananiadou: Thanks, Lucy, for the introduction. So I will focus on only one area. I
mean, we're doing several things at NaCTeM. I'll just give you the first slide to
understand who we are and what we are doing.
NaCTeM resides at the University of Manchester, and, as Lucy said, you need to be collocated
with biology, so the place is Manchester Interdisciplinary Biocentre with mainly chemists,
biologists, and people like that who are actually quite interested in what we're doing.
So we started around 2004 and '5, and we focused predominantly on providing text mining
service solutions for the bio domain, the medical -- increasingly the medical bio domain.
Now we are a sustainable center. I have listed only two or three of our funders. We have also been
funded by industry and other types of funders.
So all the things I'm going to say today are part of the center, the work we have been doing,
but mostly I will focus on our most recent work on extracting events of biological relevance.
I'll explain what that is.
Why we're talking about events. In biomedicine we have what we call this fragmentation in
different types of specialisms, sub specialisms. There are different components aspects in
systems biology, or in systems medicine increasingly, that deal with chemistry, the biology,
medicine, and lots of -omics, different levels, from transcriptomics to genomics, metabolomics,
proteomics, et cetera.
So to deal with the various translational aspects for different
applications, one needs to go into a deeper type of analysis of text, not just of course
keyword extraction or association mining, but trying to find out more complex information, so
I'll discuss what is an event.
I will also place some emphasis in my talk on some applications which are related to
events. So I'll skip over the rest; the whole discussion is going to be about event extraction
and how this has been realized in different applications.
So in case you haven't heard this before -- I don't know about here, some of you of course are
natural language processing people, but maybe some of you are not -- if we take a simple
sentence of the kind we find in many papers in MEDLINE, like "expression of aurora B enhances
phosphorylation of S6K1 and 4E-BP1," which is a normal sentence, this sentence includes various
types of events. Some are complex, some are simple.
A simple event is of type expression, which is realized with the [inaudible] expression. It
has a three [inaudible] and the theme, the object, is aurora B. You have another event dealing
with phosphorylation, whose theme is S6K1, and another event, which again starts from
phosphorylation, with another object, 4E-BP1.
Now, those events also have -- if you look at the Event 4, which is under enhances, under
enhances, you have two types of positive regulation. And the positive regulation has a kind of
cause and effect, and the cause includes another one event, the Event 1, the expression, which
you will see here -- do you have any pointer perhaps or -- ah, yes, you do. Yeah, this one.
So you have this, so the Event 1. And then the Event 2, which is phosphorylation, which is the
theme here, and another more complex event, just to give you an idea of the types of analysis we
need to do.
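The nested structure described for the aurora B sentence can be sketched as a small data structure. This is only an illustration, not how any particular system stores events: the class names, the role labels `Theme` and `Cause`, and the identifiers `E3` and `E5` are assumptions (the talk names only Event 1, Event 2, and Event 4).

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Entity:
    """A named bio-entity such as a protein or gene."""
    name: str


@dataclass
class Event:
    """An event with a type, a trigger word, and role-labelled arguments.

    An argument is either an Entity or another Event, which is what
    makes an event more than a binary relation.
    """
    event_id: str
    event_type: str
    trigger: str
    args: Dict[str, Union[Entity, "Event"]] = field(default_factory=dict)

    def depth(self) -> int:
        """Nesting depth: 1 for a simple event, more if arguments are events."""
        nested = [a.depth() for a in self.args.values() if isinstance(a, Event)]
        return 1 + max(nested, default=0)


# "expression of aurora B enhances phosphorylation of S6K1 and 4E-BP1"
e1 = Event("E1", "Expression", "expression", {"Theme": Entity("aurora B")})
e2 = Event("E2", "Phosphorylation", "phosphorylation", {"Theme": Entity("S6K1")})
e3 = Event("E3", "Phosphorylation", "phosphorylation", {"Theme": Entity("4E-BP1")})
# "enhances" yields two positive regulation events, one per phosphorylation.
e4 = Event("E4", "Positive_regulation", "enhances", {"Cause": e1, "Theme": e2})
e5 = Event("E5", "Positive_regulation", "enhances", {"Cause": e1, "Theme": e3})
```

Here `e1.depth()` is 1 for the simple expression event, while `e4.depth()` is 2 because both of its arguments are themselves events.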
Why we need to do that is because we need to go to a level which is bridging the
universe of text, the sphere of text, with knowledge. And biologists really care much more about
biological events and biological pertinence, what is most important,
than about what is actually the expression in text.
So now that you've seen what we're actually trying to do: an event goes beyond a relation. People in
the past have really focused on extracting protein [inaudible] interactions, gene-disease
associations, so that is still a kind of binary type of relation, which is in a sense somewhat simpler
to do.
But what we are dealing here is much more complex, is really going into a type of bridging to the
knowledge sphere of biology. So it's what we call a more -- an event is a dynamic, a
bio-relation, which most of the time is -- has many arguments. So very rarely it's a kind of
binary.
And all those events, how we draw them from reality: we use ontologies, very often we use GO,
but you can create, use, define or measure different types of ontologies. And they have
a lot of different participants. In linguistics we call them theme, cause, or whatever kind of
arguments. But these roles are really geared by the domain, basically they're domain dependent.
And the various types of participants could be, as you saw before, other entities, like bio-entities,
proteins or genes, but increasingly they include other events. So a bio-event is quite
complex because it draws from ontologies and includes other types of entities.
So this is going to be basically the focus of my talk. I'll just focus on events. And I will explain
what kind of technology we have used to build that, the work of members of my team,
and how we applied this event extraction to search, to semantic search, to existing systems we
already had like MEDIE, or to systems which actually mine direct and indirect
associations, where you can do this including events.
Increasingly events of course are very important for extracting bioprocesses, like
angiogenesis. And also, going back to my first slide and the various -omics types of
problem, if you want to integrate multiple levels of biological organization, which I will discuss.
In addition, we have included -- integrated this technology towards pathway reconstruction. And
I don't say construction because we don't do automatic construction yet, that's very difficult, but
how you can actually produce the types of evidence to enrich the pathways.
And last but not least, what we think is a very interesting and upcoming area is what we
call event interpretation: what are the experimental findings, what is known
information, or old, or hypothetical or speculative. That could be very important when
you're building for search and for [inaudible] and for pathways.
For that, which I will not discuss, you need of course to have -- most of our techniques currently have
been supervised, so you need training data. And I'm not going to talk
about this, but you can find the one I think quite a few people know about, GENIA, and other
event corpora that we have built. And last but not least I'll end my talk with the
shared tasks, which are extremely important in the community, first of all to inform and evaluate our
tools but also to obtain the training corpora to be able to build more tools.
So the focus on the applications is basically semantic search. Hypothesis generation, which
is very important for medical and clinical applications, in terms of mining direct and indirect
associations. Extracting events across multiple domains. Enriching pathways. And also, to do
that, I will very briefly allude to an environment, a platform we have built, which integrates the
processing components and annotations, which is very important for the curators.
So basically, what I said before, it's a kind of nice diagram, and because I've done it, why not.
It's exactly the same thing I told you before. This sentence is basically kind of represented by the
various events, phosphorylation, binding, different types of arguments. Here you have, for
instance, a side theme and then the top basically event is negative regulation.
This is what the biologists when they want to search they're interested in negative regulation and
positive regulation and in addition and so on. Very rarely, if you start searching with keywords,
you would have an enormous amount of knowledge with not too much relevance for biology.
So for that we have -- one of the tools I'll present briefly, and if you want there are lots of papers
that -- it's basically Miwa's work, who used to work with Jun'ichi's team in Tokyo, where he
started this actually tool and now he's working with NaCTeM.
So EventMine basically detects and extracts event structures, and it uses a deep parser, Enju. I'm
not going to talk about Enju; it's basically extracting predicate-argument structures. So
it's a kind of deeper syntax.
So what it does, what EventMine does, maps from the deep parse results into event structures.
And then here's -- Miwa has used all sorts of different features, experiments you will see for
classification, shortest path, bag-of-words, and so on.
And it's an SVM type of -- it's training classifiers with SVM using various annotated corpora for
each module. The annotated corpora are mostly the ones used by the shared tasks, that's why they're
very important, and of course the work of GENIA that we have [inaudible] has built over several
years.
So how this actually works, very briefly; if you want more, you can read especially the [inaudible]
papers where he describes EventMine in detail. It's a pipeline. It's a kind of traditional pipeline
of an event extraction system which has the following components.
So each component is done more independently. You have the trigger and entity
detector. So you have, like, phosphorylation. Identifying triggers is quite challenging,
often because of lots of ambiguity, and there's been quite a lot of work by people trying to
improve the extraction of triggers. So you have negative regulation, inhibits. Another
trigger for binding, so for the event binding the trigger is binding, and entities.
Then the next part is, once you identify the triggers, you have to find the arguments, or
the edges. And this is based on Enju, on the deep parser, so you have inhibits and
phosphorylation, inhibits to binding, and in a theme relation the arguments binding and CD40, and
so on.
I think what we believe is a very interesting, as you will see later on, part is how you're dealing
with multiple arguments. So you have then a multi-argument event detector, so you have here
inhibits has causality, so you have this type of information, and binding, a theme, and so on,
complex basically events.
And multi-argument event detection is extremely important for the types of complex events we
were talking about. And this will be seen with some of the results they have produced.
Last, on top of that, once you finish with the multi-argument detection, you
have then modification, and the modification is mostly information like paging, speculation,
contradiction, negation, and so on.
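The four stages just described (triggers, edges, multi-argument events, modification) can be sketched as a toy pipeline. This is a deliberately simplified, rule-based stand-in: the system described in the talk trains SVM classifiers over deep-parse features for each module, whereas the lexicons, the nearest-neighbour attachment rules, the function name `run_pipeline`, and the example sentence with the placeholder `ProteinA` are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lexicons; a real trigger detector is a trained classifier.
TRIGGERS = {
    "inhibit": "Negative_regulation",
    "binding": "Binding",
    "phosphorylation": "Phosphorylation",
}
SPECULATION_CUES = {"may", "might", "possibly"}
NEGATION_CUES = {"not", "no"}


def run_pipeline(tokens: List[str], entities: Set[str]) -> List[Dict]:
    entity_idx = [i for i, t in enumerate(tokens) if t in entities]

    # Stage 1: trigger detection (here a plain lexicon lookup).
    triggers = [(i, TRIGGERS[t.lower()])
                for i, t in enumerate(tokens) if t.lower() in TRIGGERS]

    # Stage 2: edge (argument) detection with toy attachment rules.
    edges: List[Tuple[int, str, int]] = []
    for i, etype in triggers:
        left_entities = [j for j in entity_idx if j < i]
        right_entities = [j for j in entity_idx if j > i]
        right_triggers = [j for j, _ in triggers if j > i]
        if etype.endswith("regulation"):
            if left_entities:
                edges.append((i, "Cause", left_entities[-1]))
            if right_triggers:
                edges.append((i, "Theme", right_triggers[0]))
        elif right_entities:
            edges.append((i, "Theme", right_entities[0]))

    # Stage 3: group the edges of each trigger into one multi-argument event.
    events: Dict[int, Dict] = {}
    for i, etype in triggers:
        args = {role: tokens[j] for (t, role, j) in edges if t == i}
        if args:
            events[i] = {"type": etype, "trigger": tokens[i],
                         "args": args, "modifiers": []}

    # Stage 4: modification, using the word right before the trigger as a cue.
    for i, event in events.items():
        previous = [t.lower() for t in tokens[max(0, i - 1):i]]
        if any(w in SPECULATION_CUES for w in previous):
            event["modifiers"].append("Speculation")
        if any(w in NEGATION_CUES for w in previous):
            event["modifiers"].append("Negation")

    return list(events.values())


events = run_pipeline("ProteinA may inhibit binding of CD40".split(),
                      {"ProteinA", "CD40"})
```

On this made-up sentence the sketch yields a speculated negative regulation event whose cause is `ProteinA` and whose theme is the binding trigger, plus a binding event with theme `CD40`.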
So in the BioNLP task -- actually here it's been -- we got 58.15 percent, which is
really one of the top F-scores.
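For readers less familiar with the measure, an F-score combines precision and recall. The sketch below assumes the balanced F1 variant (the harmonic mean of the two); the function name is illustrative and the talk does not give the precision and recall behind the reported figure.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, the score is pulled towards the lower of the two values, so a system cannot reach a high F-score on precision or recall alone.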
So some further information, which you can find in the latest paper: it's basically
a multi-class, multi-label classification problem, and some of the feature types are described here
for triggers, for arguments -- shortest paths, terminal nodes, words around candidate pairs, and so
on -- and for multi-argument and modification.
So there have been several extensions to EventMine, which we thought they were quite
important, especially when we're dealing with full papers. But I think the most important is
when you're trying to adapt EventMine to different domains, not even -- even within biology and
biomedicine. So if you're trying to extract, for instance, for pathways, you have signaling to
metabolic pathways, you have different types of arguments and you need to adapt your type of
event extraction.
So the first -- so this is basically -- in this recent paper everything is described there, so I'm not
going to repeat the same paper, but I'll just very, very briefly talk about the coreference
resolution and the domain adaptation, and in the end I will talk about meta-knowledge
assignment.
So in a sense EventMine, after you extract multi-argument events, does three more things. It
does coreference, it already includes a component for domain adaptation using various
corpora, [inaudible] meta-knowledge assignment, basically hypothesis, negation, and
speculation.
Well, a very simple thing: what is coreference resolution? You link mentions
and antecedents. So in this case it's a very simple example: "M-CSF treatment was also
associated with a rapid induction of jun-B gene, although expression of this gene was prolonged
compared to that." So you need to link mentions with antecedents.
This is increasingly important. It's more important for full papers than for abstracts,
so the results can be seen when you're extracting events from full papers. Although, I
have to warn you, you don't find a kind of wow, fantastic improvement in
EventMine, in the event recognition. So it's still a difficult problem. That's [inaudible].
So this is -- basically includes a kind of rule-based coreference where you actually have to detect
the mentions candidates, then the antecedents, and then the links.
So basically how those -- sorry -- how those results are integrated into the event extraction
system is by modifying the parse results so mentions and antecedents share the dependencies.
This is the PR feature. And then extending the features you have coreference mentions to
argument detector feature, the FE.
So what you see here is basically the best performance, by adding all those features,
and it goes from the baseline of 58.15 to 58.81.
So a tiny one, a tiny one, but this was actually not trained on full papers. We trained on abstracts and applied it to
full papers. So it's actually not bad. I think if we had annotated corpora of full
papers with coreference, then it would have been slightly better.
And the domain adaptation, which I think is much more interesting: he used two methods for
domain adaptation, the stacking method and the instance weighting method, which he has applied
to the two shared tasks, the GENIA 2009
and '11.
The interesting thing is here, actually. In the last shared tasks, there have been
full papers and abstracts. And we also had types of relations and events which are of
quite different types. So the infectious diseases, for instance, or the epigenetics corpus have different
types of events, and of course the other phosphorylation and the additional shared tasks.
So when you're actually comparing the performance, you have to see it across different domains,
in this case epigenetics or infectious diseases, and different types of text, full papers and
abstracts. So actually from here you can see quite a big jump, from 47 percent to 51 percent and
from 50 to 52.39.
So basically the -- by including those components -- [inaudible], oh, yes, compared with other
systems, you might say, well, you know, it's -- event extraction, you will not have the
performances you have in entity recognition. You will still be in the white -- I think top close to
60, currently it's very -- it's a good actually result.
But it is important in comparison to other systems to be able to deal well with full papers and
abstracts, to be able to deal well across various different types of a corpora which deal with
different types of events, and basically this is in a sense for full papers you do quite a bit better,
much better, if you incorporate coreference.
So these are basically some of the enhancements of EventMine, which actually boosted
the performance a lot, and it outperforms other systems. If you want more about all the details, it's in the
paper, so --
>>: [inaudible]
>> Sophia Ananiadou: [inaudible] is the name and [inaudible] in Finland. So those are the top --
so this is how we compare with a top system.
>>: [inaudible] system combination?
>> Sophia Ananiadou: It's a system combination, yeah.
>>: [inaudible] Stanford and Massachusetts?
>> Sophia Ananiadou: Yeah, sorry, I forgot [inaudible]. So those are the top -- I mean, we
compare with the top system. So -- basically you saw that, all the details are in the latest
Bioinformatics paper.
So how we use that now. So what's -- okay, fine. Did we have fun improving performances in
minute details?
So, first of all, biologists want to search, and they want this type of system to be as accurate as
possible for search. So we used MEDIE, which was built by Jun'ichi when he was in
Tokyo before he joined Microsoft. And we enhanced it. MEDIE already was doing
semantic search based on facts. It was a system which in 2006 was actually quite novel; it was
extracting facts from the whole of MEDLINE based on deep parsing.
And so, for instance, you can extract what is activated by circadian clock, what cyclins are
regulated -- so you are basically utilizing all the syntactic variability that you have in text, and
when you're making this query, you really extract proper subjects and objects, which at the time
it was not possible for the systems.
This is the system you can still -- well, this is Enju, so I'm not going to talk about Enju with
Jun'ichi here, but if you want to find information about it, it's basically based on HPSG.
So this is how it looked before we added the events. It's on the Web
site of NaCTeM, so it's open, people can use the Web services or can hook into this if
they want to. It's based on the whole of MEDLINE. And currently we
use a template type of subject, verb, object, so you can ask questions like -- which then translate
into a query -- p53 -- what p53 activates.
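The subject, verb, object template can be illustrated with a toy in-memory lookup. This is only a sketch of the query shape, not MEDIE's implementation or API: the `Fact` tuple, the `search` function, the synonym table, and the placeholder objects `GENE_A` to `GENE_D` are all hypothetical, and the real system extracts its facts from deep parses of MEDLINE.

```python
from typing import List, NamedTuple, Optional


class Fact(NamedTuple):
    subject: str
    verb: str  # verb lemma
    obj: str


# Hypothetical synonym expansion for the query verb.
VERB_SYNONYMS = {"activate": {"activate", "amplify", "mediate"}}

FACTS = [
    Fact("p53", "activate", "GENE_A"),
    Fact("p53", "amplify", "GENE_B"),
    Fact("GENE_C", "activate", "p53"),  # p53 is the object here, not the subject
    Fact("p53", "bind", "GENE_D"),
]


def search(facts: List[Fact], subject: Optional[str] = None,
           verb: Optional[str] = None, obj: Optional[str] = None) -> List[Fact]:
    """Return facts matching the filled template slots; None is a wildcard."""
    verbs = VERB_SYNONYMS.get(verb, {verb}) if verb is not None else None
    return [
        f for f in facts
        if (subject is None or f.subject == subject)
        and (verbs is None or f.verb in verbs)
        and (obj is None or f.obj == obj)
    ]


# "What does p53 activate?": subject and verb filled, object left open.
hits = search(FACTS, subject="p53", verb="activate")
```

The query returns the facts with objects `GENE_A` and `GENE_B`. The fact where p53 is the object is excluded, which is the distinction between proper subjects and objects that keyword search cannot make.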
What you have here are basically the sentences. As you see, you have
"may amplify," you have -- it's a kind of expansion with ontologies, and it extracts sentences which
are pertinent to this query. So basically, as you can see here, it deals very well with
[inaudible] and all that, which is very important.
And also you can change the format. You could look at it in a more tabular form. So
you can see here you have verbs like amplify, mediated, activated, which are synonyms and
they're very relevant to your query.
And immediately the user can see, for what p53 activates, from those sentences
whether they're of interest. And if they're of interest, then they can click on the title of the
paper. And also they're all linked to all the various databases. So you can just click on a gene
or on a disease and have access to all the various databases.
So going back to events, all this kind of multi-argument, how we can actually now change
MEDIE and add events, how we can search with events.
So we use now this type of events based, again, on the shared tasks. That's why shared tasks are
important. Because it's quite a lot of work. You have to -- in a sense -- in molecular biology,
these are the types of events of upper-level that people are looking for. So if you ask them what
else do you need, they will come back to quite high-level phosphorylation, binding, positive
regulation, and so on. This is what they want to search.
So if I just put my query as localization, you will have the interface, localization of -- you don't
have to -- you can specify the type of thing, the object, or not. If you don't specify, what you will
have now are sentences which are retrieved with an unspecified location and theme.
Okay. So they are still sentences retrieved with a localization event as a query.
You can then, just to give an example, you put localization of TNF-alpha, you specify, so these
are the sentences extracted with this specific type of argument. And you can have -- oh, this is
actually in a different tabular form. So you can see immediately -- oops, I'm going to go back.
Yeah. And this is a much more complex where you have a positive regulation and another event
as well, phosphorylation, of -- in various arguments.
So although this sounds quite complex, in a sense this is exactly the type of information that if
you want to, for instance, reconstruct pathways or if you want to ask questions of biological
relevance, this is the type of upper-level information that people want to know.
So what you have in fact are just various instances, various realizations, of this upper-level
biological event. And you can specify of course the site or the cause if you want to. But
currently it gives you automatically the sentences from the whole of MEDLINE that respond to this
kind of query.
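Searching by event type with optional argument slots can be sketched as a filter over extracted event records. The record layout, the role names `Theme` and `ToLoc`, the function `search_events`, the sentence identifiers, and `PROTEIN_X` are assumptions for illustration only.

```python
from typing import Dict, List

# Hypothetical records: one per extracted event, pointing back to its sentence.
EVENT_INDEX: List[Dict] = [
    {"type": "Localization", "args": {"Theme": "TNF-alpha"}, "sentence": "s1"},
    {"type": "Localization",
     "args": {"Theme": "PROTEIN_X", "ToLoc": "nucleus"}, "sentence": "s2"},
    {"type": "Phosphorylation", "args": {"Theme": "TNF-alpha"}, "sentence": "s3"},
]


def search_events(index: List[Dict], event_type: str, **slots: str) -> List[str]:
    """Return sentence ids for events of the given type.

    Slots passed as keyword arguments must match exactly; slots that are
    left out are unconstrained.
    """
    hits = []
    for event in index:
        if event["type"] != event_type:
            continue
        if all(event["args"].get(role) == value for role, value in slots.items()):
            hits.append(event["sentence"])
    return hits


all_localization = search_events(EVENT_INDEX, "Localization")
tnf_localization = search_events(EVENT_INDEX, "Localization", Theme="TNF-alpha")
```

Leaving the theme open returns both localization sentences (`s1` and `s2`); constraining it to TNF-alpha narrows the result to `s1`.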
So -- and that's a different way of representing it. So you see here you have -- it's a kind of frame, a
knowledge frame, really, which is extracted from text right now. So you have various types of
responses to this slot.
So this is actually one way complex events, if I go back to this, can be integrated into
a search system like MEDIE. So you can upgrade it to just do that, but you can also upgrade it to
a set of events of different types of biological pertinence.
So in this case we have molecular biology, but you can absolutely train
EventMine to extract different types of events as long as you have the annotations, the
biological relevance.
So going back to that, another follow-up work on events came from the realization that most of
our focus for the past ten years was on molecular types of entities. So we were extracting genes,
proteins, chemicals, and drugs, and very often really focused on a set of simple, binary
types of associations, protein-protein, drug-drug.
So again, what the biologists are telling us is you need to expand, to go from the
molecular level to the organism. So this is very recent work. It's just going to be
published next month in Bioinformatics. It's event extraction which goes across
levels. So from the molecular to the anatomical -- cellular components, cells, tissues, and
organs -- to organisms.
And in the end, from this one, basically -- I don't know if I have that -- you want to
be able to extract this type of information as well. So right now we're going
somewhere here, where we want to be able to extract information about growth, about organs, about
anatomical information, and so on.
So this is where we have really worked, most of the community, for protein post-translational,
epigenetic regulations, molecular mechanisms. But this is a kind of limitation of going forward,
and especially for health, this is extremely important to go across levels.
So the approach: we have done some work on extracting complex bioprocesses based
on angiogenesis. That was in collaboration with AstraZeneca. Actually this project finished
last month. But we created a very nice corpus which had this type of very detailed
information, which is actually publicly available. It's a small corpus, but it took a lot
of time to prepare.
But initially this one used -- so the type span representation. So the new work we're doing across
level we basically added event representations, we extended the types to have more anatomical
entities and other, which you will see later, not immediately now, and based on OBO, GO, and
CARO, which are anatomical entities.
So here is actually the types of entities we used, examples, organisms, anatomical system,
organs, multi-tissue structure, developing structure, tissues, and so on, organism substance,
pathological formation. This is mostly following CARO.
And for an entity, anatomy-level events, those types are like skin development of fiber
formation, growth of arteries/tumor, remodeling, breakdown, death, cell proliferation, and
planned. This is mostly from [inaudible] level processes from GO.
So what we have actually used. This is mostly Sampo's work, and Miwa's. We
used tools, we used EventMine with Enju -- first we used Enju, then we applied EventMine,
adapted to the various types of event that you need for this specific domain. But what
you're recognizing now here is, you see, organ, multi-tissue structure, pathological formation,
organism substance and so on. So this is actually extending the problem to
do it more multi-level.
Some of the results here for -- by categories actually combined the baseline 57 [inaudible] 52 and
using various other [inaudible] resources and anatomical, for instance, 81, 76, and molecular, 72.
So it's okay.
The results, all the resources are on our Web site. The corpus is called MLEE, and it's
an extension of event extraction to various different levels of biological organization. It's very
richly annotated, with about 8,000 entity and 6,000 event annotations. And in a sense it also
shows how EventMine could be used in this type of domain. And we used various
resources like ontologies and so on.
So some of the references are here. The initial corpus was that in bioprocesses we did for
angiogenesis, and this is the one which is published, well, very soon, about few -- two weeks'
time.
So now I'm going to change again. Again events is the theme, but a slightly different
system, so again an application we have used. And this is actually I think closer to medical,
because in the FACTA system, which we have developed -- we have changed it a lot since 2008 --
we mine direct and indirect associations. It's very much the Swanson type of
hypothesis: you know, if A is related to B and B to C, then A to C as well. So this is a
quite well-known approach for knowledge discovery and hypothesis generation in
biomedicine.
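The Swanson-style A-B-C reasoning can be sketched as a two-step lookup over direct associations. This is a minimal illustration, not FACTA's implementation: the function name and the toy association table with placeholder concepts are assumptions, and it ignores the ranking measures a real system would apply.

```python
from collections import defaultdict
from typing import Dict, Set


def indirect_associations(direct: Dict[str, Set[str]],
                          source: str) -> Dict[str, Set[str]]:
    """Map each indirectly linked concept C to the pivot concepts B it shares
    with the source A, for every chain A-B-C with no direct A-C link."""
    linked = direct.get(source, set())
    result: Dict[str, Set[str]] = defaultdict(set)
    for pivot in linked:
        for target in direct.get(pivot, set()):
            # Skip the source itself and anything already directly linked.
            if target != source and target not in linked:
                result[target].add(pivot)
    return dict(result)


# Hypothetical direct co-occurrence associations between concepts.
DIRECT = {
    "A": {"B1", "B2"},
    "B1": {"A", "C"},
    "B2": {"A", "C", "B1"},
}

hypotheses = indirect_associations(DIRECT, "A")
```

Here `A` has no direct link to `C`, but both pivots lead there, so the sketch proposes `C` as a hypothesis supported by `B1` and `B2`.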
The system, as I said, currently operates on the whole of MEDLINE. How it
works: if you go on our Web site, if you put in a query like caffeine, it gives you -- a priori we have
identified some concepts which we thought were important when you're searching: genes,
diseases, symptoms, drugs, and compounds, which are ranked by different types of measures. I'm
not going to talk about the whole of FACTA right now, but they're obviously [inaudible]
information and frequency and so on.
So what it does basically if you're clicking into caffeine, the relations, direct relations between
caffeine and fatigue, you will have these types of snippets of text extracted from MEDLINE
abstracts.
So this is the types of -- actually most of the medics quite like FACTA. They like very much the
direct but also the indirect associations.
But this -- FACTA operates on queries, and queries can be complex, but basically nouns, you
know, concepts, concept associations.
So I'll also give you one slide about the indirect. Indirect is a two-step Swanson type
of hypothesis. You're doing a query from a pivot concept. So you want to say, for instance, how
diabetes affects [inaudible] diseases [inaudible] for instance.
And normally [inaudible] FACTA with E-cadherin is another example. Here it will tell you, if your
query is E-cadherin -- in this case our target is diseases related to E-cadherin via
proteins.
So this type of indirect association, and this is really the most interesting part for most
medics, tells you that E-cadherin is associated with Parkinson's disease via CASS4 and transcription
factor EB and so on.
Now, events again. So we thought, well, we can enhance FACTA to do this type of search, but we
can add events. So not only E-cadherin, but in this case we use the GENIA ontology,
again on the molecular level, so this can of course be enhanced with different types of
events.
So if you're searching for positive regulation, what FACTA now will do is extract not
only the associations between tumor and E-cadherin but those related to positive regulation. So it
does concept-based direct and indirect associations with events.
And this is another example actually, which is positive regulation. So I should have given you
an example which I skipped. Now, this is actually a step towards how you can enhance FACTA with
different concepts and with different event types. And with that it's also very nice to visualize it,
because people are fed up with looking at long lists of names.
So this is how it looks if you find direct association with E-cadherin. So you see here how
important is that where you show other entities concepts. And you can see the indirect
associations. And you can see here how the indirect associations are linked in with melanoma
with various other concepts. This has to do with we have chosen disease and gene.
And this is actually more various other indirect associations from different other ways. And that
one here is with events. So basically you see indirect or direct associations with an event where
you can visualize it.
So you can see immediately that -- well, not automatic, so how E-cadherin is indirectly
associated with the nervous system disorders basically, like Alzheimer's, Parkinson's disease, and
epilepsy. And you can just go and drill down through the documents.
But in this case I thought it was very interesting to show you an application, how
event extraction can be embedded into existing systems, either more complex ones like
MEDIE, which use deep parsing, or FACTA, which is more about concept
associations.
So now, somewhat changing the subject, I'll show you another application of how
event extraction is currently being used for pathways. A slightly
different topic, but the same theme.
So I don't know how many of you know about pathway construction or don't know the
background at all, but pathways are, again, at the core of systems biology and systems
medicine. And increasingly people want to see how we can link evidence from text to pathways.
So -- and that's a very challenging problem, very -- automatically constructing pathways is really
like a Holy Grail, but we think we can do a lot towards providing lots and lots of evidence to
allow people to make decisions and construct models.
And actually, just to give you an example, for the mTOR pathway, to construct this
pathway people had to read 519 papers. So this is a manual process till now. They identified 964
entities and about 800 reactions.
So because this is manual, clearly there are lots of things that are missing. Just going through the
literature, there is how you first do search to find the documents and how you identify which
components are important, basically to say this is a reaction which interacts with
another reaction and so on.
So this is where we started actually. This is work we started with Jun'ichi a few years ago. And
we wrote the grant in 2006, I think. And the system I will talk about is PathText, which is still
ongoing. We keep on upgrading and updating it, and it's actually using -- linking with all sorts
of different pathways.
So the architecture of PathText is basically if you have various models or other kind of various
parameters here, so you have interactions or reactions between those modifiers, or we could use
in this case I think CellDesigner, but you could use any kind of editor to -- SBML model to
represent this kind of knowledge. You need to bridge that gap between model and text.
So in our case our PathText links to two of the systems. That's why I explained to you to
understand a bit about pathways. I'm not going to talk about this. So it's basically named-entity
search. But FACTA and MEDIE are providing, especially enhanced with events, are providing
the type of information that is needed to link pathways to text, how.
So -- well, this is exactly one kind of snippet to see how these various results. So here is the
pathway. In this case pathways are independent. In this case we use Payao. Payao is a kind of
interface between CellDesigner, which is a very common way of annotating the pathways, the
different editors of pathways.
But what is important for us to see how basically we can use the various publications from
PubMed or full papers, use our systems to integrate it with this type of model and give the
evidence.
So what PathText does is actually giving you the evidence to update your model.
And for this you also need a workbench platform to allow people to make decisions about the
ranking of reactions and the ranking of documents. I'm sorry if it sounds -- I'll just try to make it
as simple as possible, because it's a bit sometimes too much biological knowledge here.
Anyway, this is how it looks. We can forget it.
Now, remember [inaudible] is actually bridging the gap. The reason I put that is it's again based
on events. So when you're looking at reactions, the reactions are events. So if you see here this
information, like which is a protein, let's put that one here, you have this little square here, it's a
reaction.
That reaction is an event, and this event is basically a catabolism event. So this protein,
Vif, which is linked here, actually degrades A3G, but there is also another one here, this kind of diamond,
which actually induces this activity as well.
So this is how in order to link this type of representation, in this case with CellDesigner you have
a square or a diamond, in other editors you might have other types of semantics notations, this is
relevant in the end what you're doing, we're trying to find out, is linking events, finding events in
text with various entities.
And this is how basically PathText is doing that, is very much based on extracting events and
linking them with pathways.
So an example is if you're clicking, for instance, right now to this specific part, this one, you'll
have about 844 text mining. You see here you have automatic text mining, you can do it manual.
So you can do annotations as well and give it back to the system.
So the automatic goes to a FACTA automated mostly, and we'll extract this type information
enriched with events, and the curators, the biologists, will see which one is of relevance.
So what you can basically do now is start querying reactions by events. So in order to link
text with pathways you need to have a kind of interface.
So, for instance, for a heterodimer association from these two complexes,
the query is basically a protein reaction, an event, and your result is basically a
binding event from MEDIE.
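[Editorial sketch, not part of the talk] The idea of querying by reaction can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions: the class names, the mapping table, and the protein names are hypothetical and are not PathText's or MEDIE's actual interface. Only the pairing of a heterodimer association with a binding event, and of degradation with a catabolism event, comes from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A reaction node as it might appear in a pathway model (hypothetical)."""
    kind: str            # e.g. "heterodimer association"
    participants: tuple  # names of the entities attached to the reaction


@dataclass(frozen=True)
class EventQuery:
    """An event-shaped query to send to a text mining search system (hypothetical)."""
    event_type: str
    themes: tuple


# Assumed mapping from pathway reaction kinds to event types.
REACTION_TO_EVENT = {
    "heterodimer association": "Binding",
    "degradation": "Protein_catabolism",
}


def reaction_to_query(reaction: Reaction) -> EventQuery:
    """Translate a pathway reaction into an event query."""
    event_type = REACTION_TO_EVENT.get(reaction.kind)
    if event_type is None:
        raise ValueError(f"no event type mapped for reaction kind: {reaction.kind!r}")
    # Sort the participants so the same reaction always yields the same query.
    return EventQuery(event_type, tuple(sorted(reaction.participants)))


query = reaction_to_query(
    Reaction("heterodimer association", ("ProteinB", "ProteinA"))
)
print(query)
```

The point of the design is that the curator never writes the query by hand; it is derived from whatever reaction they click on in the pathway editor.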
So this is exactly the type of information the biologists get automatically from text to be able to
update and reach and find the evidence in the pathways.
So to do that, if I just want to put the whole architecture, the whole image of what we are doing
right now: here are your pathways, your users, your biologists. What you use here -- that could be anything. We're using CellDesigner because our systems biologists use CellDesigner
[inaudible]. So this is the kind of interface.
But what we are doing is basically we are working on building the queries using events, we are
very much working on gathering the relevance feedback from the biologists, curating
basically the results, and to do that we have our toolkit, the one based on Enju, EventMine, or other
systems. But the component I will talk about briefly now is a platform that allows
curation.
So what you need to do is when people are giving you this type of information, when you're
extracting automatically the information, are the biologists interested in that? Do they think it's
relevant? So you need to be able to get the feedback to improve the ranking. So that was
actually you have in the query. Then because it's based on machine learning, we're taking all this
information and every time we're improving the system for the specific type of pathways.
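[Editorial sketch, not part of the talk] The feedback loop described here can be pictured with a very small example. The talk only says that ranking is based on machine learning and improves with curator feedback; the simple additive weight update below is a stand-in chosen for illustration, not the method actually used in the system, and the feature names are made up.

```python
def score(weights, features):
    """Linear score of one candidate from its feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def update(weights, features, accepted, learning_rate=0.1):
    """Nudge the weights towards (or away from) a candidate the curator judged."""
    direction = 1.0 if accepted else -1.0
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + learning_rate * direction * value


def rank(weights, candidates):
    """Return the candidates sorted from most to least relevant."""
    return sorted(candidates, key=lambda c: score(weights, c["features"]), reverse=True)


# Two hypothetical pieces of evidence for the same reaction.
candidates = [
    {"id": "doc-1", "features": {"entity_cooccurrence_only": 1.0}},
    {"id": "doc-2", "features": {"has_binding_event": 1.0}},
]

weights = {}
# The curator accepts the event-based evidence and rejects the other one.
update(weights, candidates[1]["features"], accepted=True)
update(weights, candidates[0]["features"], accepted=False)

print([c["id"] for c in rank(weights, candidates)])
```

Each judgment from a biologist changes the weights a little, so the next ranking for the same kind of pathway reflects what earlier curators found relevant.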
So I -- in a sense this kind of sort of closes the loop of why you need events, why you need to
extract deeper information, why you need multi-arguments if you wanted to link the information
from pathways, which is at the core of systems medicine, with text.
So I don't know -- how much time do I have? Because I have quite a -- it's up to you.
>> Lucy Vanderwende: You can keep going.
>> Sophia Ananiadou: So a very small [inaudible] but this is important because we suggest to
use it for our shared task as well, so I should talk about it.
So one of the ways of using that is, very often now people use components for text
mining, processing components and annotation components, somehow separately. So we have
lots of annotation tools of different sophistication, but we also have text processing
components very much based on the UIMA architecture and philosophy.
So it's important to actually integrate the processing tools with annotation tools but also to
allow users to create text mining workflows which they can store, they can use, they can
share, and so on.
So two systems that we have done, one was a U-Compare, which started [inaudible] team, and
we expanded this by using multi-lingual system, I'm not talking at all right now. Another I'm
talking about, Argo with [inaudible] here, which very briefly I'll say what it does.
So what it does is basically it links with U-Compare as well, takes lots of processing
components. It's a Web-based application. It doesn't have an installation. You can access it
through a Web browser. And it's very interactive. So this is actually what curators can use to
see the annotations and to decide from the text mining results if the annotations are okay, choose,
and then basically, you know, feed it back to the system.
So very briefly, this is basically the whole thing, this is for both developers, for workflow
designers, and for annotators. So for developers you -- this is linked with U-Compare, you can
actually have all sorts of search engines, named-entity recognizers, target editors, XML editors
and so on.
So the workflow part actually allows you to design a workflow if you want to, for instance,
extract named entities, include taggers, species [inaudible] and end up with a named-entity
recognizer, and actually compare the various types of workflows as well. You can actually
process all the workflows remotely without people looking at this, and then the annotation editor
actually allows you to look at the results and make changes if you don't like them. So this is very
important for curation basically.
And it's using various Web services. So now Web site you can have a better look. It'll just tell
you later.
So here is the workflow. You can have a risk component. You can add your own documents.
You can actually allow basically a link to other people's documents. And you have kind of -- you can also store workflows, so at least the current and past workflows.
And here is actually the panel where you design, you just drag and click the
components. So you pick up various components and you just select the workflow: KLEIO, which is
a search system, a species tagger, which itself has another workflow, annotations, and various CAS writers.
And this is how they look like basically, which is manually explained.
So what basically does here is it's an example of a workflow, is to -- you can actually store these
various CAS writers by basically -- you can actually even allow to have different formats,
plaintext, XML, and so on.
So you can actually create and store a simple workflow with an annotation editor. And in this
case we have used -- this is [inaudible] for events, we have also used another system which
actually [inaudible] sample have worked is Brat. This is for events. So we're using those two
annotation environments for -- Argo mostly for entities and creating workflows and Brat for
events.
And this is how it looks like for -- you can actually remove, change, add, and put various
properties, and this is where you find the system. We're still developing it, but it's already sort of
a decent stage to have a look at. And of course we are very interested in actually having other
people contributing and sharing workflows and processing components.
Another thing, just very, very briefly: it also allows you to evaluate. So if you have different
components, you can just do the comparison on the basis of a reference evaluator.
Right.
So, now, some people are tired, so me too, and I'll finish my talk very -- with the last one, which
is the most -- the very recent work is the last enhancement of EventMine which allows you to do
extra modification of events.
So as you realize till now, we are all event based [inaudible] but so what it does,
meta-knowledge annotation is nothing new. People have talked about this for many, many years.
Our difference is that we're all based on events -- this kind of, if we can call it, pragmatic and discourse
information on events. So this is the main difference.
So it gives you different dimensions, different types of information, different based on an event.
And the important thing is allows you basically to detect what is new knowledge from this kind
of meta-knowledge and various types of contradictions that you have in text. It's extremely
important also for applications, for search of course, but also for [inaudible] communications
because we can use citation counts, all sorts of things we can integrate. It's quite interesting area
of research, we think.
So this is actually an annotation example to show you that the same type of event about X
activates expression of Y could be presented in text in completely different ways meaning
completely different things.
So even if you have an event which says about activation of a hypothetical protein with a
hypothetical gene, what -- how -- what it also wants to say about that.
So the first thing you can say is about we found that Y activates the expression, so it's -- you
know, you have a kind of knowledge type examined. You can very -- this also suggests, so it's a
bit of a speculative, not certain, or that has no effect the polarity or slightly increased the manner
or might affect certainty.
So there are various cues in around an event that tell you that this thing is perhaps not so certain,
it has a different would be negated, could be speculative, and so on.
So you don't have to look at this so much. Basically this is the whole schema, with different manners,
certainty, source -- if it's in this specific paper or other people are citing that -- if it's negative or positive,
if it's an investigation, an observation, a method, a fact, or other.
So what all those things are telling you basically are combined, new knowledge, or hypothesis.
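[Editorial sketch, not part of the talk] The schema just described can be written down as a small data structure. The dimension names and most values (knowledge types, certainty levels L1 to L3, polarity, source) come from the talk; the manner values and, in particular, the two combination rules at the end are illustrative assumptions, not the annotation scheme's official definitions.

```python
from dataclasses import dataclass
from enum import Enum


class KnowledgeType(Enum):
    INVESTIGATION = "investigation"
    OBSERVATION = "observation"
    ANALYSIS = "analysis"
    METHOD = "method"
    FACT = "fact"
    OTHER = "other"


class Certainty(Enum):
    L1 = 1  # lowest certainty
    L2 = 2
    L3 = 3  # highest certainty


class Polarity(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


class Manner(Enum):  # value set assumed for illustration
    HIGH = "high"
    LOW = "low"
    NEUTRAL = "neutral"


class Source(Enum):
    CURRENT = "current"  # reported by the paper itself
    OTHER = "other"      # attributed to cited work


@dataclass(frozen=True)
class MetaKnowledge:
    knowledge_type: KnowledgeType
    certainty: Certainty
    polarity: Polarity
    manner: Manner
    source: Source


def is_new_knowledge(mk: MetaKnowledge) -> bool:
    """Assumed rule: a confident observation or analysis made in the current paper."""
    return (
        mk.source is Source.CURRENT
        and mk.knowledge_type in {KnowledgeType.OBSERVATION, KnowledgeType.ANALYSIS}
        and mk.certainty is Certainty.L3
    )


def is_hypothesis(mk: MetaKnowledge) -> bool:
    """Assumed rule: an investigation or analysis stated with less than full certainty."""
    return (
        mk.knowledge_type in {KnowledgeType.INVESTIGATION, KnowledgeType.ANALYSIS}
        and mk.certainty in {Certainty.L1, Certainty.L2}
    )


# An event tagged as: analysis, certainty L2, positive, manner high, source current.
tagged = MetaKnowledge(
    KnowledgeType.ANALYSIS, Certainty.L2, Polarity.POSITIVE, Manner.HIGH, Source.CURRENT
)
print(is_new_knowledge(tagged), is_hypothesis(tagged))
```

Keeping each dimension as its own field means the same extracted event can be filtered or weighted along any one of them, for example by dropping negated events or down-weighting low-certainty ones.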
To do that, we took the GENIA event corpus, which [inaudible] has done in 2008, I think, or '9,
and we annotated it -- it was quite a lot of work, actually -- with meta-knowledge. So we took all
the event types and we created about 56 -- well, the existing one, and we had -- we used two
annotators, a biology expert and a linguist, and annotated the whole corpus with bio
meta-knowledge. There was actually a very good inter-annotator agreement.
So sort of the corpus that this -- you can see actually the certainty level, and tells you a lot about
how people write in text, so you can have different types of knowledge types, investigation and
methods, observations, very -- certainty, L3, but also lower certainty, the 6 percent and the 2.1
percent we think is very interesting. So you have facts which are reported with not so much
certainty. So that might be quite interesting if you want to construct pathways on the basis of
not-so-certain facts.
So you can put weights, for instance, polarities and manners and so on.
So rather than -- so how now we integrated all that is -- this is a well-known EventMine. We
added meta-knowledge to the EventMine. So we have the pipeline. In the end you have
meta-knowledge annotation, and what EventMine does, it tells you it extracts events and also
tags them if they're negative, if they have -- its analysis if it's high and so on. So basically you
have this type of extra information here.
So, as I said, the difference -- so you have a knowledge analysis, certain L2, not so negative,
amount high, and source current from this paper. So basically what you do, you're actually going
a step further to provide more analytics to events, and which can be used again for search and for
pathways, as I said before, and for output -- for instance, for [inaudible] communications. So I'll
just go very quickly on that.
So some results on EventMine, which we have used actually on the annotated corpus, and
also we applied it on the shared task that we had. You have different types of
performances on them: knowledge types, certainty, polarity here.
And for the negation and speculation, as you realize, there's a lot of work to be done yet. So
we're really struggling around the 35 percent and so on. It's actually with using various -- it's
actually doing quite well, EventMine, with all the various clues and based on different -- this was
GENIA applied on the shared task.
But basically we are all about, you know, some people -- it performs actually quite well across
various negation speculation. The total is overall better. And it's more -- what I said, the
performance is more stable around various types of hedging.
I think this is a very, as you can see some -- you know, in some case other people you have lower
negation but much better speculation, so I think it's quite important to have more stable perhaps
results across documents and also across the various types of information like of hedging.
And this is actually on abstract and full papers. This was also trained on abstracts, so do it again
with full papers. So you see again we are reaching about the 37 percent, which is the top
basically performance.
This is -- I wanted to stop basically with that [inaudible] with this, is that we -- I think this is a
very important area of research. It's hugely important if you want to take types of information,
what are the really certain, are the negated, are the contradictory, and how we can integrate, so
we need to improve -- we need much more work on that, the community needs much more work
on that to embed into our existing systems.
I'll finish, because I'm really tired now, by telling you about the future, again, which is a very important
project funded by the UK government, and we work on full papers, and not only abstracts, with
all this advent of open access. Out of the 2 million papers 12 percent are open access.
What we have produced now is what we call the EvidenceFinder, which you'll find here. And
what it does, we're going to embed events now and meta-knowledge, and that's why I finish with
this. What it does, you have a query like EGFR and breast cancer. Because our users are medics
and they don't want to even think about templates and subjects -- if you say subject and object
they will never use the system -- what we do, we're generating the questions for them.
So based on -- so we used [inaudible] parsing and so on extracting facts. But once the system -- oops -- you put a query, the system extracts, creates a number of questions, which we know of
course they will be answered because they exist in our stored parse results, and then they looked
at the extracts, and if they like the answers, they'll click on that.
So this type of system now, they want to add -- they're very interested in the meta-knowledge,
which is a challenge for us, because we are in the [inaudible] so they're really -- which is actually
kind of the future, people are very interested in the kind of hedging medicines and that, is
speculative, it's contradictory, who said again, of course, there are other components in UKPMC,
so this is the text mining part.
And so just go very quickly. And then if you go there, you just go through various entities as
well, which are highlighted.
But this type of system will now -- finish my talk now -- is going to be for the next couple of
years enriched with events of different types, [inaudible] with EventMine and also
meta-knowledge.
So because we have 2 million full papers, this is going to be a kind of full-scale analysis of full
papers, a search system based on full papers and abstracts, on events, and on actually
meta-knowledge.
Thank you for your patience. I hope I didn't tire you too much. Any other things that I said
today, all the services are on our Web site on services. All the tools, EventMine, everything is
on our Web site and all the publications. So any misrepresentation is utterly mine or there.
The people who actually have been extremely important -- I don't mention Genise [phonetic]
because he's been -- well, he's still our scientific brains. But the people who have been involved
current -- are currently involved in the center are these, and of course extremely indebted to all
their hard work.
And now I finish with what we're going to talk after. So I want to introduce you now, well, our
suggestion for a cancer genomics BioNLP shared task in 2013, which we will talk later during
the break. And we want to work on abstracts and full papers and can select -- basically this is a
follow up of [inaudible] events of course. It's a follow-up also of the angiogenesis corpus, and
also the corpus, [inaudible] corpus which we made available to the community, and we'd like to
extend it to new areas to work with people, oncologists in the area of cancer and add more
perhaps processes which would be of interest to have a shared task on cancer genomics.
Now, we would very much like to make available for the community and user our [inaudible]
platforms for people to use and also to prepare the shared task. And we're calling for people to work
with us, and this is where I stop. Thank you very much.
[applause]