>> Lucy Vanderwende: So good morning, everyone. Thank you for coming. It's my pleasure to introduce Sophia Ananiadou. She'll be giving a talk that's titled Extracting Events of Biomedical Relevance from Text. So currently Sophia is the director of the National Centre for Text Mining, University of Manchester, as well as a professor in computer science there. I've had the pleasure of knowing Sophia for quite a long time, and the first BioNLP workshop -- which was it in 2001 or 2000? I remember attending --
>> Sophia Ananiadou: '2.
>> Lucy Vanderwende: '2. I remember attending that workshop because I was really interested in text mining from biomedical text. And Sophia and Professor Tsujii both cautioned me. They said: You shouldn't go into this unless you have a partner in biology. It is just too much otherwise. And so I heeded their warning, but now I have partners, both at Microsoft and at University of Washington. But it's really been a very interesting field, and I'm excited that Sophia will tell us more about it. Thank you.
>> Sophia Ananiadou: Thanks, Lucy, for the introduction. So I will focus on only one area. I mean, we're doing several things at NaCTeM; I'll just give you the first slide to understand who we are and what we are doing. NaCTeM resides at the University of Manchester, and, as Lucy said, you need to be collocated with biology, so the place is the Manchester Interdisciplinary Biocentre, with mainly chemists, biologists, and people like that who are actually quite interested in what we're doing. So we started around 2004 and '5, and we focused predominantly on providing text mining service solutions for the bio domain, increasingly the medical bio domain. Now we are a sustainable center. I have listed only two or three of our funders. We have been funded by industry and other different types of funders.
So all the things I'm going to say today have been part of the work we have been doing at the center, but mostly I will focus on our most recent work on extracting events of biological relevance. I'll explain what that is. Why are we talking about events? In biomedicine we have what we call this fragmentation into different types of specialisms, sub-specialisms. There are different components, aspects in systems biology, or in systems medicine increasingly, that deal with chemistry, biology, medicine, and lots of -omics, at different levels, from transcriptomics to genomics, metabolomics, proteomics, et cetera. So to deal with the various translational aspects for different applications, one needs to go into a kind of deeper type of analysis of text, not just of course keyword extraction or association mining, but trying to find out more complex information, so I'll discuss what an event is. I will also place some emphasis in my talk on some applications which are related to events. So I'll skip over the rest, so the whole discussion is going to be about event extraction and how this has been realized in different applications. So in case you haven't heard this before -- I don't know about here, some people of course, the natural language processing people, know all about it, but maybe some of you do not -- if we take a simple sentence of the kind we find in many papers in MEDLINE, like "expression of aurora B enhances phosphorylation of S6K1 and 4E-BP1", which is a normal sentence, this sentence includes various types of events. Some are complex, some are simple. A simple event is of type expression, which is first realized with a [inaudible] expression. It has a three [inaudible] and the theme, the object, is aurora B. You have another event dealing with phosphorylation, whose theme is S6K1, and another event, again, which starts from phosphorylation, with another object, 4E-BP1.
Now, those events also have -- if you look at Event 4, which is under "enhances", you have two types of positive regulation. And the positive regulation has a kind of cause and effect, and the cause includes another event, Event 1, the expression, which you will see here -- do you have any pointer perhaps or -- ah, yes, you do. Yeah, this one. So you have this, so Event 1. And then Event 2, which is phosphorylation, which is the theme here, and another more complex event, just to give you an idea of the types of analysis we need to do. Why we need to do that is because we need to go to a level which is bridging the universe of text, the sphere of text, with knowledge. And biologists really care much more about biological events and biological pertinence, what is most important, than about what the actual expression in text is. So now you've seen what we're actually trying to do: an event goes beyond a relation. So people in the past have really focused on extracting protein [inaudible] interactions, gene-disease associations, so that's still a kind of binary type of relation, which is in a sense somehow simpler to do. But what we are dealing with here is much more complex, it is really going into a type of bridging to the knowledge sphere of biology. So an event is a dynamic bio-relation, which most of the time has many arguments. So very rarely is it binary. And all those events, how we draw them from reality: we use ontologies, very often we use the GO ontology, but you can create, use, define or measure different types of ontologies. And they have a lot of different participants. In linguistics we call them theme, cause, or whatever kind of arguments. But these roles are really geared by the domain, basically they're domain dependent.
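To make the nesting concrete, the example sentence can be written down as a small data structure. This is only an illustrative sketch, not the representation used by the speaker's tools: the class names, the role labels `Theme` and `Cause`, and the numbering of the second regulation event as E5 are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Entity:
    """A bio-entity mention such as a protein (hypothetical class)."""
    name: str
    type: str = "Protein"


@dataclass
class Event:
    """An event: a typed trigger word plus role-labelled arguments.

    An argument may be an Entity or another Event, which is what
    makes an event more than a binary relation.
    """
    id: str
    type: str
    trigger: str
    args: Dict[str, Union[Entity, "Event"]] = field(default_factory=dict)

    def depth(self) -> int:
        """Return 1 for a simple event, more when events are nested."""
        nested = [a.depth() for a in self.args.values() if isinstance(a, Event)]
        return 1 + max(nested, default=0)


# "Expression of aurora B enhances phosphorylation of S6K1 and 4E-BP1"
e1 = Event("E1", "Gene_expression", "expression", {"Theme": Entity("aurora B")})
e2 = Event("E2", "Phosphorylation", "phosphorylation", {"Theme": Entity("S6K1")})
e3 = Event("E3", "Phosphorylation", "phosphorylation", {"Theme": Entity("4E-BP1")})
# Two positive regulations share the trigger "enhances": the cause is the
# expression event and the theme is one of the phosphorylation events.
e4 = Event("E4", "Positive_regulation", "enhances", {"Cause": e1, "Theme": e2})
e5 = Event("E5", "Positive_regulation", "enhances", {"Cause": e1, "Theme": e3})

for ev in (e1, e2, e3, e4, e5):
    print(ev.id, ev.type, "depth", ev.depth())
```

The point of the sketch is that the two regulation events take other events as arguments, so a flat table of entity pairs could not hold them.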
And the various types of participants could be, as you saw before, other entities, like bio-entities, proteins or genes, but increasingly they are other events. So a bio-event is quite complex because it draws from ontologies and includes other types of entities. So this is going to be basically the focus of my talk. I'll just focus on events. And I will explain what kind of technology we have used to build that, as developed by members of my team, and how we applied this event extraction to search, the semantic search, in the existing systems we already had like MEDIE, or systems which actually mine direct and indirect associations, where you can do this including events. Increasingly events of course are very important for extracting bioprocesses, like angiogenesis. And also, going back to my first slide, the various -omics types of problem, if you want to integrate multiple levels of biological organization, which I will discuss. In addition, we have integrated this technology towards pathway reconstruction. And I don't say construction because we don't do automatic construction yet, that's very difficult, but rather how you can actually produce the types of evidence to enrich the pathways. And last but not least, what we think is a very interesting and upcoming area is what we call event interpretation, so whether these are experimental findings, or known information, or old, or hypothetical, or speculative; that could be very important when you're building for search and for [inaudible] and for pathways. For that, which I will not discuss, you need of course to have -- most of our techniques currently have been supervised, so you need training, you need to have training data.
And I'm not going to talk about this, but you can find the one I think quite a few people know about, GENIA, and other event corpora that we have built, and last but not least I'll end my talk with the shared tasks, which are extremely important in the community, first of all to inform and evaluate our tools but also to obtain the training corpora to be able to build more tools. So the focus in the applications is basically semantic search. Hypothesis generation, which is very important for medical and clinical applications, in terms of mining direct and indirect associations. Extracting events across multiple domains. Enriching pathways. And also, to do that, I will very briefly allude to an environment, a platform we have built, which integrates the processing components and annotations, which is very important for the curators. So basically, what I said before, it's a kind of nice diagram, and because I've done it, why not. It's exactly the same thing I told you before. This sentence is basically represented by the various events, phosphorylation, binding, different types of arguments. Here you have, for instance, a site, a theme, and then the top event is basically negative regulation. This is what the biologists are interested in when they want to search: negative regulation and positive regulation and in addition and so on. Very often, if you start searching with keywords, you would have an enormous amount of knowledge with not too much relevance for biology. So for that we have -- one of the tools I'll present briefly, and if you want there are lots of papers -- it's basically Miwa's work, who used to work with Jun'ichi's team in Tokyo, where he started this tool, and now he's working with NaCTeM. So EventMine basically detects and extracts event structures and is using a deep parser, Enju. I'm not going to talk about Enju; it's basically extracting predicate-argument structures. So it's a kind of deeper syntax.
So what EventMine does is map from the deep parse results into event structures. And then here -- Miwa has used all sorts of different features, experiments you will see for classification, shortest path, bag-of-words, and so on. And it's training classifiers with SVM using various annotated corpora for each module. The annotated corpora are mostly those used by the shared tasks, that's why they're very important, and of course the work of GENIA that we have [inaudible] has built over several years. So how this actually works, very briefly; if you want more, you can read especially the [inaudible] papers where he describes EventMine in detail. It's a pipeline. It's a kind of traditional pipeline of an event extraction system which has four components. So each component is done more independently. You have the trigger, the entity detector, so you have like "phosphorylation"; identifying triggers is quite challenging, often because of lots of ambiguity, and there's been quite a lot of work of people trying to improve the extraction of triggers. So you have negative regulation, "inhibits". Another trigger is for binding, so the trigger of the binding event is "binding", and entities. Then the next part is, once you identify the triggers, you have to find the arguments or the edges. And this is based on Enju, on the deep parser, so you have "inhibits" and "phosphorylation", "inhibits" to "binding", and in the theme relation the arguments of "binding", CD40 and so on. What we believe is a very interesting part, as you will see later on, is how you're dealing with multiple arguments. So you have then a multi-argument event detector, so you have here "inhibits" has causality, so you have this type of information, and binding, a theme, and so on, basically complex events. And multi-argument event detection is extremely important for the types of complex events we were talking about.
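The first three stages described above (triggers, then edges, then multi-argument events) can be sketched as a toy pipeline. This is a deliberately simplified illustration and not EventMine itself: the talk describes SVM classifiers over features from the Enju parse, whereas this sketch swaps in a small lexicon and word-order rules. The sentence, the lexicon, and all function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical trigger lexicon; a real system learns triggers from annotated corpora.
TRIGGER_LEXICON = {
    "inhibits": "Negative_regulation",
    "binding": "Binding",
    "phosphorylation": "Phosphorylation",
}


def detect_triggers(tokens: List[str]) -> Dict[int, str]:
    """Stage 1: map token positions to event types."""
    return {i: TRIGGER_LEXICON[t.lower()]
            for i, t in enumerate(tokens) if t.lower() in TRIGGER_LEXICON}


def detect_edges(triggers: Dict[int, str], entities: Set[int],
                 n_tokens: int) -> List[Tuple[int, str, int]]:
    """Stage 2: propose (trigger, role, argument) edges with toy word-order rules."""
    edges = []
    for i, etype in sorted(triggers.items()):
        later_triggers = [j for j in sorted(triggers) if j > i]
        if etype.endswith("regulation"):
            left = [j for j in sorted(entities) if j < i]
            right = [j for j in sorted(entities) if j > i]
            if left:  # nearest entity to the left acts as the cause
                edges.append((i, "Cause", left[-1]))
            if later_triggers:  # a regulation may take another event as theme
                edges.append((i, "Theme", later_triggers[0]))
            elif right:
                edges.append((i, "Theme", right[0]))
        else:
            stop = later_triggers[0] if later_triggers else n_tokens
            for j in sorted(entities):
                if i < j < stop:  # every entity before the next trigger is a theme
                    edges.append((i, "Theme", j))
    return edges


def build_events(tokens: List[str], triggers: Dict[int, str],
                 edges: List[Tuple[int, str, int]]) -> Dict[int, dict]:
    """Stage 3: group the edges of each trigger into one multi-argument event."""
    events = {i: {"type": t, "trigger": tokens[i], "args": []}
              for i, t in triggers.items()}
    for src, role, dst in edges:
        target = f"E{dst}" if dst in triggers else tokens[dst]
        events[src]["args"].append((role, target))
    return events


# Hypothetical sentence; ProteinA and ProteinB are placeholders.
tokens = "ProteinA inhibits binding of ProteinB to CD40".split()
entities = {0, 4, 6}  # token positions of entity mentions, assumed given
triggers = detect_triggers(tokens)
events = build_events(tokens, triggers, detect_edges(triggers, entities, len(tokens)))
for idx, ev in sorted(events.items()):
    print(f"E{idx}", ev["type"], ev["args"])
```

Keeping the stages as separate functions mirrors the point made in the talk that each component is handled more or less independently, so one stage can be retrained or replaced without touching the others.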
And this will be seen with some of the results they have produced. Last, on top of that, once you finish with the multi-argument detection, you have modification, and the modification is mostly information like hedging, speculation, contradiction, negation, and so on. So in the BioNLP task we got 58.15 percent, which is really one of the top F-scores. So some further information, which you can find in the latest paper: it's basically a multi-class, multi-label classification problem, and some of the feature types are described here for triggers, for arguments, shortest paths, terminal nodes, words around candidate pairs, and so on, and for multi-argument and modification. So there have been several extensions to EventMine, which we thought were quite important, especially when we're dealing with full papers. But I think the most important is when you're trying to adapt EventMine to different domains, even within biology and biomedicine. So if you're trying to extract, for instance, for pathways, you have signaling to metabolic pathways, you have different types of arguments and you need to adapt your type of event extraction. So in this recent paper everything is described, so I'm not going to repeat the same paper, but I'll just very, very briefly talk about the coreference resolution and the domain adaptation, and in the end I will talk about meta-knowledge assignment. So in a sense EventMine, after you extract multi-argument events, does three more things. It does coreference, it already includes some component for domain adaptation using various corpora, [inaudible] meta-knowledge assignment, basically hypothesis, negation, and speculation. Well, a very simple thing, what is coreference resolution? You link mentions and antecedents.
So in this case it's a very simple example: "M-CSF treatment was also associated with a rapid induction of jun-B gene, although expression of this gene was prolonged compared to that." So you need to link mentions with antecedents. This is increasingly important. It's more important for full papers than for abstracts, so the results can be seen when you're extracting events from full papers. Although, I have to warn you, you don't find a kind of wow and fantastic, you know, improvement in EventMine in the event recognition. So it is still a difficult problem. That's [inaudible]. So this basically includes a kind of rule-based coreference where you actually have to detect the mention candidates, then the antecedents, and then the links. So basically how those -- sorry -- how those results are integrated into the event extraction system is by modifying the parse results so mentions and antecedents share the dependencies. This is the PR feature. And then extending the features: you add coreference mentions to the argument detector features, the FE. So what you see here is basically the best performance, by adding all those features, and it goes from the baseline of 58.15 to 58.81. So a tiny one, a tiny one, but this was actually not trained on full papers. We trained on abstracts and applied to full papers. So it's actually not bad. We believe if we had annotated corpora on full papers with coreference, then it would have been slightly better. And the domain adaptation, which I think is much more interesting: he used two methods for domain adaptation, the stacking method and the instance weighting method, which he has applied for the two shared tasks, the GENIA 2009 and '11. The interesting thing is here, actually. In the last shared tasks, there have been full papers and abstracts.
And we also had types of relations and events which are quite different. So the infectious diseases, for instance, or the genetic corpus have different types of events, and of course the other phosphorylation and the additional shared tasks. So when you're actually comparing the performance, you have to see it across different domains, in this case be they genetics or infectious diseases, and different types of text, full papers and abstracts. So actually from here you can see quite a big jump, from 47 percent to 51 percent and from 50 to 52.39. So basically, by including those components -- [inaudible], oh, yes, compared with other systems, you might say, well, you know, in event extraction you will not have the performances you have in entity recognition. You will still be in the white -- I think close to 60 at the top, currently, is actually a good result. But it is important in comparison to other systems to be able to deal well with full papers and abstracts, to be able to deal well across various different types of corpora which deal with different types of events, and basically for full papers you do quite a bit better, much better, if you incorporate coreference. So these are basically some of the enhancements of EventMine, which actually boosted the performance a lot, and it outperforms other systems. If you want more about all the details, it's in the paper, so --
>>: [inaudible]
>> Sophia Ananiadou: [inaudible] is the name and [inaudible] in Finland. So those are the top -- so this is how we compare it with a top system.
>>: [inaudible] system combination?
>> Sophia Ananiadou: It's a system combination, yeah.
>>: [inaudible] Stanford and Massachusetts?
>> Sophia Ananiadou: Yeah, sorry, I forgot [inaudible]. So those are the top -- I mean, we compare with the top system. So all the details are in the latest Bioinformatics paper. So how do we use that now? So what's -- okay, fine.
Did we have fun improving performances in minute details? So, first of all, biologists want to search, and they want this type of system to be as accurate as possible for search. So we use MEDIE, which was built by Jun'ichi when he was in Tokyo before he joined Microsoft. And we enhanced it. MEDIE already was doing semantic search based on facts; it was a system which in 2006 was actually quite novel, extracting facts from the whole of MEDLINE based on deep parsing. And so, for instance, you can extract what is activated by circadian clock, what cyclins are regulated -- so you are basically utilizing all the syntactic variability that you have in text, and when you're making this query, you really extract proper subjects and objects, which at the time was not possible for other systems. This is the system you can still -- well, this is Enju, so I'm not going to talk about Enju with Jun'ichi here, but if you want to find information about it, it's basically based on HPSG. So this is how it looked before we added the events. It's on the Web site of NaCTeM, it's open, people can use the Web services or can hook into this if they want to. It's based on the whole of MEDLINE. And currently we use a template of the type subject, verb, object, so you can ask questions like -- which then translate into a query -- what p53 activates. What you have here are basically the sentences which, as you see, have "may amplify", you have -- it's a kind of expansion with ontologies, and it extracts sentences which are pertinent to this query. So basically, as you can see here, it deals very well with [inaudible] and all that, which is very important. And also you can change the format. You could look at it in a more tabular, you know, form.
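The subject, verb, object template with synonym expansion can be sketched as follows. This is only a toy illustration of the query idea, not MEDIE's implementation or API: the fact store, the synonym table, the placeholder gene names, and the `search` function are all assumptions. In the system described in the talk, the facts come from deep parsing of MEDLINE and the expansion from ontologies.

```python
from typing import List, NamedTuple, Optional


class Fact(NamedTuple):
    """One extracted fact, with the verb stored as a lemma."""
    subject: str
    verb: str
    obj: str


# Hypothetical synonym table standing in for ontology-based expansion.
VERB_SYNONYMS = {"activate": {"activate", "amplify", "mediate"}}

# Placeholder facts; GENE_W..GENE_Z are not real extraction results.
FACTS = [
    Fact("p53", "activate", "GENE_X"),
    Fact("p53", "amplify", "GENE_Y"),
    Fact("p53", "bind", "GENE_Z"),
    Fact("GENE_W", "activate", "p53"),
]


def search(facts: List[Fact], subject: Optional[str] = None,
           verb: Optional[str] = None, obj: Optional[str] = None) -> List[Fact]:
    """Return facts matching the template; a slot left as None matches anything."""
    verbs = VERB_SYNONYMS.get(verb, {verb}) if verb else None
    return [f for f in facts
            if (subject is None or f.subject == subject)
            and (verbs is None or f.verb in verbs)
            and (obj is None or f.obj == obj)]


# "What does p53 activate?" becomes subject=p53, verb=activate, object open.
hits = search(FACTS, subject="p53", verb="activate")
for f in hits:
    print(f.subject, f.verb, f.obj)
```

Because the match is on the subject and object roles rather than on keywords, the fact where p53 is the object of "activate" is not returned.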
So you can see here you have verbs like amplify, mediated, activated, which are synonyms and they're very relevant to your query. And immediately the user can see, for what p53 activates, from those sentences whether they're of interest. And if they're of interest, then they can click on the title of the paper. And also they're all linked to all the various databases. So you can just click on a gene or a disease and have access to all the various databases. So going back to events, all this kind of multi-argument: how can we now change MEDIE and add events, how can we search with events? So we use now these types of events based, again, on the shared tasks. That's why shared tasks are important. Because it's quite a lot of work. In a sense, in molecular biology, these are the types of upper-level events that people are looking for. So if you ask them what else they need, they will come back to quite high-level phosphorylation, binding, positive regulation, and so on. This is what they want to search. So if I just put my query as localization, you will have the interface for localization, and you can specify the type of thing, the object, or not. If you don't specify, what you will have now are sentences which basically are retrieved with an unspecified location and theme. Okay. So these are still sentences retrieved with a localization event as a query. You can then, just to give an example, put localization of TNF-alpha, you specify, so these are the sentences extracted with this specific type of argument. And you can have -- oh, this is actually in a different tabular form. So you can see immediately -- oops, I'm going to go back. Yeah. And this is a much more complex one where you have a positive regulation and another event as well, phosphorylation, with various arguments.
So although this sounds quite complex, in a sense this is exactly the type of information that you need if you want to, for instance, reconstruct pathways or if you want to ask questions of biological relevance; this is the type of upper-level information that people want to know. So what you have in fact are just various instances, various realizations, of this upper-level biological event. And you can specify of course the site or the cause if you want to. But currently it gives you automatically the sentences from the whole of MEDLINE that respond to this kind of query. And that's a different way of representing it. So you see here it's a kind of a frame, a knowledge frame, really, which is extracted from text right now. So you have various types of responses to this slot. So this is actually one way in which complex events, if I go back to this, can be integrated into a search system like MEDIE. So you can update it to just do that, but also you can upgrade it to a set of events of different types of biological pertinence. So in this case we have molecular biology, but you can absolutely train EventMine to be able to extract different types of events as long as you have the annotations, the biological relevance. So going back to that, another follow-up work on events came from the realization that most of our focus for the past ten years was on molecular types of entities. So we were extracting genes, proteins, chemicals, and drugs, and very often really focused on a set of them, on simple, binary types of associations, protein-protein, drug-drug. So again, what the biologists are telling us is you need to actually expand, to go from the molecular level to the organism. So this is very recent work. It's just going to be published next month in Bioinformatics. It's actually event extraction which goes across levels.
So from the molecular to the anatomical -- cellular components, cells, tissues, and organs -- to organisms. And in the end, if you have from this one, basically -- I don't know if I have that -- you want to be able to extract this type of information in the end as well. So right now we're going somewhere here, but we want to be able to extract about growth, about organs, about anatomical information, and so on. So this is where most of the community has really worked: protein post-translational modifications, epigenetic regulation, molecular mechanisms. But this is a kind of limitation going forward, and especially for health, it is extremely important to go across levels. So as for our approach, we have done some work on extracting complex bioprocesses based on angiogenesis. That was in collaboration with AstraZeneca. Actually last month this project finished. But we created a very nice corpus which had this type of very detailed information, which is actually publicly available, although it's a small corpus, but it took a lot of time to prepare. But initially this one used the typed span representation. So in the new work we're doing across levels we basically added event representations, we extended the types to have more anatomical entities and others, which you will see later, not immediately now, and based them on OBO, GO, and CARO, which covers anatomical entities. So here are actually the types of entities we used, examples: organisms, anatomical system, organs, multi-tissue structure, developing structure, tissues, and so on, organism substance, pathological formation. This is mostly following CARO. And for anatomy-level events, those types are like skin development of fiber formation, growth of arteries/tumor, remodeling, breakdown, death, cell proliferation, and planned. This is mostly from [inaudible] level processes from GO. So what have we actually used? Actually this is mostly samples of work, and Miwa's.
We used tools, we used EventMine with Enju -- first we used Enju, then we applied EventMine, adapted to the various types of event that you need for this specific domain. But what you're recognizing now here is, you see, organ, multi-tissue structure, pathological formation, organism substance and so on. So this is actually extending the problem to make it more multi-level. Some of the results here by categories: combined, the baseline 57 [inaudible] 52, and using various other [inaudible] resources, anatomical, for instance, 81, 76, and molecular, 72. So it's okay. The results, all the resources are on our Web site. The corpus is called MLEE, and it's actually an extension of event extraction to various different levels of biological organization. It's very richly annotated with about 8,000 entity and 60 event annotations. And in a sense it also shows how EventMine could also be used in this type of domain. And we used various resources like ontologies and so on. So some of the references are here. The initial corpus was the one on bioprocesses we did for angiogenesis, and this is the one which is published, well, very soon, in about two weeks' time. So now I'm going to change again. Again events is the theme, but a slightly different system, so again an application we have used. And this is actually I think closer to medical, because in the FACTA system, which we have developed -- we have changed it a lot since 2008 -- we mine direct and indirect associations. It's very much the Swanson type of hypothesis of, you know, if A is related to B and B to C, then A to C as well. So this is a quite well-known approach for knowledge discovery and hypothesis generation in biomedicine.
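The Swanson-style reasoning just described (A relates to B, B relates to C, so A may relate to C) can be sketched over a set of direct associations. This is a minimal illustration under stated assumptions, not FACTA's implementation: the function name is hypothetical, the concept names are placeholders, and candidates are ranked here simply by the number of linking concepts, which is a stand-in for the ranking measures a real system would use.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple


def indirect_associations(pairs: Iterable[Tuple[str, str]],
                          query: str) -> List[Tuple[str, Set[str]]]:
    """Find concepts C linked to the query only through a shared concept B.

    `pairs` are undirected direct associations, for example concept pairs
    that co-occur in the same abstract. Concepts already directly associated
    with the query are skipped, since they are not new hypotheses.
    """
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    direct = neighbours[query]
    via = defaultdict(set)  # candidate C -> set of linking concepts B
    for b in direct:
        for c in neighbours[b]:
            if c != query and c not in direct:
                via[c].add(b)
    # Rank by number of linking concepts, then alphabetically for stability.
    return sorted(via.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Placeholder associations: A is the query, B* are pivots, C* are targets.
PAIRS = [("A", "B1"), ("A", "B2"), ("B1", "C1"), ("B2", "C1"),
         ("B2", "C2"), ("A", "C3"), ("B1", "C3")]

result = indirect_associations(PAIRS, "A")
for target, pivots in result:
    print(target, "via", sorted(pivots))
```

Here C3 is left out because it is already directly associated with A; only C1 and C2 are proposed as indirect associations.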
The system currently operates, as I said, on the whole of MEDLINE, and how it works: if you go on our Web site, if you put a query like caffeine, it gives you -- a priori we have identified some concepts which we thought were important when you're searching: genes, diseases, symptoms, drugs, and compounds, which are ranked by different types of measures. I'm not going to talk about the whole of FACTA right now, but they're obviously [inaudible] information and frequency and so on. So what it does basically, if you're clicking on caffeine, on the direct relations between caffeine and fatigue, is give you these types of snippets of text extracted from MEDLINE abstracts. Actually most of the medics quite like FACTA. They like very much the direct but also the indirect associations. But FACTA operates on queries, and queries can be complex, but they are basically nouns, you know, concepts, concept associations. So I'll also show you just one slide about the indirect. Indirect is a two-step Swanson type of hypothesis. You're doing a query from a pivot concept. So you want to say, for instance, how diabetes affects [inaudible] diseases [inaudible] for instance. And normally [inaudible] FACTA with E-cadherin is another example. Here it will tell you, if your query is E-cadherin -- in this case our target is diseases related with E-cadherin via proteins. So this type of indirect association, and this is really the most interesting part for most medics, tells you that E-cadherin is associated with Parkinson's disease via CASS4 and transcription factor EB and so on. Now, events again. So we thought, well, we can enhance FACTA to do this type of search, but we can add events. So not only E-cadherin, but in this case we use the GENIA ontology, again on the molecular level, so this can of course be enhanced with different types of events.
So if you're searching for positive regulation, what FACTA will now do is extract not only the associations between tumor and E-cadherin but those related with positive regulation. So it does concept-based direct and indirect associations with events. And this is another example actually, which is positive regulation. So I should have given you an example which I skipped. Now, this is actually a step towards how you can enhance FACTA with different concepts and with different event types. And with that it's also very nice to visualize it, because people are fed up looking at long lists of names. So this is how it looks if you find direct associations with E-cadherin. So you see here how important it is where you show other entities, concepts. And you can see the indirect associations. And you can see here how the indirect associations are linked in with melanoma, with various other concepts. This is because we have chosen disease and gene. And this is actually more, various other indirect associations from different other ways. And that one here is with events. So basically you see indirect or direct associations with an event where you can visualize it. So you can see immediately -- well, not automatically -- how E-cadherin is indirectly associated with nervous system disorders basically, like Alzheimer's, Parkinson's disease, and epilepsy. And you can just go and drill down through the documents. But in this case I thought it was very interesting to show you an application of how event extraction can be embedded into existing systems that do either something more complex like MEDIE, which uses deep parsing, or FACTA, which is more on concept associations. So now, shifting somewhat, I'll show you another application of how event extraction is currently being used for pathways. Just a slightly different topic, but the same theme.
So I don't know how many of you know about pathway construction or don't know the real background at all, but pathways are, again, very much like the core of systems biology and systems medicine. And increasingly people want to see how we can link evidence from text to pathways. And that's a very challenging problem; automatically constructing pathways is really like a Holy Grail, but we think we can do a lot towards providing lots and lots of evidence to allow people to make decisions and construct models. And actually, just to give you an example, for the mTOR pathway, people had to read 519 papers to construct this pathway. So this is a manual process till now. They identified 964 entities and about 800 reactions. So because this is manual, clearly there are lots of things that are missing; just going through the literature, there is how you first do search to find the documents and how you identify which components are important, to basically say this is a reaction which interacts with another reaction and so on. So this is where we started actually. This is work we started with Jun'ichi a few years ago. And we wrote the grant in 2006, I think. And the system I will talk about is PathText, which is still ongoing. We keep on upgrading and updating it, and it's actually linking with all sorts of different pathways. So the architecture of PathText is basically: you have various models or various parameters here, so you have interactions or reactions between those modifiers, and we use in this case I think CellDesigner, but you could use any kind of editor, an SBML model, to represent this kind of knowledge. You need to bridge that gap between the model and text. So in our case PathText links to two of the systems. That's why I explained them to you, to understand a bit about pathways. I'm not going to talk about this. So it's basically named-entity search.
But FACTA and MEDIE, especially enhanced with events, are providing the type of information that is needed to link pathways to text. How? So -- well, this is exactly one kind of snippet to see these various results. So here is the pathway. In this case pathways are independent. In this case we use Payao. Payao is a kind of interface to CellDesigner, which is a very common way of annotating pathways, among the different pathway editors. But what is important for us is to see how basically we can use the various publications from PubMed or full papers, and use our systems to integrate them with this type of model and give the evidence. So what PathText does is actually give you the evidence to update your model. And for this you also need a workbench platform to allow people to make decisions about the ranking of reactions and the ranking of documents. I'm sorry if it sounds -- I'll just try to make it as simple as possible, because it's sometimes a bit too much biological knowledge here. Anyway, this is how it looks. We can forget it. Now, remember [inaudible] is actually bridging the gap. The reason I put that is it's again based on events. So when you're looking at reactions, the reactions are events. So if you see here this information, like this one which is a protein, let's put that one here, you have this little square here; it's a reaction. That reaction is an event, and this event goes to this catabolism event, basically. So this protein vif is linked here, it actually degrades A3G, but there is also another one here, this kind of diamond, which actually induces this activity as well. So this is how you link this type of representation. In this case with CellDesigner you have a square or a diamond; in other editors you might have other types of semantic notations. This is relevant: in the end, what you're doing, what we're trying to find out, is linking events, finding events in text with various entities.
And this is how basically PathText is doing that; it is very much based on extracting events and linking them with pathways. So an example is if you're clicking, for instance, right now on this specific part, this one, you'll have about 844 from text mining. You see here you have automatic text mining, or you can do it manually. So you can do annotations as well and give them back to the system. So the automatic one goes mostly to FACTA, and it will extract this type of information enriched with events, and the curators, the biologists, will see which ones are of relevance. So what you can basically do now is start querying reactions by events. So in order to link text with pathways you need to have a kind of interface. So, for instance, for heterodimer association of these two complexes, the query is basically a protein reaction, an event, and your result is basically a binding event from MEDIE. So this is exactly the type of information the biologists get automatically from text to be able to update and reach and find the evidence in the pathways. So to do that, if I just want to show the whole architecture, the whole image of what we are doing right now: here are your pathways, your users, your biologists, and what you use here -- could be anything. We're using CellDesigner because our systems biologists use CellDesigner [inaudible]. So this is the kind of interface. But what we are doing is basically we are working on building the queries using events, we are very much working on gathering the relevance feedback from the biologists, basically curating the results, and to do that we have our toolkit, with Enju, EventMine, our systems. But the component I will talk about briefly now is a platform that allows curation. So what you need to know is, when you're extracting the information automatically and giving people this type of information, are the biologists interested in that?
Do they think it's relevant? So you need to be able to get the feedback to improve the ranking. So that is what you have in the query. Then, because it's based on machine learning, we're taking all this information and every time we're improving the system for the specific type of pathways. So in a sense this kind of closes the loop of why you need events, why you need to extract deeper information, why you need multiple arguments if you want to link the information from pathways, which is at the core of systems medicine, with text. So I don't know -- how much time do I have? Because I have quite a -- it's up to you. >> Lucy Vanderwende: You can keep going. >> Sophia Ananiadou: So a very small [inaudible] but this is important because we suggest using it for our shared task as well, so I should talk about it. So very often now people use processing components for text mining and annotation components somehow separately. So we have lots of annotation tools of different sophistication, but we also have text processing components very much based on the UIMA architecture and philosophy. So it's important to actually integrate the processing tools with annotation tools, but also to allow users to create text mining workflows which they can store, use, share, and so on. So of the two systems that we have done, one was U-Compare, which started [inaudible] team, and we expanded this into a multilingual system, which I'm not talking about at all right now. The other one I'm talking about is Argo, with [inaudible] here, and very briefly I'll say what it does. So basically it links with U-Compare as well and takes lots of processing components. It's a Web-based application. It doesn't need an installation. You can access it through a Web browser. And it's very interactive.
So this is actually what curators can use to see the annotations and to decide from the text mining results if the annotations are okay, choose, and then basically, you know, feed it back to the system. So very briefly, this is basically the whole thing; this is for developers, for workflow designers, and for annotators. So for developers -- this is linked with U-Compare -- you can actually have all sorts of search engines, named-entity recognizers, target editors, XML editors and so on. So the workflows allow you to design, if you want to, for instance, extract named entities, include targets, species [inaudible] and end up with a named-entity recognizer, and actually compare the various types of workflows as well. You can actually process all the workflows remotely without people looking at this, and then the annotation editor allows you to look at the results and make changes if you don't like them. So this is very important for curation basically. And it's using various Web services. On the Web site you can have a better look; I'll tell you later. So here is the workflow. You can have a risk component. You can add your own documents. You can actually allow basically a link to other people's documents. And you have -- you can also store workflows, so at least the current and past workflows. And here is actually the panel where you design; you just drag and click the components. So you pick up various components and you just select: the workflow of KLEIO, which is a search system, a species tagger, which itself has another workflow, annotations, and various CAS writers. And this is how they look basically, which is manually explained. So basically what this is, it's an example of a workflow -- you can actually store these with various CAS writers, and you can even have different formats, plaintext, XML, and so on.
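The workflow idea described above, a chain of processing components that each add annotations to a shared document, followed by writers that serialize the result in different formats, can be sketched as follows. This is a toy illustration and does not use the real Argo, U-Compare, or UIMA APIs: the names `dictionary_tagger`, `run_workflow`, `write_plain`, and `write_xml` are hypothetical, the shared dictionary only loosely plays the role of a CAS, and the term lists are assumptions chosen for the example.

```python
import re
from xml.sax.saxutils import escape

def dictionary_tagger(label, terms):
    """Build a component that annotates every whole-word occurrence of the given terms."""
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(t) for t in terms))
    def component(doc):
        for match in pattern.finditer(doc["text"]):
            doc["annotations"].append(
                {"type": label, "begin": match.start(), "end": match.end()}
            )
        return doc
    return component

def run_workflow(components, text):
    """Pass one shared document through each component in order."""
    doc = {"text": text, "annotations": []}
    for component in components:
        doc = component(doc)
    return doc

def _sorted_annotations(doc):
    return sorted(doc["annotations"], key=lambda a: (a["begin"], a["end"]))

def write_plain(doc):
    """Serialize annotations as tab-separated lines: type, begin, end, covered text."""
    return "\n".join(
        "%s\t%d\t%d\t%s" % (a["type"], a["begin"], a["end"], doc["text"][a["begin"]:a["end"]])
        for a in _sorted_annotations(doc)
    )

def write_xml(doc):
    """Serialize the same annotations as a small stand-off XML document."""
    items = "".join(
        '<annotation type="%s" begin="%d" end="%d">%s</annotation>'
        % (a["type"], a["begin"], a["end"], escape(doc["text"][a["begin"]:a["end"]]))
        for a in _sorted_annotations(doc)
    )
    return "<document>%s</document>" % items

# A two-component workflow: a protein tagger followed by a species tagger.
workflow = [
    dictionary_tagger("Protein", ["vif", "A3G"]),
    dictionary_tagger("Species", ["human", "mouse"]),
]
result = run_workflow(workflow, "vif degrades A3G in human cells")
plain = write_plain(result)
xml = write_xml(result)
```

Because every component reads and writes the same structure, components can be reordered, swapped, or compared against each other without changing the writers, which is the property the talk attributes to the shared architecture.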
So you can actually create and store a simple workflow with an annotation editor. And in this case we have used -- this is [inaudible] for events, we have also used another system which actually [inaudible] sample have worked is Brat. This is for events. So we're using those two annotation environments: Argo mostly for entities and creating workflows, and Brat for events. And this is how it looks -- you can actually remove, change, add, and put various properties, and this is where you find the system. We're still developing it, but it's already at a decent stage to have a look at. And of course we are very interested in having other people contributing and sharing workflows and processing components. Another thing, just very, very briefly: it also allows you to evaluate. So if you have different components, you can just do the comparison on the basis of a reference evaluator. Right. So, now, some people are tired, me too, and I'll finish my talk with the last one, which is the very recent work, the latest enhancement of EventMine, which allows you to do extra modification of events. So as you realize, till now we are all event based [inaudible]. Meta-knowledge annotation is nothing new. People have talked about this for many, many years. Our difference is that we base all of this kind of -- if we can call it pragmatic and discourse information -- on events. So this is the main difference. So it gives you different dimensions, different types of information, all based on an event. And the important thing is it allows you basically to detect what is new knowledge from this kind of meta-knowledge, and the various types of contradictions that you have in text. It's extremely important also for applications, for search of course, but also for [inaudible] communications, because we can use citation counts, all sorts of things we can integrate. It's quite an interesting area of research, we think.
So this is actually an annotation example to show you that the same type of event, X activates expression of Y, could be presented in text in completely different ways, meaning completely different things. So even if you have an event which says something about activation of a hypothetical gene by a hypothetical protein, what else does the text want to say about that? So the first thing you can say is we found that Y activates the expression, so it's -- you know, you have a kind of knowledge type, examined. You can have this also with suggest, so it's a bit speculative, not certain; or has no effect, that's the polarity; or slightly increased, the manner; or might affect, certainty. So there are various cues around an event that tell you that this thing is perhaps not so certain, it could be negated, could be speculative, and so on. So you don't have to look at this so much. Basically this is the whole schema, with different manners, certainty, source -- if it's in this specific paper, or other people are citing that -- if it's negative, positive, if it's an investigation, an observation, a method, a fact, or other. So what all those things are telling you, basically, when combined, is new knowledge or hypothesis. To do that, we took the GENIA event corpus, which [inaudible] has done in 2008, I think, or '9, and we annotated it -- it was quite a lot of work, actually -- with meta-knowledge. So we took all the event types and we created about 56 -- well, the existing one, and we used two annotators, a biology expert and a linguist, and annotated the whole corpus with bio meta-knowledge. There was actually a very good inter-annotator agreement.
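The meta-knowledge dimensions listed above (knowledge type, certainty, polarity, manner, source) can be pictured as extra fields attached to an extracted event. The sketch below is illustrative only and is not EventMine's representation or the annotation scheme's exact value set: the class names, the default values, the L1 certainty level, and the tiny cue lexicon are assumptions, and real systems learn such cues from the annotated corpus rather than looking them up in a fixed table.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class MetaKnowledge:
    knowledge_type: str = "Other"  # e.g. Investigation, Observation, Analysis, Method, Fact, Other
    certainty: str = "L3"          # L3 = stated with full certainty; lower levels = hedged
    polarity: str = "Positive"     # Positive or Negative
    manner: str = "Neutral"        # e.g. High, Low, Neutral
    source: str = "Current"        # Current (this paper) or Other (cited work)

@dataclass(frozen=True)
class Event:
    event_type: str
    trigger: str
    theme: str
    cause: Optional[str] = None
    meta: MetaKnowledge = field(default_factory=MetaKnowledge)

# Hypothetical cue lexicon: cue word -> (dimension, value).
CUES = {
    "found": ("knowledge_type", "Observation"),
    "suggest": ("certainty", "L2"),
    "suggests": ("certainty", "L2"),
    "might": ("certainty", "L1"),
    "no": ("polarity", "Negative"),
    "not": ("polarity", "Negative"),
    "slightly": ("manner", "Low"),
}

def tag_meta(sentence):
    """Derive meta-knowledge values from cue words; unseen dimensions keep their defaults."""
    values = {}
    for token in re.findall(r"[a-z]+", sentence.lower()):
        if token in CUES:
            dimension, value = CUES[token]
            values[dimension] = value
    return MetaKnowledge(**values)

def annotate(sentence):
    """Attach meta-knowledge to the running example event 'X activates expression of Y'."""
    return Event("Positive_regulation", "activates", theme="expression of Y",
                 cause="X", meta=tag_meta(sentence))

observed = annotate("We found that X activates expression of Y")
hedged = annotate("These results suggest that X activates expression of Y")
negated = annotate("X has no effect on expression of Y")
```

The same underlying event ends up with three different readings here: an observation stated with full certainty, a hedged claim, and a negated one. That difference is what lets a downstream system separate new knowledge from hypotheses and spot contradictions.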
So this is sort of the corpus -- you can see actually the certainty levels, and it tells you a lot about how people write in text. So you can have different types of knowledge types, investigation and methods, observations, and certainty, L3, but also lower certainty; the 6 percent and the 2.1 percent we think are very interesting. So you have facts which are reported with not so much certainty. So that might be quite interesting if you want to construct pathways on the basis of not-so-certain facts. So you can put weights, for instance, on polarities and manners and so on. So how we integrated all that now -- this is the well-known EventMine. We added meta-knowledge to EventMine. So we have the pipeline. In the end you have meta-knowledge annotation, and what EventMine does is it extracts events and also tags them if they're negative, if they have -- if it's analysis, if it's high, and so on. So basically you have this type of extra information here. So, as I said, the difference -- so you have knowledge type analysis, certainty L2, not negative, amount high, and source current, from this paper. So basically what you do, you're actually going a step further to provide more analytics on events, which can be used again for search and for pathways, as I said before, and for output -- for instance, for [inaudible] communications. So I'll just go very quickly on that. So here are some results of EventMine, which we have used on the annotated corpus, and also on the shared task that we had. You have different performances on knowledge types, certainty, polarity here. And for negation and speculation, as you realize, there's a lot of work to be done yet. So we're really struggling around the 35 percent and so on. EventMine is actually doing quite well with all the various clues -- this was GENIA applied on the shared task.
But basically we are all about, you know, some people -- it performs actually quite well across negation and speculation. The total is overall better. And, as I said, the performance is more stable across various types of hedging. As you can see, in some cases other people have lower negation but much better speculation, so I think it's quite important to have perhaps more stable results across documents and also across the various types of information, like hedging. And this is actually on abstracts and full papers. This was also trained on abstracts, so we should do it again with full papers. So you see again we are reaching about the 37 percent, which is basically the top performance. This is -- I wanted to stop basically with that [inaudible] with this, in that I think this is a very important area of research. It's hugely important if you want to take types of information -- what is really certain, what is negated, what is contradictory -- and see how we can integrate them. So we need much more work on that, the community needs much more work on that, to embed it into our existing systems. I'll finish, because I'm really tired now, by telling you about the future, which again is a very important project funded by the UK government, where we work on full papers, and not only abstracts, with the advent of open access. Out of the 2 million papers 12 percent are open access. What we have produced now is what we call the EvidenceFinder, which you'll find here. And we're going to embed events and meta-knowledge in it now, and that's why I finish with this. What it does: you have a query like EGFR and breast cancer. Because our users are medics and they don't want to even think about templates and subjects -- if you say subject and object they will never use the system -- what we do is generate the questions for them.
So based on -- so we used [inaudible] parsing and so on for extracting facts. But once -- oops -- you put in a query, the system creates a number of questions, which we know of course will be answered because they exist in our stored parse results, and then the users look at the extracts, and if they like the answers, they'll click on them. So for this type of system now, they want to add -- they're very interested in the meta-knowledge, which is a challenge for us, because we are in the [inaudible] so they're really -- which is actually kind of the future, people are very interested in the kind of hedging in medicine: what is speculative, what is contradictory, who said it. Again, of course, there are other components in UKPMC, so this is the text mining part. And so I'll just go very quickly. And then if you go there, you just go through various entities as well, which are highlighted. But this type of system -- I'll finish my talk now -- is going to be enriched for the next couple of years with events of different types, [inaudible] with EventMine, and also meta-knowledge. So because we have 2 million full papers, this is going to be a kind of full-scale analysis of full papers, a search system based on full papers and abstracts, on events and on meta-knowledge. Thank you for your patience. I hope I didn't tire you too much. For all the things that I said today, all the services are on our Web site under services. All the tools, EventMine, everything is on our Web site, and all the publications. So any misrepresentation is utterly mine or there. The people who actually have been extremely important -- I don't mention Genise [phonetic] because he's been -- well, he's still our scientific brains. But the people who are currently involved in the center are these, and of course I'm extremely indebted to all their hard work. And now I finish with what we're going to talk about after.
So I want to introduce to you now, well, our suggestion for a cancer genomics BioNLP shared task in 2013, which we will talk about later during the break. And we want to work on abstracts and full papers and can select -- basically this is a follow-up of [inaudible] events of course. It's a follow-up also of the angiogenesis corpus, and also the [inaudible] corpus which we made available to the community, and we'd like to extend it to new areas, to work with people, oncologists in the area of cancer, and add perhaps more processes which would be of interest, to have a shared task on cancer genomics. Now, we would very much like to make our [inaudible] platforms available for the community to use and also to prepare the shared task. And we're calling for people to work with us, and this is where I stop. Thank you very much. [applause]