>> Michael Gamon: Thanks, thanks for coming. I think we should get started, even though we're only five minutes late. But we have a little bit of a different format today. We have three talks. And they're all joint projects by University of Washington interns and Microsoft Research mentors. So the interns are going to present. And it's pretty packed. I mean we're going to try to stick to 20 minutes each talk with five minutes questions, which should still give us a little bit of time afterwards. But I mean I know what these guys have been working on, and there's plenty more to talk about than just 20 minutes. So we'll see how we can stick with the timeframe. For those of you who don't know the event, this is like the 19th in a series. So the next one is the big one, the 20th anniversary. And it's basically just a place to connect research from the UW with research from MSR and sort of you know make personal connections, meet people and sort of socialize a little bit and keep the connection open between one side of the lake and the other. Normally we have two talks, one each from UW and MSR, but again, you know, this time we're experimenting a little bit, and depending on how it goes, you know, if anybody has suggestions about, you know, different kinds of formats that we could use, that would be also very welcome. So for the three talks, actually only one of the mentors could make it. The other two are not here today. One is actually in Ottawa -- Colin Cherry accepted a job at the National Research Council up there. So he's not in our group anymore. The three interns are Stanley Kok, Hoifung and Alan Ritter back there. The three talks -- so Stanley is going to start with hitting the right paraphrases in good time. It's joint work with Chris Brockett. Then the second one is a really nice title, toward the Turing test, conversation modeling using Twitter. That's Alan's talk and that's joint work with Colin Cherry. 
And then Hoifung is going to talk about joint inference for knowledge extraction from biomedical literature, and that was joint work with Lucy. So again, you know, we'll try and stick to 20 minutes and then five minutes questions. >> Stanley Kok: I'll try my best. Hi, I'm Stanley Kok, from the University of Washington. My talk is on hitting the right paraphrases in good time. So this is the overview of my presentation. I'll begin with some motivation, then I'll cover some background, then I will go into detail about our system, the hitting time paraphraser. Then I'll describe some experiments and finally I'll end with some future work. So the goal in this project is to build a paraphrase system, a system that takes as input a query phrase such as is on good terms with, and outputs a list of paraphrases such as is friendly with, is a friend of. Several applications can benefit from a paraphrase system, such as query expansion, document summarization, natural language generation and so on. So in the application of query expansion, a search engine could receive a query phrase from a user. The search engine provides the query phrase to a paraphrase system and obtains a list of paraphrases, which is then returned to the search engine. The search engine could then retrieve documents that are relevant not only to the original phrase but also relevant to the paraphrases, thereby improving the quality of its results. I'd like to point out that our system, as well as the systems that we compare against, uses an additional resource in the form of bilingual parallel corpora. So what's a bilingual parallel corpus? In such a corpus we have sentences from two languages. Over here, sentences in English and German. So these sentences are aligned. And the phrases in the sentences are also aligned. For example, under control is aligned with unter kontrolle. I'm not sure I'm pronouncing the German right, so just bear with me. And over here, in check is also aligned with unter kontrolle. 
Now, from those sentences, from these phrases, we can count the number of times that a phrase occurs as well as the number of times an English phrase is aligned with a German phrase. This allows us to obtain phrase tables. So for example, if under control appears four times in this corpus and three of the four times under control is aligned with unter kontrolle, then the probability of the German phrase given the English phrase would be three quarters or .75. Now, these phrase tables are used by our system as well as the systems that we compare against. So in 2005, Bannard and Callison-Burch introduced the BCB system. I named the system after the authors. So this is a paraphrase system and it works as follows: it computes the probability that E2 is a paraphrase of E1 by summing, over all foreign phrases, German, say, the product of the probability of E2 given G and the probability of G given E1. A straightforward approach. And if you have multiple corpora, you simply sum over all the corpora. So in 2008 Callison-Burch improved upon this system by cleverly introducing syntax. So if you look at the two equations used by the systems, you can see that they are very similar. What SBP does is it introduces syntax here. It constrains the paraphrase E2 to have the same syntax as the original phrase E1. And this syntax information is obtained from parse trees. The interesting thing to note is that you not only use the leaves of the parse trees but you also use the subtrees inside the parse tree. Now, this is a very general, high-level description of SBP. For details I refer you to the paper. Now, we can take a graphical view of these two approaches by considering the phrases as nodes. So here I've got all the English phrases, German phrases, French phrases. And an edge exists between two nodes if there's a corresponding entry in the phrase table. 
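The pivot computation just described can be sketched in a few lines of Python. This is a minimal illustration, not the actual system: the phrase-table probabilities below are made up for the sake of the example, and only the `under control` / `unter kontrolle` pair comes from the talk.

```python
# Pivot paraphrase score (Bannard & Callison-Burch style):
#   p(e2 | e1) = sum over foreign phrases g of p(e2 | g) * p(g | e1)
# All probabilities below are invented toy values for illustration.

# p(g | e): English phrase -> {foreign phrase: probability}
e2f = {
    "under control": {"unter kontrolle": 0.75, "im griff": 0.25},
}
# p(e | g): foreign phrase -> {English phrase: probability}
f2e = {
    "unter kontrolle": {"under control": 0.6, "in check": 0.4},
    "im griff": {"under control": 0.7, "in hand": 0.3},
}

def pivot_paraphrase_prob(e1, e2):
    """Sum p(e2|g) * p(g|e1) over all foreign phrases g aligned to e1."""
    return sum(p_g * f2e[g].get(e2, 0.0) for g, p_g in e2f[e1].items())

print(pivot_paraphrase_prob("under control", "in check"))  # 0.75 * 0.4, about 0.3
```

With multiple parallel corpora the same sum simply runs over every corpus's phrase tables, as the talk notes.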
Now, once we take the graphical view, we can see that what the SBP approach is actually doing is to sum up the probabilities along paths of length two. So it's summing up the probabilities on this path and this path to get the probability of E2 being a paraphrase of E1. Now, once we see things graphically we can think of several ways that we can improve upon this system. First, there's no reason why we should restrict ourselves to paths of length two. If we consider paths of length four, the E3 over here could very well be a good paraphrase of E1 by tracing this path of length four. And E3 could be a good paraphrase of E1 if the probabilities along this path are fairly high. Second, note that this graph is a bipartite graph with English nodes on one side, foreign language nodes on the other. There's no reason for us to restrict ourselves to bipartite graphs; we can use general graphs. Lastly, we need not restrict nodes to representing phrases. In a graph, nodes can represent anything. Specifically, we could have nodes that represent domain knowledge. So our system leverages these three points to improve upon the state of the art. And it does so using the notion of random walks and hitting times. So let me quickly cover the background on random walks and hitting times. Now, the term random walk is self-explanatory. It's just a walk done by traversing randomly in a graph. I'm just going to illustrate this using this simple graph over here. Say we start at node one. Next we randomly pick a neighbor of node one according to the transition probabilities. So there's a probability of moving from node one to two, a probability from one to three and one to four, and all these probabilities sum to one. So with these probabilities we randomly pick one; say you pick two, then we move to two and we repeat the process. So in the next step we could have moved to node three. So a simple idea, random walks. >>: You said you move to a neighbor. 
How did you get from three to four? >> Stanley Kok: So three to four -- three and four, they are not neighbor nodes, because there is no edge between them, so we first move from three to one and then one to four. Yeah. Okay. So what's the hitting time? The hitting time from node I to J is the expected number of steps, starting from node I, before node J is reached or hit for the first time. The intuition here is that the smaller the hitting time, the closer node J is to I. That is, the more similar node J is to I. In 2007 Sarkar and Moore introduced the notion of truncated hitting time, where random walks are limited to a fixed number of steps, say T steps. In 2008, they improved upon their results by showing that you can compute the hitting time efficiently and with high probability by sampling random walks. So each sample is a random walk. Now, let me show you one such sample. Again we start at node one. The box over here will show you the order in which we traverse the nodes. And I limit the random walk to T steps; here T is five steps. So let's say that next we move on to node four and to five, then to four and six and finally to five. So five steps. And these are the hitting times. The hitting time from node one to one is zero, because we start at one. One to four is one because it took just one step to go from one to four. One to six is four steps because it took one, two, three, four -- four steps to reach six. And the hitting times for nodes two and three are set to the maximum of five because they are not reached at all. So this is how Sarkar and Moore defined truncated hitting time. Now, how do we compute the hitting time from all these samples? So as is the usual case, we define a random variable XIJ. And XIJ is the first time node J is hit starting from node I. And if a random walk never hits node J, then XIJ takes on the value of T. Then the hitting time HIJ is just the expected value of this random variable. 
And we compute this expected value by taking the sample mean, a straightforward approach. Now, Sarkar and Moore showed that with high probability our sample mean deviates from the true hitting time by only a very small value if we are able to draw at least this number of samples. So N is the number of nodes in the graph. So that's the background. Let's move on to our system, the hitting time paraphraser. Just a quick recap. Our system HTP takes as input a query phrase and phrase tables -- there could be more than one. These phrase tables could be alignments of English to foreign languages as well as of one foreign language to another foreign language -- and it outputs a list of paraphrases. And it ranks the paraphrases in increasing order of hitting times. As mentioned earlier, we could create a graph from these phrase tables. We have nodes representing phrases and edges existing between nodes if there's a corresponding entry in the phrase table. Now, this is fine if we have small phrase tables. However, our phrase tables are fairly huge. If you're going to do this, you're going to end up with a very, very big graph, which is not tractable. So what we do is to start from the query node and then perform breadth-first search up to a depth of D, up to a maximum of some number of nodes. In our experiments we used a depth of 6 and a maximum of 50,000 nodes. Now I'd like to zoom into the nodes at the periphery of this graph. So you might be asking the question: how do we handle the transition probabilities of edges that go outside of this graph? Those edges that cross the periphery. We use a straightforward approach, introducing a place-holder node that collects those edges together and sums up the probabilities. And this place-holder node has a transition probability of .5 to the blue node and a self-transition probability of .5. Now, this is a heuristic approach that works pretty well, as you'll see in our experiments. 
So now we are at the step where we have created the graph from breadth-first search. Now we can proceed to draw the samples, the random walks -- we run M truncated random walks to estimate the truncated hitting time of every node in the graph. We limit the random walks to 10 steps and we drew one million samples. So with these numbers, these values of T and M, we can be 95 percent sure that our estimate of the hitting times is no more than .03 away from the true hitting times. So it's a pretty good estimate. Once we have drawn all the samples and estimated the hitting times, we can rank these nodes. And we can find those nodes with hitting times of T. Now, these are nodes which are pretty far away from the query node and therefore they're not very similar to the query node, and we can just throw them away. This helps to prune the size of the graph even further. Now, once we have this pruned graph, we can proceed to add more nodes to this graph, to supplement the knowledge. We do so by adding what I call feature nodes. There are three kinds of feature nodes. The first is the Ngram node. So over here I show a snippet of the pruned graph. Again, the brown nodes are the English phrase nodes, and you have the foreign language nodes over here, blue and purple. To avoid clutter, I'm just going to remove the foreign language nodes, but just be reminded they are still there. So the Ngram nodes are over here. We use one- to four-grams. There's an edge between reach and reach the objective because this phrase contains this unigram. And there's an edge between achieve the and achieve the aim because achieve the aim contains this bigram. We introduce these Ngram nodes so as to capture the intuition that if two phrases have lots of Ngrams in common, then they tend to be closer together, tend to be more similar. 
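The sampling procedure described above -- M truncated random walks of at most T steps, with unreached nodes charged the maximum T -- can be sketched as follows. The graph, its transition probabilities, and the small T and M are invented for illustration; the actual system used T = 10 and M = one million on the BFS-extracted graph.

```python
import random

# Monte Carlo estimate of truncated hitting times, in the spirit of
# Sarkar and Moore as described in the talk. Toy graph, toy parameters.

graph = {  # node -> {neighbor: transition probability}
    1: {2: 0.4, 3: 0.3, 4: 0.3},
    2: {1: 0.5, 3: 0.5},
    3: {1: 1.0},
    4: {1: 0.6, 5: 0.4},
    5: {4: 1.0},
}

def sample_walk(start, T, rng):
    """One truncated random walk; returns the first-hit step of each node reached."""
    first_hit = {start: 0}
    node = start
    for step in range(1, T + 1):
        nbrs, probs = zip(*graph[node].items())
        node = rng.choices(nbrs, weights=probs)[0]
        first_hit.setdefault(node, step)  # record only the FIRST visit
    return first_hit

def truncated_hitting_times(start, T=10, M=10000, seed=0):
    """h(start, j): mean first-hit time over M walks; nodes never hit count as T."""
    rng = random.Random(seed)
    totals = {j: 0 for j in graph}
    for _ in range(M):
        hits = sample_walk(start, T, rng)
        for j in graph:
            totals[j] += hits.get(j, T)
    return {j: totals[j] / M for j in graph}

h = truncated_hitting_times(1)
print(h[1])  # 0.0: the walk starts at node 1
```

Note how node 5, reachable only through node 4, always gets a larger estimated hitting time than node 4 -- exactly the "smaller hitting time means more similar" intuition the ranking relies on.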
The next kind of nodes are syntax nodes. I put syntax in quotes because they don't really correspond to syntax as you'd obtain from a parse tree. What we do is we classify certain words -- things like articles, things like interrogatives -- together into classes, and we mark whether each phrase begins with a word from such a class. For example, over here the objective is linked to starts-with-article because it starts with the word the, which is an article. Likewise, the aim is has this link as well, because the is an article. Whose goal is and what goal is are linked to the node starts-with-interrogative because whose and what belong to the class of interrogatives. So again these nodes capture the idea that if two phrases start and end with the same class of words, then they tend to be pretty similar -- they are good candidates for paraphrases. Lastly, we have these not-substring-of nodes. Why do we have these? We noticed in our experimental results that lots of bad paraphrases were actually paraphrases which are substrings or superstrings of one another. We found that things like reach the were frequently returned as a paraphrase of reach the objective just because it's a substring of the phrase. Now, clearly this is a bad paraphrase. So what we do is include a node here called not-substring-or-superstring-of. This node will have a link to the query phrase, assuming that this is the query phrase, and all the other phrases will be linked to this node if they are not a substring or superstring of the query phrase. So such nodes will be closer to the query phrase. So note there are four kinds of edges emanating from an English language phrase. I mentioned that we could also have similar features for foreign language nodes. But in our experiments we did not do so. We could easily add those nodes as well. 
But anyway, for each English phrase node, there are four kinds of edges: edges that go to the regular foreign language phrase nodes and to the three types of feature nodes. Now the question is, how do we divide the transition probabilities among the edges emanating from these English nodes? Because P1, P2, P3 and P4 have to sum to one. So we do so by tuning on a small set of data. We found that P1 set to 0.1, P2 set to 0.1, and P3 and P4 set to 0.4 each tends to work pretty well. Some of you might be wondering, why is P1 set so low? Now, note that this information has already been used when we first construct the graph and prune it down. So we found that we need not give it too much [inaudible] anymore because it's been used to pre-prune the graph to a good set of candidates. So now we are at the stage where we have added feature nodes. Once you have added the feature nodes, you can proceed again to draw M truncated random walks, estimate the hitting times of all the nodes, rank nodes in increasing order of hitting times and return them. The returned nodes will be the paraphrases of the query phrase. How well does our system perform? Here come the experiments. We used the Europarl dataset. The Europarl dataset is a dataset of minutes of European Union proceedings. These minutes are written in 11 languages, translations of one another. We used six of these languages: English, Danish, German, Spanish, Finnish and Dutch. There are about a million sentences per language. And English sentences are aligned with the foreign languages. The English-foreign phrase alignments are done by giza++, and this was done by Callison-Burch. We used that data, available from the web. But they do not have the foreign-foreign phrase alignments. So these were done using the MSR NLP aligner by Chris Quirk [phonetic]. So the system that we compare against is the SBP system. Again, this system's available on the web. It's a Java implementation. 
And we also did a study investigating how HTP does without the feature nodes and how HTP does if it just works on a bipartite graph, rather than the general graph. We also used a NIST dataset that was originally used for machine translation. It has four English sentences per Chinese sentence and all together about 33,000 English translations. What we did was to randomly sample 100 English phrases from the one- to four-grams in both the NIST and Europarl datasets. So these phrases appear in both of these datasets. We excluded things like stop words, numbers, and phrases containing periods and commas. Now, for each of these randomly sampled phrases, we randomly selected a sentence from the NIST dataset that contains the phrase. Then we substituted the top one to top 10 paraphrases for that phrase. So this means that we evaluate the correctness of the paraphrase in the context of the sentence. Now, we did manual evaluation. We give each paraphrase one of three scores. A score of zero means that the paraphrase is clearly wrong: it's grammatically incorrect or does not preserve meaning. A score of two means it's totally correct: grammatically correct and meaning is preserved. A score of one means somewhere in between, with minor grammatical errors, things like subject-verb disagreement, the wrong tense, et cetera. The meaning's largely preserved but not completely. So we deem a paraphrase to be correct if it is given a score of one or two, and wrong if it is given a score of zero. There were two evaluators, and the inter-annotator agreement as measured by kappa is 0.62. Now, this corresponds to substantial agreement between the two evaluators. So let's look at the comparison between HTP and SBP in detail. First I'd like to point out that SBP only returned paraphrases for 49 of the query phrases. So let's focus on these 49. Let's focus on the top one, the top paraphrase as returned by the system. 
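The kappa statistic quoted above (0.62, substantial agreement) is Cohen's kappa: observed agreement corrected for the agreement two annotators would reach by chance. A quick sketch of the computation, using hypothetical 0/1/2 labels rather than the actual evaluation data:

```python
# Cohen's kappa for two annotators: kappa = (p_o - p_e) / (1 - p_e),
# where p_o is observed agreement and p_e is chance agreement from the
# annotators' label distributions. Labels below are invented examples.

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of labels."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    labels = set(a) | set(b)
    # chance agreement: product of each annotator's marginal label rates
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

ann1 = [2, 2, 1, 0, 2, 1, 0, 0, 2, 1]  # hypothetical scores, annotator 1
ann2 = [2, 1, 1, 0, 2, 1, 0, 1, 2, 1]  # hypothetical scores, annotator 2
print(round(cohens_kappa(ann1, ann2), 2))
```

On the conventional scale, values between 0.61 and 0.80 are usually read as "substantial" agreement, which matches how the 0.62 is characterized in the talk.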
HTP got .71 of these top one paraphrases correct and SBP .53. So suppose we consider the top K, where K is the number of paraphrases returned by SBP. For example, if SBP only returned three paraphrases, then over here we look at the top three for HTP. And the performance is .56 versus .39. So HTP is still doing better. Now, how did HTP perform if you look at the top 10 paraphrases for the HTP system? We got a score of .54. So let's change our focus to the 51 queries for which SBP gave us no results. How does HTP perform? For the top one paraphrase we got a score of .50, lower than what we got for the 49 top ones. And overall for the top one we got a score of .61. The corresponding number for SBP is .53. Now, let's look at the top 10 paraphrases for these bottom 51 queries. We got a score of .32. And overall, looking at the top 10 paraphrases for all queries, the system got a score of .43. The number of correct paraphrases is 420. For SBP the corresponding score is .39 and 145. So there are lots of numbers going around here. What you should note is that not only are we returning more correct paraphrases, 420 versus 145, we are doing so with higher precision, .43 versus .39. Okay. I should hurry up. Okay. So how does HTP compare to HTP-no-features? That means suppose you don't use the feature nodes. This is a simple comparison because both systems return the same number of paraphrases. If you look at the top one paraphrase, using features is better. If we look at the top 10 paraphrases, again using features is better. So it's pretty clear that using features we do better. Now, how about HTP on a general graph versus HTP just restricted to a bipartite graph? So the bipartite graph does not return paraphrases for five of the queries. So let's first look at the queries for which HTP-bipartite did return paraphrases. Looking at the top one, HTP got .62, and HTP-bipartite .58. 
Looking at the top K, where K is the number of paraphrases returned by HTP-bipartite, we did better, .46 versus .41. Now, let's zoom into the five queries for which HTP-bipartite returned no phrases. Now, these are the harder ones, so for the top one we got an accuracy of .41 -- .4. And overall for the top one we got .61, and the corresponding number is .58 for HTP-bipartite. If we look at the top K, this figure is .2. And overall, over the top 10 paraphrases, our system got .43. As before, 420 correct paraphrases. The corresponding numbers for bipartite are .41 and 361 correct. Again, not only do we get more correct paraphrases, we do so at higher precision. How about timing? So HTP took 48 seconds per phrase, and HTP-no-features and the bipartite version were around the same ballpark. It's faster because it doesn't use as many features. But SBP took about 468 seconds per phrase. Now, when you look at these times it's good to remember that the HTP systems are implemented in C# and SBP in Java, and they are really using different kinds of data structures. So just bear this in mind. So, future work. For future work, I'd like to apply HTP to foreign language paraphrases rather than just English ones. I'd like to evaluate HTP's impact on applications; for example, I want to see whether we can improve the performance of resource-sparse machine translation systems by putting paraphrases into the phrase tables used by such systems. I'd like to add more features, even features used by SBP. I'd like to add full syntactic trees to see whether they boost the performance of our system. So in conclusion, I presented to you a paraphrase system based on random walks. It uses the intuition that good paraphrases have smaller hitting times. It's able to work on a general graph, make use of paths of length larger than two, and it makes it easy to incorporate domain knowledge. And from our experiments we see that HTP outperforms the state of the art. [applause]. 
>>: So if I understood right, you the [inaudible] of your work [inaudible] then you added features then did another random walk. >> Stanley Kok: Right. >>: So why didn't you add the features right at the beginning? >> Stanley Kok: Because [inaudible]. >>: Okay. So it's just an efficiency point? >> Stanley Kok: Yes. >>: [inaudible] your proposal will [inaudible] languages? >> Stanley Kok: Indo-European languages. >>: These are Indo-European languages. >> Stanley Kok: But these aren't all Indo-European languages. >>: Finnish was in there somewhere. >>: I'm curious how much the four [inaudible] of English. So when you translate a language, would it encourage you to try something like Finnish early on, because it's got much more of a [inaudible] complexity and [inaudible] that you're not seeing in English? >> Stanley Kok: The reason I put Finnish in there is that when we looked at languages, I chose them such that there's a good spread of languages, rather than languages that are really similar. Finnish is quite different from [inaudible]. >>: English and German and Dutch are all very similar, so it's good to have Finnish. But when you said you wanted to try this on foreign languages instead of just English, Finnish would be a good place to start. But Finnish especially is [inaudible]. >>: So your results were based on whether your paraphrases got a score of one or two, so it's whether they are correct or they're acceptable. Do you -- >> Stanley Kok: Correct or totally wrong. Yeah, that's right. Zero points. >>: But you counted it as a positive paraphrase if it was either totally correct, or if it was, you know, [inaudible] mostly okay. Do you roughly remember what the numbers were if you just looked at the ones that were completely correct? >> Stanley Kok: I'm afraid not. But I definitely could put that up. 
>>: The criteria for really close were things like if we inserted -- we didn't do any modification, so if it was an interesting time and the substitution was wonderful time for interesting time, then we still have [inaudible] -- so the sort of thing that would be corrected fairly easily in a word grammar check. So in order to say we're getting very close, we might have to do some manipulation here. >>: Specifically because of that, because you want to include that in a phrase table, so if you're getting .6 or something, it still means that one-third of what you're putting in a phrase table is noisy, right? So you would only want to put in -- you want to add into your phrase table paraphrases that are actually correct, right, or as correct as possible. You want to try to minimize how much work you have to do. >>: Right. The issue is, when you have a phrase table, where are you applying it? The particular instance that we had at that particular point might be had an interesting time with the -- we can't predict what we're going to have -- instead of had an interesting time we might have a substitution there: thank you for the interesting time, thank you for the wonderful time. You might have others that may not necessarily be [inaudible]. We don't know what the substitution direction is going to be. So if we put them into the phrase tables, in the particular environment where we tested them they may be close but just a little off. >> Stanley Kok: So one way to address that issue of having the one-third noise would be to use the hitting time metric. Just say that, well, if your hitting time is smaller than this, then we keep it; otherwise we throw it away, otherwise it's bad. So that would allow us to set some threshold with which we could cut out all the so-called bad paraphrases. 
>>: Well, the fact that [inaudible] two-thirds of what you decide is a paraphrase -- two-thirds are good using this straightforward method. >> Stanley Kok: So this -- I did not have a threshold there. I just returned, as I said, the top one -- looking at the top one. Yeah. >>: So how do you decide when [inaudible] and what's the threshold? >> Stanley Kok: If you're going to use the hitting time threshold to choose, then we have some -- okay. One way is obviously to do some manual evaluation to see whether it is a good threshold. The second would be a better one, where we have an application to tell us. >>: [inaudible] your experiments. >> Stanley Kok: Over here these are the thresholds. I said just look at the top one, the top one score, and up to the top 10 score. >>: Okay. But some of your [inaudible] have more than 10 -- >> Stanley Kok: More than 10, but we only consider up to the top 10. Well, thank you very much. [applause]. >> Michael Gamon: Okay. So from large graphs we're going to really, really small text now, to the world of what is it, 140 characters or less. But many of them. And we're going to be introduced to the Turing test, which is something really new. So Alan is going to take it. >> Alan Ritter: Okay. Thanks, Michael. Okay. So I'm going to talk about modeling conversations on Twitter. And this is joint work that I did this summer with Colin Cherry and Bill Dolan. Okay. So first some motivation. So why would we want to model the way that people have conversations? So I think that the main motivation for this is if we want to build conversational agents that are able to go sort of beyond just scripted dialogues and kind of, you know, say things more impromptu and have more of a conversation with the user. And also if we want to build chat-bots that are able to do more than just parrot back what you said to them in a slightly differently phrased way. 
And some other possible applications for this include sort of like social network analysis, better web search for conversational text, things like forums and blogs and whatnot. And question-answer pair extraction. Okay. So what does Twitter have to do with modeling conversation? So most of the posts on Twitter look something like this. It's just someone broadcasting information about their life to all their followers, and probably nobody cares. So this is sort of not conversational, but somewhere between 10 to 20 percent of the posts on Twitter are actually in reply to another post. And these form sort of short conversations. So I've just shown a short example here. So this user said, you know, I'm going to the beach this weekend, I'll be there until Tuesday, life is good. Someone else responded to them saying enjoy the beach, hope you have great weather. And then they responded by saying thank you. And that ends the conversation. And this is sort of a very typical kind of conversation that we find on Twitter. So we gathered a whole bunch of conversations from Twitter using their public API. And we did this basically by watching the public timeline, which gives us 20 randomly selected Twitter posts per minute, and we use that just to get a random seed of Twitter users. And from there we can kind of crawl all the users that they're following and crawl their followers and so on, and basically query to get all the posts of each of the users that we have. And so if any of those posts are in reply to another post, then we can kind of follow that back to collect the conversation. And so we -- you know, we got a lot of conversations. We got about a million conversations of length two, 200,000 of length three, 100,000 of length four and so on. And you can see that the frequency seems to drop off very quickly with the length of the conversation. 
And in fact, if you plot frequency versus length on a log-log scale, it looks pretty much linear, and this sort of indicates that it's likely to be a power law, which I think is kind of interesting. So our basic hypothesis is that there's some kind of latent structure behind how people have conversations. So basically, you know, each post can be classified into a class which describes its purpose in the conversation. So you might see like an initial status post followed by a comment and then sort of followed by a thank you. And this is sort of the kind of thing we want to model. And so further, we're saying that there's sort of a Markov property in that each type of post depends only on the previous one and not on anything else. And then the words in each sentence depend on the type of post, or the dialogue act I guess, as I'm going to call it. So in an initial status post the word I is going to have high probability, and then we're going to see a lot of verbs like going because the person is sort of describing what they're doing, whereas in a comment you might see you with very high probability as opposed to I, and then words like enjoy and hope and so on. But then there's also these other words that sort of don't have anything to do with the dialogue act but are just sort of about the topic of the conversation. And so in this example we see words like beach, weekend, Tuesday, weather, and so on, and so these don't indicate the dialogue act but they're just kind of sprinkled throughout. And they're all sort of semantically related to each other. Okay. So in order to model these dialogue acts and the transitions, we looked at these content models from the summarization literature, and they are used specifically for multi-document summarization on news articles. 
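The log-log linearity check just mentioned amounts to fitting a straight line to (log length, log frequency): a power law freq ∝ length^a gives a line with slope a. A rough sketch using the approximate counts quoted in the talk (one million, 200,000, 100,000); the fitted exponent here is purely illustrative of the method, not a reported result.

```python
import math

# Least-squares line fit in log-log space to check the power-law shape
# described in the talk. Counts are the talk's approximate figures.

lengths = [2, 3, 4]
freqs = [1_000_000, 200_000, 100_000]

xs = [math.log(l) for l in lengths]
ys = [math.log(f) for f in freqs]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# A strongly negative slope reflects the rapid drop-off in conversation
# length; near-linearity of the points is what suggests a power law.
print(round(slope, 2))
```

With real data one would fit over many more lengths and check the residuals before claiming a power law; three points can only illustrate the idea.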
And so basically they were looking at very specific genres of news articles, for example articles all about earthquakes, and they found that, you know, typically people will sort of in the first sentence describe the location and the magnitude of the quake, and then maybe in the second sentence they'll tend to describe how many people were killed or something like this. So they have a very predictable structure. And they model that using hidden Markov models, basically, in which the states emit whole sentences, so they have a language model for each state that's sort of the emission model for the HMM. And then they learn the parameters using Viterbi-EM. Okay. And so here I've just shown a graphical model representation of what they did. And so this is basically just a Bayesian network where each dialogue act depends on the previous one and then it emits a bunch of words. And these boxes are the sort of plate notation, which represents replication. So each act is generating a bunch of words independently. So you can read this box as like a sentence, basically. Okay. So this looks really nice for what we want to do in terms of modeling dialogue acts in Twitter conversations, because it models this transition between latent structure, which corresponds I think very well to the idea of what dialogue acts are. And in addition it's totally unsupervised. And so we don't need any labeled training data in order to be able to do this, which is really attractive. There are a couple problems with this, however. If you just apply this naively without doing any kind of abstraction, it tends to group together sentences that belong to the same topic instead of the same dialogue act. And so in order to deal with this problem, Barzilay and Lee masked out all the named entities in the text.
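The generative story behind this kind of content model can be sketched as follows: a Markov chain over latent dialogue acts, where each act emits words from its own unigram language model. The act names, transition probabilities, and vocabularies here are all invented for illustration; in the real model they are learned from data, not written by hand.

```python
import random

rng = random.Random(0)

# Invented transition structure over dialogue acts (rows sum to 1).
TRANSITIONS = {
    "START":    {"status": 1.0},
    "status":   {"comment": 0.7, "question": 0.3},
    "question": {"comment": 1.0},
    "comment":  {"thanks": 0.6, "END": 0.4},
    "thanks":   {"END": 1.0},
}
# Invented per-act unigram "language models" (uniform over tiny vocabularies).
EMISSIONS = {
    "status":   ["i", "going", "today"],
    "question": ["what", "you", "?"],
    "comment":  ["you", "enjoy", "hope"],
    "thanks":   ["thank", "you", "!"],
}

def sample(dist):
    """Draw a key from a dict of outcome -> probability."""
    r, acc = rng.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point shortfall

def generate_conversation(words_per_post=3):
    """Run the Markov chain from START to END, emitting one post per act."""
    acts, posts = [], []
    act = sample(TRANSITIONS["START"])
    while act != "END":
        acts.append(act)
        posts.append([rng.choice(EMISSIONS[act]) for _ in range(words_per_post)])
        act = sample(TRANSITIONS[act])
    return acts, posts
```

Each run produces an act sequence like `status, comment, thanks` and a bag of words per post, which is exactly the structure the HMM is meant to recover from real conversations.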
But this is a bit more of a problem for us because we can't rely on capitalization to do a good job of identifying named entities in Twitter. And so if you don't deal with this, you'll get things like all the sentences involving Seattle getting clustered into the same cluster. So what we really want are clusters like status, comment, question, response, thank you, and not clusters about iPods, sandwiches and vacations. And so we ran the content modeling approach on our Twitter data, and this is one example of one of the clusters we got. And you can see here these sentences don't really have anything to do with each other in terms of dialogue act, but they're really very much on the same topic. They're all sort of about web browsers and operating systems. So you see words like Safari, Windows, iPhone, Chrome, Firefox and this kind of thing. So this seems sort of problematic for what we want to do. So our goal basically became to separate out the vocabulary into words which indicate the dialogue act and then conversation-specific topic words. And in order to do this, we used an LDA-style topic model where basically each word is going to be generated from one of three different sources. So it could be just general English, and this is just sort of a way to flexibly model stop words. It could be specific to the conversation's vocabulary, so these are like topics. Or it could come from the dialogue act. And hopefully we'll get sort of a stronger signal here. And this is similar to the model that Lucy and Aria Haghighi used in this year's [inaudible], and this was also for summarization. Okay. So here's the graphical model representation that I showed earlier from the Barzilay and Lee paper. And basically what we're proposing to add to this is the following: we've added a hidden state here for each word which determines which source it was drawn from.
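The three-source idea can be sketched concretely: for every word, first pick a source (dialogue act, conversation topic, or general English), then draw a word from that source's vocabulary. The vocabularies and mixing weights below are made up for illustration; in the actual model these distributions are latent and learned.

```python
import random

rng = random.Random(42)

# Invented vocabularies for the three word sources.
SOURCES = {
    "act":     ["you", "enjoy", "hope", "?"],    # dialogue-act words
    "topic":   ["beach", "weekend", "weather"],  # conversation-specific topic
    "general": ["the", "a", "to", "and"],        # flexible stop-word model
}
# Invented mixing weights (sum to 1).
WEIGHTS = {"act": 0.4, "topic": 0.3, "general": 0.3}

def generate_word():
    """Pick a source by its weight, then a word uniformly from its vocabulary.
    Returns (source, word) so the latent assignment is visible."""
    r, acc = rng.random(), 0.0
    for source, w in WEIGHTS.items():
        acc += w
        if r < acc:
            return source, rng.choice(SOURCES[source])
    return "general", rng.choice(SOURCES["general"])
```

Inference then runs this story in reverse: given only the words, recover each word's hidden source assignment, which is what separates act words from topic words.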
So it could either come from the dialogue act, it could come from the topic of the conversation, or it could just come from general English. And so in order to do inference in this model, we used collapsed Gibbs sampling, which is just the standard technique for this kind of thing, where we sample each variable in turn conditioned on the assignment to all the other variables, integrating out the parameters. But we found that there are a lot of different hyperparameters that we have to set, and these things actually really affect the results of the output. So to solve this problem -- well, we tried doing a giant grid search at first, but this was just kind of infeasible, so we used this idea of sampling the hyperparameters, where we just treat them like other random variables and sample them in the Gibbs sampling procedure. And we used the slice sampling approach, which we found to work pretty well. So another big problem is probability estimation. And this is basically the problem of estimating the partition function, which is intractable in these kinds of models in general. So we need to use some kind of approximation. And I'm not going to go into a lot of detail on this, but we used this Chib-style estimator, which has been proposed recently in the literature for this kind of thing, and we found that it worked pretty well. Okay. So we did a qualitative evaluation. We trained our model on 10,000 Twitter conversations and then looked at the dialogue acts it generated to try and see if they make sense. And so what I've shown here is a visualization of the transition matrix between dialogue acts. So each numbered box represents a dialogue act, and an arrow is drawn between two acts if the probability of that transition is higher than some threshold. And I believe I set it to like .15 here, and there are about 10 acts, so it's sort of higher than random chance. Okay.
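The slice sampling move used for the hyperparameters can be sketched in a few lines. This is a generic univariate slice sampler with Neal's stepping-out procedure, shown here on a standard normal log-density as a stand-in target; it is not the authors' code, and in their setting the log-density would be the (conditional) posterior over a hyperparameter.

```python
import math
import random

def slice_sample(x0, logp, w=1.0, rng=random):
    """One univariate slice-sampling update (stepping-out then shrinkage).
    `logp` is an unnormalized log-density; `w` is the initial interval width."""
    # Draw the auxiliary height under the density at the current point.
    logy = logp(x0) + math.log(rng.random())
    # Step out: expand an interval around x0 until both ends leave the slice.
    u = rng.random()
    left, right = x0 - u * w, x0 + (1.0 - u) * w
    while logp(left) > logy:
        left -= w
    while logp(right) > logy:
        right += w
    # Shrinkage: sample uniformly in the interval, shrinking on rejection.
    while True:
        x1 = left + rng.random() * (right - left)
        if logp(x1) > logy:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```

A nice property for hyperparameter sampling is that no step size needs tuning beyond a rough width `w`, which matches why it is a popular choice for this job.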
And so now in order to visualize each of the acts, I've shown a word cloud where the size of the word is proportional to its probability of being generated by that act. And so you can see that this act in particular is sort of a starting state -- it starts a conversation, so it's transitioned to from the start state -- and it has the word I with very high probability. And then you'll also notice that it has a lot of words like today, going, getting, these kinds of verbs that describe what someone's doing. And you'll see a lot of words like tonight, morning, night, today, tomorrow, a lot of time words. So this is just this typical kind of post on Twitter where someone's just describing what they're currently doing. And this I think makes a lot of sense. So this one can be transitioned to from any of the initial starting posts, and it's pretty clearly a question. The question mark has super high probability. And then you also see that you has very high probability here, where in the previous one I had high probability but you didn't see you. And then you see things like what, are, your, et cetera. And then this act is also a question, but this is sort of an initial question that's starting a conversation. And so in this one you'll see a lot of the same things as the last question state, but there are also some words that we didn't see, so you'll see things like know, anyone, why, who. And from looking at some of the examples, it seems like this is really someone just asking a question to their followers, and it's typically just asking an opinion about something. And then this one is a little bit different. So in just the preprocessing of the corpus I replaced all the user names with this USER tag and all the URLs with the URL tag. So this one contains a lot of user names and URLs. And it also has this word RT, which has sort of special significance in Twitter.
It stands for retweet, which is like reposting someone else's post that you found interesting. So this is sort of like broadcasting some interesting information you found. And sort of in response to this, we see this react state where the exclamation point has super high probability, and you see things like haha, LOL, thanks. You can just imagine sort of reacting to some funny link someone sent you. Okay. So we also did a quantitative evaluation where we try to measure how well our model will predict the order of the sentences in a conversation. And so in order to do this we had a test set of 1,000 conversations. And for each one of those, we generated all possible permutations, all N factorial permutations of that conversation, and evaluated the probability of each under our model. And then we picked the one with highest probability and compared that order to the original sentence order. And in order to compare orderings, we used this Kendall tau metric, which is basically a correlation between rankings. So if the two rankings are exactly the same, the value will be one. If they're exactly opposite, the value will be minus one. And anything greater than zero is like a positive correlation. Okay. And so here I'm just showing some experimental results. And so this is from the content modeling approach from Barzilay and Lee. And these are all trained using EM with various levels of abstraction. So we tried using [inaudible] tags and then word clusters, but it turned out that just using the raw word unigrams worked best in terms of recovering sentence order. And then our content model -- excuse me, our conversation model with the topics does a little bit better; it gives a sort of nice boost. However, we found that when we just leave the topics out and use the fully Bayesian version of the content model, it tends to do a little bit better.
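The Kendall tau comparison described above is straightforward to compute for the tie-free permutations used here: count concordant versus discordant pairs and normalize. This is a minimal sketch of the standard metric, not the authors' evaluation code.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items (no ties):
    +1 for identical orders, -1 for exactly reversed orders."""
    pos = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    # Every pair of items appears in some relative order in each ranking.
    for x, y in combinations(order_a, 2):
        if pos[x] < pos[y]:
            concordant += 1
        else:
            discordant += 1
    n = len(order_a)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For the evaluation in the talk, `order_a` would be the model's highest-probability permutation and `order_b` the original sentence order, so a value near 1 means the model recovers the true ordering well.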
And I tried running a whole lot of experiments to try and get the topics working, and I mean they beat the EM version, but we weren't able to get them to beat the fully Bayesian content model. So why is it that the fully Bayesian content model is winning? It could be outperforming the version with the topics for a couple of different reasons. For one thing, it has fewer parameters and hidden variables to set, so this could just be a problem with getting stuck in a local maximum. In addition, it could be that modeling the topic transitions is actually important in predicting sentence order. So even though it doesn't give us really nice-looking dialogue acts, it could just be important for that task. And in terms of why it's outperforming EM, I don't think this is super controversial. There have been a couple other papers that show that this full Bayesian inference is better. And there are a couple possible explanations for that. So we're integrating out the parameters -- we're not just using a maximum likelihood estimate -- and we're also estimating the hyperparameters from data. And this gave us a little boost, but I don't think it explains all of the difference. I think probably most of it is due to the sparse prior. In standard EM you can't really use a sparse prior, whereas with the Gibbs sampling you can. And actually the hyperparameters that we found were sparse, so it was actually using it. Okay. So in conclusion, we've presented a corpus of Twitter conversations, which we're actually planning, I believe, to make publicly available now. And we showed that they have some sort of common structure which looks like it would be interesting to exploit for applications. And we showed that Gibbs sampling and full Bayesian inference seem to outperform EM. Thanks. [applause]. >> Alan Ritter: Yes? >>: I don't tweet much.
Can you tell us again how you identify what constitutes a conversation and reconstruct it? >> Alan Ritter: Right. So basically when you hit the reply button, it sort of stores that information. So in the API, if the tweet's in reply to something, it will have the ID of the post that it's in reply to. And then you can just query for that, you know. And then from that one, if it's in reply, you can sort of chain it back, you know. Does this make sense? >>: It does. But I think you may have -- you may have identified conversations that really aren't. >> Alan Ritter: Like how so? >>: Tonight I say hey, you want to get dinner, and I tweet back tomorrow, sorry I missed your message. Those are related. Those count as conversations, even though it spans a day. >> Alan Ritter: Right. >>: The following day I text you back something totally unrelated. >> Alan Ritter: Yeah, you're probably right. That probably does happen in our dataset. I think most of them are actually real conversations. There are a lot of cases where we sort of miss part of the conversation, like they forget to hit the reply button or something, and we get these two parts of it that are separate. So there are issues like that with the data, definitely. But for the most part I think it looks pretty good. >>: I don't tweet, I don't follow Twitter, but it seems to me you could have one person posting something and then several different people replying, so that's not one coherent conversation. >> Alan Ritter: Right. So then it's sort of a conversation tree. >>: Yeah. >> Alan Ritter: Right. So we tried to explicitly filter out the multiparty conversations just to try and keep the data clean. But, yeah, you could have sort of multiple conversations that began with the same tweet. And, yeah, I don't know, I'm not sure the best way to handle that. But, yeah, that's a good point. Yeah. Uh-huh?
>>: So you've had ways to go up the conversations, but is there any way to go down the conversation? >> Alan Ritter: No, not that I could figure out, at least. So that is kind of a problem. So we might not see the very end, you know. I mean, even if you think about it just going in time, if we pick something that happened right now, we don't know whether the person is going to reply. So I think that's always going to be just a problem. Yeah? >>: You mentioned that you were trying to pull out things like status and comments and questions and stuff like that as the categories. Did you come up with those ideas a priori, or were those based on examining what the model found? >> Alan Ritter: Yes. So those were just sort of my interpretations of what the clusters looked like, basically. You know what I mean? So there's no labeled data. This is all totally unsupervised, and it just sort of found those clusters by itself, you know. Yes? Or I think you were first. >>: I'm sort of wondering -- I guess two questions. So tweets and Twitter seem like a very small domain actually, and this particular problem you're looking at is very similar to like e-mail threads or Facebook conversations. And there's a whole genre there. Have you thought about extending it or expanding it or trying to relate it in some kind of way to other similar kinds of electronic communication -- I don't even know what to call it. Genre is as good a word. >> Alan Ritter: Yes. So there are a lot of things like IRC chat and Facebook and so on. I think Twitter is good for this because the default setting on people's accounts is open, totally public, so you can go crawl it. Like Facebook, the default is private. >>: Okay. >> Alan Ritter: It's sort of harder. And I think people have looked at e-mail and stuff, too. Yeah. Was there a question? >>: Well, you kind of answered it. It was leading up to my question.
He kind of went to it and you half answered it. I was going to ask why you chose Twitter instead of Facebook. Was it because of -- and I'm also thinking because Facebook would have images and video and -- >> Alan Ritter: Yeah. But Twitter actually -- >>: Twitter's all text? >> Alan Ritter: Twitter actually does have images, too. There's this TwitPic or something like that. People will put links to pictures, you know, and be like haha, look at this, you know. And I mean, you need some vision algorithms or something to handle that. I don't know. Yes, Peter? >>: [inaudible] in your transition graph. So do we have to choose a threshold [inaudible]? How do you choose that [inaudible]? >> Alan Ritter: You mean picking the number of clusters? Is this your question? Yeah, so yes, we picked 10 for that, mostly just because it's easy to visualize, you know what I mean? So I think the best performing ones were about 20 clusters, if I recall. So we did vary the number of clusters and look at that. Yeah. >>: So when you parsed those, was it all strictly straight up like the [inaudible] in the news stream, or [inaudible] user click in directed at someone in the middle of the message? Did you catch that too, or -- >> Alan Ritter: So I think your question is about user names. I'm not sure I quite got it. >>: So when you caught the [inaudible] to start a conversation, was it simply because you catch the user name at the beginning, or was it like through -- or what determined -- >> Alan Ritter: Right. I think I see what you're saying. So there's a convention that if you put the user name at the beginning of your post, you're directing it at that person. And so that wasn't actually -- we didn't use that to get this data. What we did is, if you hit the reply button, Twitter actually records that -- it's like this post was in reply to this other post, you know what I mean?
So trying to recover the conversations just using the user name I think might be a bit difficult, because you don't really know which post it was in response to, you know what I mean? Yeah? >>: Well, I was just thinking that with the user names, if I remember correctly, on Twitter if you look at somebody's profile you can see all the people that follow them? >> Alan Ritter: Right. >>: So it seems that if you're looking at words that happen in a particular post and grouping them by that, you might be able to look at a particular user and their followers and recreate conversations, or reconnect conversations that weren't using the reply. >> Alan Ritter: Right. Yes. So I think it's a little bit harder, because you're never going to know exactly which post it was in reply to. So I mean, one user could say a bunch of things, and then another user could say a bunch, you know. I think it would be sort of an interesting problem to try and do that. I think you'd need some sort of nondeterministic thing to figure it out, you know. >>: I'm just thinking that, yes, one user might say a bunch of things to which a different user responds with only one thing, but that's how conversations tend to go -- one person is going on and on and on, especially given the artificial length limit on Twitter. >> Alan Ritter: Right. That's a good point. Yeah, yeah. And yeah, I mean there's work people have done on IRC chat logs in trying to disentangle conversations, you know, like when there are multiple threads going on at the same time. I think that would be interesting to address. Yes? >>: Part of your presentation was an example of styles of conversations. Kind of following what he had just [inaudible], is there a way to take part of a conversation which may have been built from followers and find out if it's at a particular stage in a set of conversations?
>> Alan Ritter: Yeah, that's a good question. Yeah, we didn't look at that at all, but you could take sort of fragments of conversations from different users and try and see if they're connected in any way. We haven't looked at that, but I think that would be an interesting thing to do. Uh-huh? >>: Perhaps you could address this -- in your presentation you have [inaudible] that in terms of the structures in conversations in Twitter -- what would you be able to -- let me ask: in terms of technologies and stuff and the structures themselves, what will that be able to do for, I guess, future technologies? >> Alan Ritter: You're just asking about applications? >>: Yes, applications. >> Alan Ritter: Yeah, right. I gave sort of a brief motivation for this at the beginning. I kind of mentioned the idea of building a conversational agent, you know, which was sort of the driving thing for us at the beginning of this project. Yeah? >>: Well, what about search? I mean, you think that could play a role in search and stuff, to build better search technologies? >> Alan Ritter: Right. Right. So I think, yeah, if you're trying to search FAQs or message boards or something like that, I think something like this might be helpful. Like if you're looking for things that are an answer to a question. I mean, honestly I haven't spent a whole lot of time thinking about this. But it seems like this kind of information should be able to help somehow, like with how people search conversational text, you know. Yes? >>: I think there's added value from considering the timestamps on the tweets -- like certain types of conversations might be faster or slower? >> Alan Ritter: Right. Right. Yes. I noticed most of them seem to take place within a time span of like 20 minutes or something like that.
And, yeah, I think that's a great thing to look at. It could somehow indicate some important information -- I can't think off the top of my head what -- but, yeah. Another thing too, a lot of these are sort of happening on mobile devices. So maybe the really quick conversations are more likely to be on mobile than on the web interface or something like that. But, yeah. >>: Going to the time, did you notice that -- say there's a [inaudible] say at two a.m. versus two p.m. [laughter] -- was there -- >> Alan Ritter: Right. Yeah. I mean, the frequency of tweets definitely varies a lot by time. So at night I think there are fewer than during, you know, normal hours in North America or something like that. But, yeah, I don't know about the types of conversations. I didn't look at that. That would be an interesting thing to look at. I don't know. It's a good idea. >>: This is [inaudible]. What did you use to make the [inaudible]? Did you do it by hand? >> Alan Ritter: No, no, there was sort of a web interface with a Java applet -- Wordle, I think it was called. >>: Thanks. >> Alan Ritter: Okay. [applause]. >> Michael Gamon: Okay. So now we're going to go from the short tweets to the long protein names and long concepts in biomedical text. So this is Hoifung Poon, and he's done some work with Lucy Vanderwende on joint inference for knowledge extraction from biomedical literature. >> Hoifung Poon: Okay. Thanks, Michael, for the introduction. So this summer I had the great pleasure of working with Lucy on this exciting project of joint inference for bio-event extraction. For those who have been to my intern exit talk, it's pretty much the same talk with a compression ratio of three. I will start with some motivation, and then I will talk about the task of bio-event extraction, and finally I will present our system and some preliminary results.
So before we dive into bio-event extraction, let's step back and take a look at the bigger picture. So in the past decade or so, with the advent of the World Wide Web, there emerged a great vision of knowledge extraction from the web. And the idea is to go from the unstructured or semi-structured text available online and extract structured knowledge from it. Such a great vision doesn't really need much advertisement for this audience. If we can extract knowledge automatically and reliably, even to a limited extent, we can construct a gigantic knowledge base, probably the largest in the world, and this can facilitate all kinds of great applications like semantic search, question answering and so forth. And finally, by breaking the knowledge acquisition bottleneck, the days to solving AI could be numbered. So in the past decade or so there have been some very great efforts toward this vision, including some very notable systems like TextRunner and, locally, MindNet; however, the problem still remains largely unsolved. And the natural question to ask is: should we tackle the whole web from day one, or is there a domain that we should prioritize? Presumably such a domain would either be more urgent to tackle or easier to make progress on. And also the lessons learned and the infrastructure for this domain should be general, so that we can use it as a starting point for tackling the rest of the web. So I will argue that the biomedical literature is such a great starting point for knowledge extraction from the web. So online, from PubMed, there are about 18 million abstracts right now, and the growth is exponential. A few years back, the reported number I saw was about 14 million, so you can do the math. And if we can have some success here, that will have a huge impact on biomedical research.
So when you do biomedical research, you have to have access to a broad spectrum of information. If you're investigating a disease like diabetes or AIDS, the number of relevant genes might be in the hundreds, even thousands. So you really want to know all this relevant information. And on the other side, each gene under the sun might have already been investigated by some lab somewhere, sometime. However, that information is captured, maybe written up in a nice paper, but it's buried under these millions of articles, and it's very difficult for researchers to find. So one biomedical research student at UW actually told me that they couldn't really keep up with all the literature in their own subfield, and what they do is just follow a few top labs and pay attention to what those labs do. So you can imagine how much research effort has been wasted because of this, and also how many discovery opportunities have been missed as a result. And also, needless to say, if we can make progress in this direction, there is also a flip side in the commercial world: this will bring a huge impact to drug design, and the big pharmaceutical companies will really love it. And finally, this domain is attractive, especially for this audience, because the articles are all written in grammatical English. From the previous talk, you have already seen that some of the language out there may not be so amenable to processing. However, in this domain, you're supposed to write in grammatical English, and so you can apply any of the advanced linguistic theories or NLP tools to this domain. So hopefully I have convinced you that this is a worthy endeavor. The natural question to ask then is: why is this problem hard?
So to get a feeling for the domain, here is a typical snippet, just from one sentence in one abstract. So this is not even a full sentence. You can see these few characters actually [inaudible] a great amount of knowledge. It actually describes a bunch of events, marked in red here, and also a bunch of proteins and genes, marked in green, and also some of the localization sites, marked in blue. And this actually conveys a complex nested structure: you have an up-regulation event, you have another regulation event signified by involvement, and yet another one by activation, and then they also have event argument structure, like IL-10 is the theme of the regulation event, and also the site and the cause and so forth. So again, you can imagine that using techniques like keyword search or pattern matching couldn't really scratch even the surface. So this is the first major reason why the information is so hard to get. Another major reason, which is really common to NLP, is that you can say the same thing in many, many different ways. So here is what's called a trigger word, which signifies the event, in this case negative regulation. And just within 1,000 documents, 1,000 abstracts from PubMed, the numbers are basically the number of times the following words are used to signify the event. And you can see this clearly. On the flip side of this, the same word can also mean very different events. For example, in this corpus of 1,000 abstracts, the word appearance is actually used to signify five different event types, including some of the examples here. And there are also some very subtle denotations. For example, in this sentence, you can see in the first line that these cells are deficient in some expression, although their production is normal.
So normal here actually signifies a regulation event, but if you just look at the second part of the clause -- some production is normal -- you only really get a clue that there's an event when you consider the context: there is a parallel event described earlier, and this is actually referring to the same event but saying that the event doesn't happen in this context. So by now you have seen why this problem is hard. And in fact, this is actually a great opportunity for linguists and NLP researchers, exactly because of all these complications of human language. So bio-NLP is actually an emerging field. Right now it has a regular workshop at [inaudible] ACL, and I won't be surprised if it becomes a full-blown conference very soon. So initially the community started on just trying to recognize the protein names. And this problem is largely solved; we can get to the high 80s. And then they ventured into trying to detect whether two proteins interact with each other. So this has been going on for a decade or so, but the top systems are still around 60; it's largely unsolved. And that's mostly because, when people worked on this problem, they recognized that instead of treating it as a binary classification using [inaudible], it's actually necessary to go deeper into the language structure and tackle extracting the detailed bio-events. So this is the shared task of this year. And obviously the story doesn't end here. There has already been effort going on to construct a pathway corpus, and then eventually you want to build the entire gene network and understand how the genes interact and so forth. So in this talk we will focus on bio-event extraction. So by definition a bio-event refers to a state change of bio-molecules. And in this shared task they focus on these nine event types.
So the first five event types are relatively simple because they can only take one theme. Binding is a lot more complicated: it can take up to five themes, so you can have A binds to B and C and D, and so forth. And then there's regulation, and they distinguish regulation into positive, negative, and also ambiguous, where you don't know which way it goes. And this is the most complex, as we saw in our example. So the data given in the shared task is that you are given the text, obviously, and then -- because protein recognition is a problem that is not really solved, and they want to factor it out of the task -- they also give you labels for where the proteins are. And then your system is supposed to predict this block of information. So each line refers either to an event type declaration or to an actual theme or cause. The first two lines say that there is a regulation event signified by involvement and a positive regulation signified by activation, and then the last line says that E1 is an event that takes E2 as its theme and T3 as its cause. So here you can see that a regulation event can take another event as an argument. So you don't have to look into the numbers; the whole point of this slide is that this shared task has attracted a great number of participants, and as a result the performance numbers are pretty representative of the current state of the art. The top system is from a group at the University of Turku in Finland. On the simple events, the first five, they get an F1 of about 70, which is pretty decent. However, the binding and regulation events are the really difficult ones. And overall, the F1 is about 52 and the precision about 58, so both are still not very satisfying. And the top system also has another problem at a high level, which is that it adopts a pipeline architecture. Actually, the top three systems all adopt a pipeline architecture.
So they typically first determine a number of candidate events and types. Then, starting with this list of candidates, they classify, for each pair of a candidate and a protein, or between two candidates, whether the latter is a theme or cause of the former. The Turku system uses SVMs for this classification. A major drawback of this approach is that there is no way to feed back information once you have committed to the list of candidates, and the Turku system actually has to do some very ad hoc engineering to facilitate the second-stage learning. Another major drawback is that the prediction of each theme or cause, and of each event and its type, is made totally independently of the others, so you lose a lot of opportunities for the decisions to inform each other. As another interesting data point, the Tokyo team, who are actually the organizers, used a conditional random field and got an F1 of 37. So the bottom line is that this task is very challenging, indeed reflecting how complicated human language is. So when we came to design our system, the first decision we made is that, even though the top systems all use the pipeline architecture, we think that's not satisfying and not good for pushing the performance to the next level, so our first criterion is to jointly predict events and arguments. Also, there is certain prior knowledge that we have, and we want to incorporate it straightforwardly. Moreover, we saw that the conditional random field doesn't cut it, because the labels don't really correlate in terms of linear context, but they likely do in syntactic context. For example, if you know that B is the theme of A, and C is in conjunction with B, then you may conclude that maybe C is also a theme of A. So because of this, we picked Markov logic as our framework. 
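That conjunction intuition can be sketched as a tiny propagation procedure. This is only an illustration of the joint-inference idea, not the talk's actual implementation: the function name, the data structures, and the example labels A, B, C are all made up, and in the real system this regularity is a soft weighted rule rather than a hard closure.

```python
def propagate_themes(themes, conj_edges):
    """Close a set of theme assignments under conjunction.

    themes:     set of (event, argument) pairs already believed,
                e.g. ("A", "B") means B is a theme of event A.
    conj_edges: set of (x, y) pairs meaning x and y are conjoined
                in the sentence, as in "... B and C ...".
    Returns the set of themes after repeatedly applying the rule:
    a conjunct of a theme is likely a theme as well.
    """
    themes = set(themes)
    changed = True
    while changed:
        changed = False
        for event, arg in list(themes):
            for x, y in conj_edges:
                other = y if x == arg else (x if y == arg else None)
                if other is not None and (event, other) not in themes:
                    themes.add((event, other))
                    changed = True
    return themes

# B is a theme of A, and C is conjoined with B, so C becomes a theme too:
print(sorted(propagate_themes({("A", "B")}, {("B", "C")})))
# → [('A', 'B'), ('A', 'C')]
```

A linear-chain CRF has no natural way to express this, because B and C can be arbitrarily far apart in the word sequence while adjacent in the syntactic structure.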
Markov logic is a framework developed by Matt Richardson and Pedro Domingos, and it can compactly represent very complex relations and also handle uncertainty. The bottom line is that each Markov logic network consists of a set of first-order logic formulas, each paired with a weight, and this defines a joint probability distribution using a log-linear model, where n_i is the number of true groundings of formula i. You don't really need to pay too much attention to the details, and in the interest of time I'll skip the examples. So Markov logic is a natural framework to conduct joint inference, and moreover there are already a lot of efficient algorithms available in an open-source implementation. So our Markov logic network for bio-event extraction consists of a few groups of formulas. Another advantage of Markov logic is that you can express these intuitive regularities very compactly and straightforwardly. For example, we have the prior knowledge that an event must have a theme and that only a regulation event can have a cause, and so forth, so we incorporate those as hard constraints. Then there is soft evidence: for example, the word "activation" probably refers to a positive-regulation event. This doesn't hold all the time, but it holds statistically for quite a number of cases. Then there is the syntactic evidence we already mentioned: for example, a conjunct of a theme is likely a theme as well. And finally there are rules combining lexical and syntactic evidence: for example, the subject of the word "regulate" is probably signifying -- actually this should be a cause, I'm sorry. For these last three groups, Alchemy can learn a weight for each specific constant, so you can specify them very compactly with a few rules. Sorry about that. So our base MLN basically consists of the first two. 
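The log-linear model mentioned above can be written out explicitly; this is the standard Markov logic formulation, with symbols as in Richardson and Domingos:

```latex
P(X = x) \;=\; \frac{1}{Z} \exp\!\Big( \sum_i w_i \, n_i(x) \Big)
```

where the sum ranges over the formulas, w_i is the weight of formula i, n_i(x) is the number of true groundings of formula i in world x, and Z is the partition function that normalizes over all possible worlds. Hard constraints correspond to formulas whose violation is forbidden (effectively infinite weight), while the soft lexical and syntactic evidence gets finite learned weights.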
The first three groups of rules you can consider joint inference rules, because they make decisions about events and themes, and about different themes and events, together. The base system consists of the first two groups, plus the lexical evidence and the syntactic evidence. So here are the preliminary numbers on the development set. They are evaluated on ground atoms, so it's not entirely the same as evaluating on events, but it gives us an idea. We get an F1 of 64, and more importantly we get a pretty high precision of 74. Let's take a look at the effect of the features. If we start with just the base MLN, it gets an F1 of 34, but as soon as we add the further joint inference rules we get a huge improvement. And then if we add some lexical evidence and also some preprocessing, we get to 64. We also looked at the training size, represented here by the number of abstracts. Going from 50 to 100 abstracts we get a huge improvement, and then it starts to level off. So here are the numbers on just the event types, i.e., whether our system can predict the event type correctly. For those who have seen my last talk, these numbers are slightly better because I have some new results. What you can see is that for the first few types of events the F1 is pretty high. So one idea would be to actually start using these extractions: maybe we cannot use them for all the events, but we can start with the most accurate ones to start building a community or attracting initial consumers, et cetera. Another take-home message is that in general our precision is pretty high compared to the prevailing systems, and this is arguably good for this domain, because whatever you propose had better be correct. It's okay to miss some, but what you do propose had better not contain a whole lot of noise. So in future work, we want to incorporate more joint inference opportunities. 
You have already seen some candidates in earlier slides which we haven't been able to incorporate in this MLN yet. We definitely want to incorporate discourse, like coreference -- whether [inaudible] this coreference or [inaudible] ones. Those are another special kind of joint inference. And finally, we may start looking into applications in some commercial product. So in conclusion, I have presented a system for joint inference for bio-event extraction. It's based on Markov logic. The preliminary results are pretty promising, and also, because it benefits from joint inference, our system tends to attain higher precision. And finally, maybe we can start thinking about some commercial applications, focusing on the most accurate event types. Thank you. And I would like to take questions. [applause]. >> Hoifung Poon: Yes? >>: Are you assuming that the theme and the cause of an event will occur in the same sentence as the event [inaudible]? >> Hoifung Poon: That's an excellent question. Actually, no. In the shared-task annotation there were a number of cases, about five percent, where the theme is actually stated in the previous sentence and then later referenced by coreference. Currently our system doesn't address that, and I think none of the existing systems address that. But that's definitely an opportunity. You can see something like this: they mention a bunch of genes, and then you have a sentence like "these genes are found to" blah, blah, blah, and so forth. Yes? >>: I'm wondering if the Markov logic approach would allow you to incorporate information from the ontologies they have in this domain to get more actual training data? >> Hoifung Poon: Yeah, yeah. That's another great point. 
And in fact, we have considered something along the same lines, but it's more like using some unsupervised approach to do some construction, like clusters, and then to multiply the training data. So I haven't quite thought about using the GENIA ontology. The ontology per se might provide some information, but there is also a problem: none of those textual expressions are marked as directly related to the ontology. So you have the ontology available, and you also have the text, but then you need a mapping between text and ontology. That's always a problem for formal ontologies. But if we can somehow construct a text-based ontology, maybe unsupervised from the text, then it's much easier to relate them, and then we can start multiplying training examples that way. So that's an excellent point. >>: So do you actually need to specify all those features before you do the training? >> Hoifung Poon: Yeah. Yeah. So this is totally supervised learning, and we don't do any structure learning, so you do need to specify the features. >>: So how many features are there [inaudible]? >> Hoifung Poon: There are roughly, I would say, between 10 and 15 rules. One of the advantages of Markov logic is that there is some syntactic [inaudible] that allows you to specify one rule, and then the system will learn a weight for each combination of constants. So, for example, you can specify a rule saying: try to learn a weight for each individual word -- what event type is it signifying? And then it will try to learn weights for "activation," "regulate," and so forth. >>: So you have the hard constraint that each event must have at least one theme. Did you find cases where the theme just wasn't mentioned in the text? >> Hoifung Poon: That's a great question. So this is more like an artificial effect of how they annotated this shared task. 
So there is another issue that I didn't get into in the slides: how they annotated the shared-task data actually also poses some difficulty for the learning and training. In particular, they will only annotate an event if it has a theme that can be traced down to a protein and is explicitly stated. So you could have a sentence saying "IL-2 regulates something," but if that something is not a protein, the whole thing is not annotated. And that actually makes the training a lot trickier. Okay. Thank you. [applause]