
>>: Welcome back to the short talks session. So for this session, we actually have five speakers,
and we really have a very little amount of time each, so I'm going to limit the questions to one or
two questions, and we'll only have two minutes for questions. So I'm really pleased here to be
able to introduce the five different papers here, so we're going to be starting. The first paper is
on approximate parsing with hedge grammars, and it's going to be Mahsa Yarmohammadi
talking about it. Thank you. Great. Thanks.
>> Mahsa Yarmohammadi: Okay. Can you hear me?
>>: Is your mic on there?
>> Mahsa Yarmohammadi: Yes, it's on. Okay, I'm going to talk about approximate parsing with
hedge grammars today. First of all, thank you, and thanks also to Northwest NLP for triggering this
submission to the ACL: I was preparing this two-page extended abstract for Northwest NLP, then I decided
why not submit the same work to the ACL, so Brian, Aaron and I submitted this paper, with a different
title, to the ACL, and it got accepted. Parsing for hierarchical structure is costly, and in fact, some NLP
applications skip it altogether and replace it with shallow proxies such as NP chunks, for example. The
models used to derive these non-hierarchical structures are finite-state equivalent, so they are very fast.
However, they omit all but the most basic syntactic segmentation. And in some applications, having some
degree of hierarchical structure might be useful in inference, even if it doesn't end up as a fully
connected parse tree. So in
this talk, I'm going to present hedge parsing as an approach which provides a full hierarchical
structure for a local window by ignoring any constituent which is out of this local window. So
we defined hedge parsing as discovering every constituent with span up to some length L. This is a parse
tree for an example sentence. Every constituent in this parse tree has a span, which is the number of
words it covers. For example, the span is five for these NP and VP constituents, it is 10 for this S
constituent, showing that it covers 10 words, and it's 13 for this VP constituent here, and for the
part-of-speech tags, the span is just one,
which means that they cover just one word. Okay, to obtain the constituents that cover up to L
words, we apply a hedge transform to the original tree. We recursively remove the constituents that span
more than L words and connect their children to the parent of the removed node. This is an example for
L equal to seven: these are the nodes that are subject to removal because they span more than seven
words. So we remove those, and we connect their children to the topmost node,
which is S in this case. Okay, a very interesting property of hedge parse trees is that they are
sequentially connected to the topmost node, as I showed with the dashed line here, so each hedge
constituent is sequentially connected to S, and this property allows us to segment the sentence before
parsing, parse these segments instead of the entire sentence, and then recombine the results. And, as I
will show later, this gives us a huge speedup over the case where we parse the entire sentence. This plot
shows the percentage of retained constituents when
we do the hedge transform on the Wall Street Journal Penn Treebank. Over 50% of constituents
have a span of three or less, and, for example, for L equal to 15, over 90% of constituents are retained
in the hedge-transformed tree compared to the original parse tree. Okay.
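As a recap of the transform just described, here is a minimal sketch of one way to implement it; the Node class and its fields are hypothetical illustrations, not the paper's actual code.

```python
class Node:
    """A hypothetical constituent: a label, child nodes, and a word for leaves (POS tags)."""
    def __init__(self, label, children=None, word=None):
        self.label = label
        self.children = children or []
        self.word = word

    def span(self):
        # A leaf (part-of-speech tag) covers exactly one word.
        return 1 if self.word is not None else sum(c.span() for c in self.children)

def hedge_transform(node, L):
    """Recursively remove constituents spanning more than L words and
    attach their children to the parent; the root is always kept."""
    new_children = []
    for child in node.children:
        child = hedge_transform(child, L)
        if child.word is None and child.span() > L:
            new_children.extend(child.children)  # promote children of an over-long constituent
        else:
            new_children.append(child)
    return Node(node.label, new_children, node.word)
```

Applied to the example tree with L equal to seven, the children of the removed nodes end up attached directly to S, as described above.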
What are the methods for hedge parsing? As I said, hedge parsing is the approach of finding the hedge
constituents of an input sentence. The baseline approach, which is the brute-force case, is to parse the
entire sentence using a full context-free grammar and then hedge transform the result. As we can imagine,
this has the maximum accuracy but minimum speed, because it has the full, rich knowledge of context, so
it's very accurate but very slow. I propose two alternative approaches over this baseline. In both, we
constrain the search space of the CYK parser based on the constraint defined for hedges, and we also use
a hedgebank grammar instead of a full context-free grammar; a hedgebank grammar is trained from hedge
parse trees as opposed to full parse trees. This is how we constrain the
search space of a CYK parser. Since we limit the span of the non-terminals, we can limit the search space
of the CYK parser, as I've shown here, by closing the cells whose span is above L, which is seven in this
case, except the cells along the periphery of the chart. And the complexity of CYK parsing is reduced
from n-cubed times the size of the grammar to the expression I showed below. I'm not going through the
details of how we compute this complexity, but we are very well [indiscernible], and you can ask me
questions in the poster session, where I am presenting the same work as a poster. Okay. These are the
approaches for hedge parsing besides the baseline. The first
approach parses the entire sentence using a hedgebank grammar, and the second one
presegments the sentence, parses each segment independently and simultaneously and then
combines the results together. The task of hedge segmentation is to chunk the input sentence
into complete hedges, so we trained a classifier and applied it at each word to decide if that word
can begin a new hedge or not. So we define two tasks: one is unlabeled and the other is labeled. For the
unlabeled task, we just have begin and not-begin tags, and for the labeled task, we also include the type
of the constituent, like NP or VP. We use a discriminative log-linear model, trained with the averaged
perceptron algorithm, and these are the feature sets that we use.
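As a minimal sketch of the unlabeled (begin versus not-begin) segmenter just described, here is an averaged perceptron over simple token features; the feature set shown is a simplified stand-in for the paper's actual features.

```python
from collections import defaultdict

def segment_features(words, i):
    """Toy features for deciding whether word i can begin a new hedge (simplified stand-in)."""
    prev = words[i - 1] if i > 0 else "<s>"
    return {"w=" + words[i]: 1.0, "prev=" + prev: 1.0, "bigram=" + prev + "|" + words[i]: 1.0}

class AveragedPerceptron:
    """Binary averaged perceptron: gold label +1 means 'begins a hedge', -1 means 'does not'."""
    def __init__(self):
        self.w = defaultdict(float)      # current weights
        self.sums = defaultdict(float)   # running sum of weights, for averaging
        self.t = 0                       # number of training examples seen

    def score(self, feats, weights=None):
        weights = self.w if weights is None else weights
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())

    def update(self, feats, gold):
        self.t += 1
        if gold * self.score(feats) <= 0:      # mistake-driven perceptron update
            for f, v in feats.items():
                self.w[f] += gold * v
        for f, v in self.w.items():            # explicit averaging (simple, not the lazy trick)
            self.sums[f] += v

    def averaged_weights(self):
        return {f: s / max(self.t, 1) for f, s in self.sums.items()}
```

A sentence is then segmented by starting a new hedge wherever the averaged-weight score is positive, and each segment is parsed independently before the results are recombined.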
This is our experimental setup. We use the Wall Street Journal Penn Treebank, the standard
training, development and test sections. We train Berkeley latent-variable grammars with six split-merge
cycles. We use the BUBS parser in exhaustive CYK parsing mode, and we use a standalone machine with this
specification to run the experiments, and we evaluated our parsing results with precision, recall and F1
score using the standard EVALB script. These are the results of parsing the development set, Section 24,
for L equal to four and seven. You can see that we get an order-of-magnitude speedup when we parse using
the hedgebank grammar as opposed to parsing with the full grammar, at the cost of 3% accuracy. And this
shows the comparison between no segmentation versus presegmentation before parsing. If we look at the
[indiscernible] line here, you can see that segmentation is potentially very powerful for fast and
accurate hedge parsing. In practice, we cannot achieve this accuracy, obviously; we achieved this
accuracy instead, and even in that case, we got an order-of-magnitude speedup over the case where we do
not segment before parsing, at the cost of almost a 5% accuracy reduction. This is
the result on our test set, or eval set. Basically, the same pattern holds here, and these are the
results for different values of L from three to 20 on the eval set with our three different approaches:
the baseline, no segmentation, and presegmentation before parsing. As you can see here, we could get a
huge speedup when we segment before parsing, especially for small L, and if you look at this graph, you
see that when we segment, we get a consistent degradation in precision and recall, which points to the
need for an improved segmentation model in our task. I introduced a novel partial parsing approach for
fast syntactic analysis of the input beyond shallow parsing. It allows for input segmentation before
parsing. Our baseline segmentation model provides a significant speedup in parsing, but cascading errors
remain a problem. And these are a few future directions. One is to investigate hedge parsing in
combination with the methods that are available for prioritizing and pruning the cells of a parser.
Another direction is to improve the hedge transform and the hedge segmentation model to achieve better
accuracies. And the other is to evaluate the hedge parsing idea on a real task like incremental
translation. Thank you so much.
>>: Great. Thank you very much. So while the next presenter gets ready, we have time for a
couple of quick questions, and I have the mic here.
>>: So have you heard of vine grammars and vine parsing?
>> Mahsa Yarmohammadi: Yes.
>>: So how is this different from that?
>> Mahsa Yarmohammadi: Yes, that's a different -- the same concept but for dependency
parsing, and they use some kind of similar constraint for length on the dependency parsing, as
well. That's the same idea with [indiscernible] parsing, and we mentioned that as a related work
in the introduction, as well.
>>: Okay, and time for one more quick question while we get that set up. Very quick question
here.
>>: A clarification, I guess. When you're evaluating, are you evaluating against the true full
parse or are you evaluating against a hedged tree?
>> Mahsa Yarmohammadi: No, that's a good question. That's against the hedge transform trees.
Thanks for clarifying.
>>: Great. Okay, let's thank the speaker again. Thank you very much. Okay, we're getting it
set up here for our next speaker. It's going to be Xiao Ling, and the talk is going to be on context
representation for named entity linking.
>> Xiao Ling: I'm Xiao Ling from the University of Washington. This is joint work with
Sameer and my adviser, Dan. Today, I'm going to talk about context representation for named
entity linking. So let me just briefly tell you about what the task of named entity linking is. It's
about identifying entity mentions in text and connecting to their corresponding knowledge base
entries, usually Wikipedia. So there are a couple of major challenges for this problem. One is
ambiguity, where people use the same entity mention to refer to different entities. For
example, we have Seattle in both sentences, but the first one refers to the soccer team Seattle
Sounders, while the second one refers to the city. The other big challenge is variability, also
known as synonymy, where people tend to use different names for the same entity, like the
nickname of Seattle and also acronyms, like MSR for Microsoft Research. So next, I'm going to
tell you about the general architecture of a standard named entity linking system. Assuming the
entity mention is given, like Seattle here, within some context -- I'm going to simplify the context by
using only one single sentence -- next, we're going to generate a bunch of candidate entities for the
system to select from. To be able to select the best one, we're going to rank them using multiple scoring
functions: for example, string similarity, to measure the similarity between the entity mention and the
canonical form of the candidate entity. And also, we can look at the context around the entity mention
and measure how similar it is to the representation of the entity. You can have other scoring functions
as well, and in the end, we're going to sum them up and pick the entity with the highest score. Today,
I'm going to focus on the context similarity. So the majority of previous named entity linking literature
has been using the bag-of-words representation for the context of entity mentions. In the running
example, we have a three-word bag -- beat, Portland, yesterday -- and to be able to compare that to
the candidate entities, we're going to look at the Wikipedia text of the corresponding entities and
extract similar bags of words, and we're going to compare the bag of words I mentioned to each
of them. So we observe a couple of issues here. First, the words in the bag are sometimes imprecise and
push the prediction toward the wrong candidate. Also, it might include irrelevant features, like
yesterday, because yesterday doesn't help disambiguate the entity mention. What matters here is the verb
beat. However, the issue is that beat might not occur in the bag of words for the correct entity. What we
are proposing here is to make use of the dependency graph of the sentence containing that entity mention.
We are going to focus on the direct dependencies linked to the head of the entity mention, and we are
going to extract features like these. The first feature basically says the entity mention is the subject
of the verb beat. And also, we are going to include some more specific ones, like the second one, which
is a conjunction feature. It basically says the entity mention is the subject of beat, and also the
object of beat is Portland. To construct similar representations for the entities, we again look at the
Wikipedia text and extract features using the same procedure.
And also, as a complement, we are going to look at a bunch of sentences from the web, whose authors have
voluntarily linked the important mentions to the corresponding Wikipedia pages, and we're going to apply
the same feature extraction. And after the processing, we are going to have a huge matrix, where the rows
are the entities and the columns are the features. The number of entities we have in this matrix is
around 3 million, and the total number of features is around 700,000. As you might suspect, the matrix is
very sparse, and it has lots of missing values. So, for example, we might not observe the expression that
Seattle beat some other team in either the Wikipedia text or the web sentences, so it will be very
difficult to compare this representation to the representation for the entity mention. So what we propose
here is to perform a matrix completion task by learning a low-dimensional embedding of both entities and
features, which captures the correlations among these features and also among similar entities within
the matrix. To further encourage sharing and propagation of that information, we add the 500 most
frequent Freebase types and fill in those values accordingly. So the final matrix is composed of two
parts -- the dependency features we extracted and the Freebase types -- and we learn a bunch of
low-dimensional vectors for each of the entities and each of the features. So here are some experimental
results. First, this is more
like a sanity check. We want to make sure that the entity matrix is making sense, so what we did was
basically pick some pairs of ambiguous entities. Like in this case, we pick Georgia the state and Georgia
the country, and we compare the nearest neighbors
according to the entity matrix. As you can see here, it's kind of making sense, because the
nearest neighbors for Georgia the state are mostly states, and the nearest neighbors for the
country are mostly European countries. So that's nice. And then, we look at the real-world
entity-linking data sets, and the first thing we did is we want to make sure that the proposed
representation is better than bag of words, so to isolate other factors, we basically just used the
context similarity to rank the candidates. And the y-axis is showing the percentage of cases where one
representation ranked the gold entity higher than the other. And the blue bars are the winning
percentage of our work and the green bars are bag of words. So in all the data sets we tested, our
proposed representation actually dominates bag of words.
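As an illustration of the dependency-based context features described earlier, here is a minimal sketch; the triple format, relation names and feature strings are hypothetical stand-ins, not the paper's actual templates.

```python
def dependency_features(mention_head, deps):
    """deps: (governor, relation, dependent) triples from a dependency parse.
    Emits a feature for the direct dependency attaching the mention head to its governor,
    plus conjunction features with the governor's other arguments."""
    feats = []
    for gov, rel, dep in deps:
        if dep == mention_head:
            feats.append(f"{rel}({gov})")                 # e.g. nsubj(beat)
            for gov2, rel2, dep2 in deps:                 # conjunction features
                if gov2 == gov and dep2 != mention_head:
                    feats.append(f"{rel}({gov}) & {rel2}({gov})={dep2}")
    return feats

# Running example: "Seattle beat Portland yesterday"
deps = [("beat", "nsubj", "Seattle"),
        ("beat", "dobj", "Portland"),
        ("beat", "tmod", "yesterday")]
features = dependency_features("Seattle", deps)
# -> ['nsubj(beat)', 'nsubj(beat) & dobj(beat)=Portland', 'nsubj(beat) & tmod(beat)=yesterday']
```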
>>: Two minutes.
>> Xiao Ling: Got it. Okay, so these are some preliminary results on the end-to-end linking accuracy of
our prototype system. The whole system is still under development. For example, we don't have a joint
inference component that considers the predictions of the links together, but still, using simple string
similarity and the context similarity we propose here, we have comparable results, at least in two of the
seven data sets, and we are optimistic that when we further improve the system, it's going to get better.
And also, just a quick reminder, our context representation is independent of the whole system, so it can
be plugged into any other linking system. Just a quick recap: we propose a novel representation for
modeling the mention context; to combat sparsity in the entity matrix, we perform a matrix completion
task; and we show some preliminary experimental results in the end, so that's it.
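The matrix completion step in the recap can be illustrated with a plain low-rank factorization fitted only on the observed entries; this is a generic sketch with made-up hyperparameters, not the system's actual training procedure (which also folds in Freebase types as extra columns).

```python
import numpy as np

def complete_matrix(observed, shape, rank=50, lr=0.05, reg=0.01, epochs=20, seed=0):
    """observed: dict mapping (entity_index, feature_index) -> value.
    Learns entity vectors U and feature vectors V so that U[i] . V[j] fits the
    observed cells; U @ V.T then fills in the missing ones."""
    rng = np.random.default_rng(seed)
    n_entities, n_features = shape
    U = 0.1 * rng.standard_normal((n_entities, rank))
    V = 0.1 * rng.standard_normal((n_features, rank))
    for _ in range(epochs):
        for (i, j), y in observed.items():
            err = U[i] @ V[j] - y
            # SGD step on the squared error of this observed cell, with L2 regularization
            grad_u = err * V[j] + reg * U[i]
            grad_v = err * U[i] + reg * V[j]
            U[i] -= lr * grad_u
            V[j] -= lr * grad_v
    return U, V

# Toy usage with a few observed (entity, feature) values; the real matrix would be
# roughly 3 million entities by 700,000 features plus 500 Freebase type columns.
U, V = complete_matrix({(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}, shape=(3, 4), rank=2)
scores = U @ V.T   # similarity of every entity to every feature, including unseen ones
```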
>>: Great. Thank you very much. We have time for some questions, and your hand is the first
one up here.
>>: Sure. Can we go back to your results slide? So I notice that you're doing well on a lot of
the complicated data sets, like TAC, but something like MSNBC is such an easy data set to beat.
I was wondering why this work is getting lower performance.
>> Xiao Ling: There could be many reasons why we are getting worse. One particular reason is
the MSNBC accuracy is evaluated over a bag of concepts within the whole document, while we are sort of
predicting the links for each entity mention independently, so it could be that for the same entity
mention we are predicting different things. That will hurt precision if you
look at the whole document as a bag of concepts, so this could be one reason, but there could be
others.
>>: We have time for one more question while the next speaker comes up to get set up, and it's
going to be Victoria Lin who is the next speaker. Another question, at the back there.
>>: Do you have a sense of [indiscernible] of the syntactic [indiscernible]? What is most
important between your syntactic side of the matrix and the side where the other types -- what
happens if I use only the type side or only the syntactic feature part or if I try to weight them?
>> Xiao Ling: I think the syntactic features and the Freebase type features are complementary to each
other. I don't think it's possible to only use Freebase types, because there's no such clue for the
entity mention, so you don't actually have anything to compare with. I don't have
an exact answer for what's the most important syntactic feature in the matrix, and I think it's
mention independent.
>>: Great. Thank you very much. Let's thank the speaker again. And while we're getting set up
here, I'll have a chance to introduce the next speaker and the title of the next talk, and what's
going to appear on the screen behind me here is the title, hopefully being -- good timing.
Leveraging prior knowledge of output structure for learning with incomplete annotations, and the
speaker today is going to be Victoria Lin, so over to you Victoria for 10 minutes. Thank you.
>> Victoria Lin: So good morning, everyone. Welcome to my talk. I'm Victoria. Today, I'm
going to speak about a research project we did recently. The focus of this project is scenarios where we
learn with only incomplete annotations, but where prior knowledge of the ideal prediction structure can
help us train the classifier. This is joint work with our group from the University of Washington, with
Sameer, Luke, Luheng and Ben Taskar. By the way, if you're going to search for Ben Taskar on your phone,
that is a name to Google for. So
the problem we look at is multi-label learning. The goal is to assign a set of labels to each data point.
This is a problem we frequently encounter in different situations. For example, nowadays, there is a
variety of social bookmarking websites where users can assign keywords to their favorite websites,
scientific publications or their everyday photographs. Multi-label learning can be seen as the simplest
form of structured prediction, in the sense that the output structure is a flat set of labels. However,
even with this simple structured output, as the number of labels we consider grows, one encounters a lot
of different challenges when designing a machine learning algorithm for it. The most immediate one is
that gathering complete annotations is extremely challenging, so think about this example. For a single
image, what is the possible set of words you could apply to it? We see here there is a castle. There are
pinnacles of the castle, and some people might want to call it a building. There are also the lights that
are scattered, etc. When users assign tags for these kinds of images, no effort is made towards
completeness. Hence, the annotations we get from the users might look like this. In this case, if we
treat those missing labels as inactive, our classifier would be confused, because the features
corresponding to the missing objects are still present in the image. Hence, what we really want to do is
to treat the missing labels as question marks, in the sense that we are not sure whether
they should be present or not. Our model handles this aspect by excluding them from the
definition of our loss function, and on the other hand, we use the prior knowledge about a desired
prediction structure to help recover those information. So here, we present a mathematical
formulation. Our input is represented as the design matrix X and a label matrix Y. Each row of
the matrices here is corresponding to an example, so each row of X is a feature vector and each
of Y is a binary label vector, so that Y(I,J) equals to one if example I is tagged with label J. And
as I said before, we really want to treat those zeros in Y as question marks. So our model goes
by completing the label matrix as training goes along. Our proposed approach is we would have
an inductive multi-label learning prediction model and also a label-completion model, and what
we are going to do is to do joint inference over both of them in training. So the matrix
representation of Y is key to our innovation, because here we see we have observed some entries
of Y, which are the ones corresponding to the annotations provided by the users. So by the theory from
the matrix completion field, if Y is known to satisfy certain structural properties, then with those
samples one is able to recover the original Y matrix with high probability. And the structural
assumptions we make about Y come from two different aspects, both based on our prior knowledge. The first
one is that the large number of labels are actually highly correlated with each other, in the sense that
groups of words tend to co-occur with each other. Hence, we would expect the possible labels to actually
come from a smaller number of word groups. Mathematically, this means the latent structure of Y should be
close to low rank. And also, out of the large number of possible tags, each example can only be
associated with a small number of them. Hence, we also expect the true label matrix to be sparse. So
with those structural assumptions, we are able to design the completion algorithm, which is fairly
standard. We do this by fixing a desired rank K and defining the factorization as a product of two
[indiscernible] matrices U and V, and we compute U and V by minimizing the error loss over the observed
entries of Y, so that you get a low-rank embedding that is close to your annotations. And for the
classifier prediction part, we just tried the simplest binary relevance logistic regression, so basically
what it does is train a logistic regression for each label. And we modified the standard logistic
regression loss to factorize over only the observed entries of Y. And remember, what we really want to do
is to carry out those two parts jointly, so that when we complete the label matrix, it can take feature
information into consideration, and also our trained classifier can learn from the low-rank embedding. So
we close this loop by adding a term that minimizes the KL divergence between our classifier output and
the low-rank embedding. In this sense, we are forcing consistency among the classifier output, the
low-rank embedding and our observations. The other way to
understand this model is that we have encoded the structural properties, low rank and sparse, onto the
low-rank embedding Q we computed, and by encouraging Q and P to be close to each other, those soft
constraints propagate from Q to P. Hence, Q acts as a regularizer on the structure of the output of the
classifier. And this is the philosophy of the technique called posterior regularization. We thereby named
our model posterior regularization low rank, short for PR-LR. The joint objective defined in the previous
slides is non-convex, unfortunately, but it can be optimized using an EM-style algorithm. We initialize
all the variables properly, and in the M-step, we fix our low-rank embedding and update the classifier
parameters. In the E-step, we fix the classifier output and compute the low-rank embedding again. Both
the E-step and the M-step can be done efficiently using stochastic gradient descent, and the entire EM
process converges fairly fast, within a few iterations. So here we go over some of our experimental
results. We basically
evaluated the models over three data sets. All of them are gathered from social bookmarking
websites, BibSonomy and Delicious, and all of the data sets have a fairly large number of labels, ranging
from near 200 to near 1,000. This shows the number of training examples we have, the number of features
and the number of labels. We calculate the evaluation metrics based on the label-average and
example-average F1-measure. Both of them are calculated using just the user annotations provided. So the
first set of experiments is to test the advantage of doing joint modeling, so we compared with three
baseline models that can be seen as subcomponents of our model. The first one is just a naive binary
relevance classifier, which is trained by treating the missing labels as inactive. The second one is
basically the same as the first one, but the loss function is defined over only the observed entries, and
we also add a sparsity regularizer on the output. The third one is to do label completion and classifier
training in two separate stages. And the last one is our model. We can see here that by ignoring the
negative examples we could have a fairly strong baseline already, but on those three data sets, our
model, which uses label correlations and does joint inference, has a significant and consistent
improvement. We also compared with some of the standard multi-label learning algorithms, which have
previously published state-of-the-art results, but both of them are trained treating all the missing
labels as inactive. And here, we see that our model has performance close to the best of the state of the art
here, HOMER. Especially on the big test data sets, our improvement is quite significant. And
also, it's worth mentioning that the training process of HOMER is very complex, so it didn't
finish within one week on our largest data set. On the other hand, our model is quite scalable.
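To make the joint objective described earlier concrete, here is one plausible form assembled from the pieces mentioned in the talk -- a reconstruction loss on the observed entries of Y, a logistic loss on the observed entries, and a KL term tying the classifier output P to the low-rank embedding Q; the exact losses, weights and regularizers in the paper may differ.

```latex
\min_{U,V,\theta}\;
  \underbrace{\sum_{(i,j)\in\Omega}\bigl(Y_{ij}-(UV^{\top})_{ij}\bigr)^{2}}_{\text{low-rank completion}}
+ \lambda_{1}\underbrace{\sum_{(i,j)\in\Omega}\ell_{\log}\bigl(Y_{ij},\,P_{\theta}(y_{j}\mid x_{i})\bigr)}_{\text{binary relevance logistic regression}}
+ \lambda_{2}\underbrace{\sum_{i,j}\mathrm{KL}\bigl(Q_{ij}\,\Vert\,P_{\theta}(y_{j}\mid x_{i})\bigr)}_{\text{posterior regularization}}
+ \lambda_{3}\bigl(\lVert U\rVert_{F}^{2}+\lVert V\rVert_{F}^{2}\bigr),
\qquad Q=\sigma(UV^{\top})
```

Here Omega is the set of observed entries and sigma squashes the low-rank scores into [0, 1]. The EM-style procedure in the talk then alternates: fix Q and update the classifier parameters theta (M-step), then fix the classifier output and recompute the low-rank embedding (E-step), both with stochastic gradient descent.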
So in conclusion, I presented how, for a complex structured prediction problem, gathering complete
annotations can be an obstacle for designing a machine learning algorithm, and how our prior knowledge
about the desired output is very effective in providing extra supervision for pruning the learning space.
And posterior regularization is an effective way to achieve this goal; it works by enforcing soft
constraints on the classifier output. As future work, we would like to design better evaluation methods,
especially for the cases where we only have incomplete annotations. And also, we want to test the promise
of our model across domains. And lastly, we should compare to some of the recent machine-learning work on
learning with incomplete labeling. Thank you.
>>: Thank you very much.
>>: Okay, we have time for a question here. Giuseppe?
>>: I was wondering if you use any prior knowledge to initialize the M, what kind of prior
knowledge?
>> Victoria Lin: That's a good question. I think the answer is you can, but we didn't. We just
randomly initialized all the variables. For example, the low rank matrix here, we just initialized
them using random values for the entries, and hopefully the learning process would help you to
get from the initialization to a local minimum.
>>: Great. Thanks, and I think one quick question while we're still setting up there. Any
others? Yes. I'll bring the mic to you back here.
>>: Thanks. So the prior knowledge you encode is that there's a low rank kind of completion --
>> Victoria Lin: Yes, low rank and sparse, basically.
>>: So do you imagine extending the kind of prior knowledge you can put into that kind of
optimization? Like, for example, biases in terms of how the data is incomplete?
>> Victoria Lin: Right, right. I think the question is that the data might be incomplete in a biased way,
so certain labels tend to be missing. I think the answer is still that you can, because the joint model
we present here is very flexible, actually. We didn't make extra assumptions on how you do the label
completion and how you do the training. So basically, if you have a better way to model the label
completion part, you could just add it into the framework. I think a similar alternating minimization
would still apply here. But I think that part is worth trying, because I think that's really the question
we should ask.
>>: Thank you again very much. Okay, we have a couple more talks to go here, so welcoming
my colleagues from Vancouver again back to the podium here, we're going to be hearing a
presentation on evaluating open relation extraction over conversational text. Over to you.
>> Yashar Mehdad: Thank you. Hi, again. Well, this is the work that was done by one of our
grad students, Mahsa, but since she couldn't be here today, I am going to present her work. The
work is more a preliminary result and preliminary work about how we can use relation extraction
tools or open information extraction tools for summarization. And mainly, we're trying to look
at some results, evaluation results, to see whether we can use those tools for summarization or
we need some adaptation. Well, as I talked to you about conversational data in my previous talk,
we have very precious source of information in conversational data, so we are already interested
to dig into those data, to run some analytics on data, to gain some information from such data.
And, as we know, it grows fast, it grows exponentially, and we need some way to deal with the
information overload you get from such data. So the intuition behind that is for any kind of text
processing or text analytics that you're going to run over conversational data, what you need is to
extract some valuable information inside the text. So one of the things that we look at is information
extraction, specifically open information extraction, where the relations are not predefined, so it's
pretty open. We decided to look at these tools and see if we can extract valuable information, and then
from there, we run our summarization and use such information in our summarization system. So open
relation extraction tries to extract relation triples: two entities and a relation. I have one example
here, with Facebook and Whatsapp: argument one is Facebook, argument two is Whatsapp, and there is the
relation between them. There are many
advantages of using open relation extraction and open information extraction tools, and one of
them is that these tools are available nowadays. Many groups are working on that, trying to
improve them. They are scalable so you can run, then you can extract the relations quite robustly
nowadays, and you don't need much if any domain-specific knowledge to run those tools or to
run those systems on your data. So we were mainly motivated to use open information
extraction for summarization, especially because we have quite a few number of state-of-the-art
open IE systems called ReVerb, OLLIE, which is an improved version of ReVerb, SONEX,
TreeKernel and EXEMPLAR. But the fact is that conversational data are not like news. They
are very noisy. They are less structured. So we're dealing with more problems with
conversation, so it's not like a result that you get when you run open information extraction on
news would be the same result that you could expect that you'd get over conversations. Talking
about not only forums and blogs but let's say tweets. They are very short, less structured, full of
acronyms, and that makes it difficult for such systems to extract the relations or information.
And we know, actually, that most of text preprocessing systems or tools that we use for different
applications in our research, when we run it on the conversational data are such noisy data, the
performance really decreases. So, for example, when you run named entity tagger or syntactic
parser over tweets, you absolutely get nothing if they are not trained on the domain-specific data.
Or we know that about 8% of missed extraction, 32% of incorrect extraction in those famous
ReVerb, OLLIE, are from incorrect parsing, so if we fix the parsing, we can solve those kind of
problems, as well. So we know that there are many challenges with dealing with conversational
data. So in this preliminary work, we first of all try to sample a good data set from different sources
of conversational data, from forums and blogs to e-mails, meetings and tweets. We then run and evaluate
the current open information extraction systems on them. At the same time, since some of the sentences
from those kinds of data sets are not really well formed or are quite complex for the systems to
understand, we try to see whether text simplification techniques can help us do better information
extraction or not. Those are basically the contributions of our work. So for data set creation, we have
this set of conversational data sets. We have reviews, we have e-mails, we have meetings, we have blogs
and online discussions, and we have social networks. What we need to do is take a good sample of each
data set, put them together and run the systems, but sampling from such a varied and huge collection is
not a simple randomization problem; you need to find a quite consistent sampling method to get your data
set. So what we did was use [indiscernible] sampling based on some features -- some conversational
features and some other features from the data sets -- to get distributions that represent the many
features that exist in conversational data sets. So here is a representative set of the features we use:
we have syntactic features and conversational features. We try to use them to cluster our data sets and
sample from
those. For text simplification, we know that complex sentences can be simplified using those
methods. So we use one of the state-of-the-art text simplification systems, called TriS, which is a kind
of statistical sentence simplifier. It's a very interesting model, so we try to use that.
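A minimal sketch of the comparison summarized next -- extracting triples from the sampled sentences with and without simplification and tallying the three metrics -- might look like the following; extract and simplify are hypothetical stand-ins for OLLIE and TriS rather than their real interfaces, and correctness comes from manual judgments.

```python
def extract_all(sentences, extract, simplify=None):
    """Run an open IE extractor over sentences, optionally simplifying each sentence first.
    extract(sentence) is assumed to return (arg1, relation, arg2, confidence) tuples."""
    triples = []
    for sent in sentences:
        triples.extend(extract(simplify(sent) if simplify else sent))
    return triples

def metrics(triples, is_correct):
    """Number of relations, accuracy under manual judgments, and mean extractor confidence."""
    n = len(triples)
    accuracy = sum(is_correct(t) for t in triples) / n if n else 0.0
    confidence = sum(t[3] for t in triples) / n if n else 0.0
    return {"relations": n, "accuracy": accuracy, "confidence": confidence}

# Usage, comparing the two settings per data set:
#   plain      = metrics(extract_all(sample, extract=ollie_like), judge)
#   simplified = metrics(extract_all(sample, extract=ollie_like, simplify=tris_like), judge)
```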
So in summary, we ran the information extraction system OLLIE, which is one of the state of the art, over
our data set. We evaluated the results manually. We simplified the data set, ran it again, and then
compared to see how the results change. As for how we evaluate, we have the metrics of the number of
relations extracted, the accuracy of the relation extraction and the confidence score coming from OLLIE.
So for the results, I am giving you the important ones. We can see that, in most cases, the text
simplification system helps the information extraction, which is one of the things that we wanted to
know, to decide whether we want to use text simplification before running information extraction or not.
At the same time, we know that the results for different data sets are different, and across different
modalities, whether we run simplification or not, the results differ in a number of cases. For example,
when we run OLLIE itself, the results on the Slashdot and BC3 corpora are the best, but the review corpus
is the worst, and when we simplify, the BC3 corpus is still the best and the reviews are still not very
good. We found that text simplification is quite effective, especially when we have data like Twitter
data, as shown in our experiments, but it is not very effective in increasing the accuracy of OLLIE --
open information extraction -- on the Slashdot data set. There is some analysis in our paper; you can
have a look. So in conclusion, with conversational data sets we have lots of challenges. We have very
complex kinds of text in terms of noise and lack of structure. We found that e-mails and blogs are kind
of easier for relation extraction than product reviews, so probably for product reviews and relation
extraction, we should devise some other ways of using such knowledge to do summarization. If you come to
our poster, we talk about review
summarization as well, and for future work, we are thinking of using some systems in
combination or different simplification methods. Also, we are trying to take advantage of such
relations that are extracted for summarization. Thank you very much.
>>: Thank you. Okay, we have time for a question or two.
>>: So by extracting relations, are you also interested in the relations between utterances, talking
about --
>> Yashar Mehdad: Between utterances?
>>: Yes, such as --
>> Yashar Mehdad: Exactly, we are. We are very interested in that, and this is actually one
of the works that we are working on. So, for example, the rhetorical structure can be very
interesting to see the structure of any conversation and use that for summarization. We have
some insights and some work done. We are at the evaluation phase, so if you come to the review
summarization poster, I will talk more about the conversational feature we use in our framework.
>>: Great, thank you very much.
>> Yashar Mehdad: Thank you.
>>: For our last talk of the morning, I am pleased to be able to introduce to you Stephanie
Lukin, who is going to be talking to us today about identifying subjective and figurative language in
online dialogue. Over to you.
>> Stephanie Lukin: Okay. Hi. Thank you. My name is Stephanie Lukin, and I'm going to talk
about identifying social language in online dialogue, specifically focusing on nastiness and
sarcasm. So many of the current NLP tools are based on this monologic model of language from
traditional media, excuse me, such as the news, but as we've been hearing throughout the day so
far, social media is becoming more prevalent, and the style of this language is very different. It
consists of dialogues and emotions and informalities, so in our work, we're interested in creating new
models of dialogue by taking advantage of the abundance of all this social data. In 2012, Walker et al.
released the Internet Argument Corpus, which consists of annotated exchanges from online debate data. It
was annotated by Mechanical Turkers. They were shown a dialogic turn like the one up here and they were
asked to evaluate the overall language of the responder, which is in bold up here. A variety of types of
language were annotated along
the side, but of all these styles, we are specifically interested in sarcasm. But how can we define
sarcasm? It's very difficult. There's a lot of different definitions, people don't interpret it in the
same way, and furthermore, there's theoretical work that claims you need the context or you need
world knowledge in order to determine if something is sarcastic. Just an important note, in the
previous Mechanical Turk task, when we asked people to identify if an utterance was sarcastic or
not, we said just -- we didn't give them any definition of sarcasm. We just said is this sarcastic,
in your opinion. So in this work, we're hoping to examine these utterances that they picked out
and try to home in on the aspects that they believed are actually sarcastic. So looking at our forums
data, we found sarcasm is present in about 10% of the data, so it is prevalent, but it's still pretty
scarce, and it's expensive to have human annotators collect more data. So we think it would be useful to
have a technique with which we can learn sarcasm from a small amount of well-labeled, annotated data. So
we look at a method
designed to do just this. It's from Thelen and Riloff and Riloff and Wiebe in 2002 and 2003.
And we recreate their method by applying it to our data in hopes we can learn new labels from the small
amount of labeled data that we have. And we also decide to look at nasty language in
addition to sarcastic language, just to see if this Riloff and Wiebe method can generalize well to
the domain of dialogue language. So this is their method, in summary. They first develop cues.
So in their original task, I'll point out, it's a task looking at detecting subjectivity versus
objectivity in a monologic domain. So the first thing they do is they develop cues for identifying
subjectivity. They use these learned cues to train a cue-based classifier, and the goal of the classifier
is to maximize precision at the expense of recall. They use this classifier as a first
approximation on a large amount of un-annotated data. And then because the precision from this
classifier is high enough, they can take these predicted labels and learn new patterns from this
data, specifically syntactic patterns. They then bootstrap this process and learn new
patterns on un-annotated data. So in our work, we found, as we were following this process, that we
can't achieve this high precision with just the cue words. So we took the cue-based classifier we
developed, we also developed a pattern-based classifier, and we
combined them together. And in the end, we achieve fairly high precision for this task. So the
first step is to develop these cue words for our sarcasm and nastiness domain, but as I mentioned,
the theoretical work says that you may need context in order to determine if something is
sarcastic. So we do have labeled data already from our corpus. We run a simple statistical
analysis to just select unigrams, bigrams and trigrams, but we also want to see what providing
context will do. So we create a Mechanical Turk HIT where we showed them the quote and the response and
then asked them to pick out things in the response that people could think are sarcastic. So from this,
here are some examples of the highest-rated ones we have. Our statistical test was chi-squared; MT is
Mechanical Turk. So there is some overlap. You can see oh and oh yeah appearing in both of them for
sarcasm. But we also point out that in the chi-squared results, things like we appear, and we think this
could just be an overtraining issue. So we use these cues we found and make the cue-based classifier. To
rank the cues in order of importance, we use their frequency and their reliability as found in our
development set. We train over a variety of data one and data two. We make a classification if two or
more of these cues are present above these thresholds. And here are our results for sarcasm and
nastiness. But as I
mentioned before, our goal is to create a high-precision classifier, and this is not very high
precision. So we do notice that nastiness does better than sarcasm, so this may be just a first
indicator that maybe nastiness is easier to identify using these cue words than sarcasm is. So
okay, we don't get a very high precision, but we continue in the process of next learning syntactic
patterns, and we come back to this problem later. So in the pipeline, okay, yes, so next we learn
these syntactic patterns, and the point is that they can generalize across utterances. They don't
have to be exact surface matches like the cue words, so these are the templates that were used
from the original Riloff and Wiebe work. As some more examples, this template -- noun, prep, noun
phrase -- can match any of these examples that we find in our text. But then we ask the question: are
these templates tailored to the subjectivity domain for which they were developed? So we also look at our
data and try to develop our own sarcastic cues, or sarcastic patterns, so here are some examples: OH
adverb -- OH right and OH sorry -- which we see appear a lot in our data. So we
run our pattern classification, and what we call baseline here is just using the Riloff and Wiebe
syntactic patterns without our new cues or without our new patterns. So there is an increase,
especially in nastiness. This is very good. And in sarcasm, it is better than the cue-based
classifier was previously. And then looking at the new patterns, we see that they help very much
in sarcasm, which was what we were expecting, but in nastiness, the precision jumps but recall
drops 5%. That is maybe just 10 utterances, or some very small number. In this case, we did not develop
patterns for nastiness, so that could be a reason why this is just a little strange.
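A minimal sketch of the cue-based rule just described (fire when at least two cues are present above frequency and reliability thresholds), plus simple pattern matching and the or/and combinations discussed next; the thresholds, cue statistics and POS templates here are illustrative placeholders, not values from the paper.

```python
def cue_classify(tokens, cue_stats, min_freq=5, min_rel=0.6, min_hits=2):
    """cue_stats maps an n-gram cue string -> (frequency, reliability) from the development set.
    Predict sarcastic/nasty if at least min_hits cues occur above both thresholds."""
    text = " ".join(tokens)
    hits = sum(1 for cue, (freq, rel) in cue_stats.items()
               if cue in text and freq >= min_freq and rel >= min_rel)  # crude substring match
    return hits >= min_hits

def pattern_classify(tagged_tokens, templates):
    """tagged_tokens: list of (word, POS) pairs; templates: POS-tag tuples such as
    ('NN', 'IN', 'NN'), standing in for templates like 'noun prep noun phrase'."""
    tags = [pos for _, pos in tagged_tokens]
    return any(tuple(tags[i:i + len(t)]) == t
               for t in templates for i in range(len(tags) - len(t) + 1))

# The two combinations compared in the talk: "or" trades precision for recall,
# "and" trades recall for precision.
def combined_or(tokens, tagged_tokens, cue_stats, templates):
    return cue_classify(tokens, cue_stats) or pattern_classify(tagged_tokens, templates)

def combined_and(tokens, tagged_tokens, cue_stats, templates):
    return cue_classify(tokens, cue_stats) and pattern_classify(tagged_tokens, templates)
```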
Okay, so we've gone through this pipeline, but as I mentioned a couple of times now, the cue-based
classifier is not achieving the high precision that we need it to in order to make this actually work.
So, since we have a trained cue-based classifier and a trained pattern classifier, we combine them
together, in two variants. We make a classification if either the cue-based or the pattern-based
classifier says yes, this is sarcastic -- that's the or in the table below. And then we have another
classifier where they both have to say yes, this is sarcastic -- that's the and in the table. So
comparing to the cue-based classifier, which is the yellow, the
51%, the or does much -- it does much better in terms of recall. Because we're using both
classifiers, we're getting a lot more of the data. Precision is still increasing, as well, and for and,
we do very well in terms of precision, but as expected, recall is lower because we're being more
specialized and focused in our classifications. And in nastiness, we see the same results, where the
or-based classifier does better in terms of precision than the cue-based one, and then in
the and, our precision is 88% but with a lower recall of 30%. So in conclusion, the goal of the Riloff
and Wiebe work was to develop this high-precision classifier that we can use as a first approximation and
then learn from a large amount of un-annotated data -- learn syntactic patterns from it. We couldn't
really achieve that with our cue-based classifier, but our combined classifier is on the right track. So
we've learned that context is pretty important for these cues. As shown, on their own they're not doing
as well as we expect, but the patterns do generalize well, especially our syntactic patterns. And in
future work, we're going to look at un-annotated data, run our classifier on the un-annotated data, have
human annotators compare, and we're hoping to
get a high amount of agreement and overlap.
>>: Thank you very much.
>> Stephanie Lukin: Thank you.
>>: So we have time for a few questions, and you probably want a nasty question so you can use
a different [indiscernible].
>>: Hi, there. So very interesting, thank you. I was wondering, the Turkers review the things
and annotate them for sarcasm, whatever. Do you have a sense of what the inter-annotator
agreement was like? Did the Turkers generally agree on what was sarcastic, or was there a lot of
variability there?
>> Stephanie Lukin: There was -- I can't remember the exact numbers, but for the utterances that we
selected for our study, we made sure that there was high agreement on those annotations. We had about
seven annotators per utterance, and we picked ones where four or more agreed that it was sarcastic.
>>: Great. Thank you very much. Turning it over to you, thanks very much for this morning's
session, and announcements relating to lunchtime.