Building and processing a dataset containing articles related to food adulteration

Building and processing a dataset containing articles
related to food adulteration
by
Deepak Narayanan
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2015
© Massachusetts Institute of Technology 2015. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science
May 22, 2015
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Regina Barzilay
Professor
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Albert Meyer
Chairman, Masters of Engineering Thesis Committee
Building and processing a dataset containing articles related
to food adulteration
by
Deepak Narayanan
Submitted to the Department of Electrical Engineering and Computer Science
on May 22, 2015, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Computer Science and Engineering
Abstract
In this thesis, I explored the problem of building a dataset containing news articles
related to food adulteration, and curating this dataset in an automated fashion. In particular, we looked at food–adulterant co-existence detection, query reformulation,
entity extraction, and text deduplication. All proposed algorithms were implemented
in Python, and performance was evaluated on multiple datasets. The methods described
in this thesis can be generalized to other applications as well.
Thesis Supervisor: Regina Barzilay
Title: Professor
Acknowledgments
First and foremost, I would like to thank my advisor, Prof. Regina Barzilay, for giving
me my first real opportunity at academic research. Without her feedback, this thesis
would not be possible.
I would also like to thank Prof. Tommi Jaakkola for his insightful comments – a
lot of the initial work in this thesis was a by-product of suggestions made by him.
Rushi Ganmukhi has been a major contributor to the FDA project over the last
year, and a lot of the work in this thesis was done in close collaboration with him.
To the other members of the RBG group – thank you! Attending weekly group
meetings has given me a glimpse into how academic research is done. In particular, Karthik Narasimhan and Tao Lei were always extremely willing to answer my
questions, however dumb.
Last, I would like to thank my mom for always pushing me to become better,
both as a person and as a student. I would not be here without her support and
dedication.
Contents
1 Introduction

2 Food and adulterant co-existence
  2.1 SVM classifier
  2.2 Collaborative filtering classifier
  2.3 Passive-aggressive ranker
  2.4 Evaluation
    2.4.1 Collaborative filtering classifier
    2.4.2 Passive-aggressive ranker

3 Query refinement and relevance feedback
  3.1 Pseudo-relevance feedback with a transductive classifier
    3.1.1 Transductive learning
    3.1.2 Relevance feedback
  3.2 Query template selection
  3.3 Filtering for adulteration-related articles using a Language Model
    3.3.1 Evaluation

4 Article clustering and record extraction
  4.1 Named Entity Recognition using Conditional Random Fields
    4.1.1 Mathematical formulation
    4.1.2 Training
    4.1.3 Finding optimal label sequence
    4.1.4 Entity recognition
    4.1.5 Evaluation
  4.2 Clustering
    4.2.1 Clustering using k-means
    4.2.2 Clustering using HCA
    4.2.3 Evaluation
  4.3 Combining entity recognition and clustering
    4.3.1 Mathematical tools
    4.3.2 Task setup
    4.3.3 Datasets
    4.3.4 Evaluation

5 Contributions and Future Work
  5.1 Contributions
  5.2 Future Work
List of Figures
1-1 Different data sources that can be used to extract information about adulteration
3-1 Example of query refining
3-2 Example of patterns extracted from tweets describing music events in New York
3-3 Design of Snowball system
4-1 CRF model – here π‘₯ is the sequence of observed variables and 𝑠 is the sequence of hidden variables
4-2 Example clustering of documents using HCA
4-3 Pseudocode of HCA algorithm, with termination inter-cluster similarity threshold
List of Tables
2.1 F-measures obtained with collaborative filtering with different 𝑑 values
2.2 MAP scores on adulteration dataset
4.1 Precision, recall and F1 of NER on training data
4.2 Precision, recall and F1 of NER on testing data
4.3 F-measure of clustering algorithms with and without keyword extraction on Google News articles
4.4 F-measure of clustering algorithms with and without keyword extraction on set of 53 adulteration-related articles
4.5 F-measure of clustering algorithms with and without keyword extraction on set of 116 adulteration-related articles
4.6 Recall and precision of record extraction on Twitter corpus
4.7 Precision, recall and F1 of NER on Twitter event data
4.8 Recall and precision of record extraction on adulteration datasets
4.9 Recall, precision and F-measure of clustering using combined approach on adulteration dataset
Chapter 1
Introduction
Over the last couple of years, instances of food adulteration have been on the rise.
Here in the United States, the Food and Drug Administration (FDA) regulates and
enforces laws over food safety. With food-related illnesses becoming increasingly common, the FDA has recently become more interested in identifying and understanding
the risks associated with imported FDA-regulated products.
The reasons for increased risk are often varying and hard to pinpoint. For instance,
a couple of years ago, milk imported from China contained an increased amount of
melamine (a chemical that can be extremely harmful to human beings but often
added to foods to increase their apparent protein content) – this was later attributed
to the increased grain prices in China. In order to make ends meet and cut costs,
farmers added melamine to heavily watered-down milk to keep the protein content of
the diluted milk high.
In addition, since food adulteration is largely decentralized, one of the primary
challenges with understanding the causes of such risks is being able to piece together
a number of disparate data sources. With the unified data, it would be possible
to run inference models to obtain timely prospective information about instances of
increased risk.
The main goals of this thesis are two-fold – one, building a system capable of retrieving documents (in particular, news articles) that talk about specific adulteration
incidents, and two, developing algorithms that are capable of curating this dataset
and making it more usable in an automated way.
Retrieval of relevant news articles
Our approach to retrieving all news articles related to food adulteration largely relies
on the ability to determine if a particular food product and adulterant can co-exist
with each other. To this end, much of the initial work presented in this thesis is concentrated on building a classifier that is capable of determining if a food product and
an adulterant can co-exist. We have information about some cases of adulteration,
but have no information about most (food, adulterant) pairs. Given that similar
food products are more likely to be adulterated by the same adulterants, and symmetrically, similar adulterants are more likely to adulterate the same foods, it seems
possible to build a classifier that predicts such co-existences, using information about
foods and adulterants that is easily available.
One approach that we considered involved using a standard classification algorithm (such as SVM) to classify pairs of foods and adulterants – here each pair has
some feature representation, and we try to classify each such vector as positive or
negative. One inherent problem with this method is that it doesn’t really take into
account interactions between different food and adulterant features (even though some
of these relationships can potentially be exposed by using kernels). It seems much
more natural to solve this problem using collaborative filtering – a technique used by
many recommender systems.
Collaborative filtering is very loosely defined as the process of filtering for information or patterns using techniques that combine multiple viewpoints or data sources.
Netflix extensively uses collaborative filtering for movie recommendations. Most people have not watched all movies, so a good way of predicting a particular person’s
reaction to a movie is to first find people with similar movie tastes, and then use the
reviews of these like-minded people to generate a predicted rating.
The two approaches we’ve described so far face the same inherent problem – for
our particular task there are no negative training examples (because we only have
information about a food product and adulterant co-existing, we don’t have any
information about a food product and adulterant not co-existing) – a naive way of
overcoming this problem is to cast all (or just a subset of) (food, adulterant) pairs not
seen in our data sources as negative in the training step.
However, this seems like a poor way to solve this problem since some of these (food,
adulterant) pairs may in fact co-exist (and with this approach we would mark them
as negative in training). Moreover, our main goal remains trying to make predictions
about unseen food and adulterant pairs – labeling all of these pairs as negative in our
training seems at odds with the goals of our task, and makes it less likely that we
will in fact be able to discover new co-existing (food, adulterant) pairs. Later in this
thesis, we present an approach that attempts to circumvent this problem in a more
natural manner, by looking at the problem from the perspective of ranking.
Once we have a sufficiently robust classifier which can predict co-existences with
reasonable accuracy and precision, the next important step is to develop an automated
process that not only generates queries given food products and adulterants that coexist with each other, but also retrieves documents relevant to the generated query.
We assume here that we have access to a standard information retrieval system like
Google that is capable of retrieving documents relevant to a particular query.
Our query formulation and information retrieval tasks are in fact intertwined,
since the only way we can really evaluate the performance of the queries generated by
the above process is by evaluating the quality of the documents generated by running
a retrieval task with the given query. We talk about some ways of generating suitable
queries given metadata about the incident in Chapter 3 of this thesis.
At the end of chapter 3, we briefly discuss another way of looking at the article
retrieval problem.
Curation of dataset
Creating a dataset of incidents that is easily usable isn’t confined to merely retrieving
documents that describe an adulteration incident. In order for these articles to be
“understandable” by a computer system, some additional steps are imperative – for
example, we often need to extract recognized entities from an unstructured blob of
text, because computer programs understand structured information a lot better than
unstructured information.
Furthermore, extracting well-defined fields from unstructured text allows us to
access subsets of our dataset in much easier ways. For example, extracting the food
and adulterant from every adulteration-related article allows us to look at adulteration
incidents involving “milk” and “melamine” easily. We present results of running the
current state-of-the-art entity recognition system, which makes use of a CRF, on our
dataset of adulteration articles.
In addition, since articles are often extracted from a number of diverse sources,
articles describing the same incident may in fact be lexically very different. Figuring
out that these articles are “duplicates” is not as trivial as comparing two strings.
We model deduplication as a clustering task, where we’re trying to classify articles
related to the same incident into the same cluster, and classify unrelated articles into
different clusters.
One can also think of making this task online – as soon as we retrieve an article,
we can run our duplicate detection algorithm to determine if the new article is a
duplicate of an existing article. This seems especially important when we’re using
our algorithms to detect potential new adulteration incidents in real time.
The majority of chapter 4 describes an approach using variational inference to
combine the above two tasks – record extraction and clustering – to reduce the noise
associated with running each of these tasks individually. Our hope is that articles
talking about the same incident should contain the same entities, and articles with
the same entities should be classified into the same cluster.
Use cases of this dataset
Figure 1-1: Different data sources that can be used to extract information about adulteration

Once we have a database of news articles related to cases of adulteration, we can
move on to building models that attempt to understand the types of events that may
lead to elevated risks. These indicators are most likely to be some complex pattern
at the intersection of multiple data sources (for example, events now could lead to an
increased number of adulteration incidents four months down the road), which is why
it's imperative to first build a reliable retrieval system that is capable of collecting
data from these multiple data sources.
Note that even though the techniques presented in this thesis were tuned for
the specific application of retrieving adulteration-related articles, the overall techniques are easily generalizable and can be applied to other, more generic, information
retrieval tasks.
Chapter 2
Food and adulterant co-existence
Our approach to retrieving articles requires a relatively comprehensive list of coexisting foods and adulterants. Keeping this in mind, the first task we look at is the
task of detecting whether a food and adulterant can co-exist with each other. With a
classifier capable of predicting food-adulterant co-existence with reasonable accuracy,
we can retrieve news articles that describe an incident involving the particular food
and adulterant.
2.1
SVM classifier
Our first approach, which is also the most naive approach, is to build a simple Support
Vector Machine classifier that classifies (food, adulterant) pairs as positive or negative
depending on whether the food and adulterant indeed co-exist or not.
Given a feature vector representation π‘₯ for a food, and a feature vector representation 𝑧 for an adulterant, we construct a feature vector representation for a (food,
adulterant) pair as vec(π‘₯𝑧 𝑇 ), which allows food features and adulterant features to
interact jointly (unlike concatenating π‘₯ and 𝑧 for example) – for example, if π‘₯ has a
feature which takes value 1 if the food is high in protein content and 0 otherwise, and
𝑧 has a feature which takes value 1 if the adulterant is melamine and 0 otherwise,
then the (food, adulterant) pair would have a feature which has value 1 if the food is
high in protein content and the adulterant is melamine – such a representation gives
19
the classifier more information to work with.
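As a small illustration of this construction (a minimal sketch; the specific feature names are hypothetical and not the feature set used in this thesis), the pair representation is just the flattened outer product of the two vectors:

```python
import numpy as np

# Hypothetical food and adulterant feature vectors (binary indicators).
x = np.array([1.0, 0.0, 1.0])   # e.g. [high in protein, dairy, imported]
z = np.array([0.0, 1.0])        # e.g. [heavy metal, melamine]

# Feature vector for the (food, adulterant) pair: vec(x z^T).
# Entry (i, j) is 1 exactly when food feature i and adulterant feature j are both 1,
# so the pair gets a feature that fires only for "high-protein food AND melamine".
pair_features = np.outer(x, z).ravel()
print(pair_features)            # length d1 * d2
```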
Given the feature vector representation of a (food, adulterant) pair, we try classifying different pairs as positive or negative by attempting to find a separator between
the positive and negative pairs seen in the training set. Note that this separator need
not be a straight line and can take different shapes depending on the type of kernel
used.
However, we observed that this approach has the following fundamental problems,
βˆ™ Our training data fundamentally doesn’t have negative examples, because we
can only observe positive examples.
Not observing a (food, adulterant) pair can mean one of two things – one,
the food and adulterant indeed can’t co-exist, or two, the food and adulterant
co-exist but we just don’t have evidence of them co-existing.
In our linear classifier approach, we are forced to have some negative examples
provided to the classifier as part of the training data; otherwise our classifier
will learn to classify every example as positive! We can choose a subset of (food,
adulterant) pairs that we haven’t seen to be “negative examples” – it is possible
with this approach to label examples that are truly positive to be negative,
which would lead the classifier to learn incorrect relationships.
βˆ™ A standard linear classifier is also not able to take full advantage of the similarities between different foods and different adulterants. It intuitively makes sense
that similar foods are adulterated by similar adulterants and similar adulterants
adulterate similar foods, but a standard maximum-margin classifier is not able
to leverage this information effectively.
2.2
Collaborative filtering classifier
Collaborative filtering seems more compatible with the task at hand, since collaborative filtering inherently attempts to leverage similarity in different domains to make
better classification decisions.
Our collaboratively-filtered classifier can be mathematically formulated as finding
a linear classifier of the form π‘₯𝑇 π‘ˆ 𝑉 𝑇 𝑧, where π‘₯ is the feature vector representation
of a food product and 𝑧 is the feature vector representation of an adulterant, and π‘ˆ
and 𝑉 are weight parameter matrices.
Here, we assume that the food vector π‘₯ has dimension 𝑑1 ×1, the adulterant vector
𝑧 has dimension 𝑑2 × 1, and the parameters π‘ˆ and 𝑉 have dimensions 𝑑1 × π‘‘ and 𝑑2 × π‘‘
respectively (𝑑 is an adjustable parameter).
Training this classifier involves finding the optimal parameters π‘ˆ and 𝑉 that
minimize the loss function. In this case, we will use the quadratic loss function,
but other loss functions can be used as well. Mathematically, our task is to find
parameters π‘ˆ and 𝑉 that minimize total loss given as,
𝐿(π‘ˆ, 𝑉 ) =
∑︁
(𝑦𝑖𝑗 − π‘₯𝑇𝑖 π‘ˆ 𝑉 𝑇 𝑧𝑗 )2
(𝑖,𝑗)∈𝑇
Here, 𝑇 is the set of food-adulterant pairs (𝑖, 𝑗) in our training set, and 𝑦𝑖𝑗 is the
gold label of the pair (𝑖, 𝑗) – as with the linear classifier approach described above,
we need to choose a subset of (food, adulterant) pairs to be “negative”. As is evident,
this formulation is trying to move predictions as close to true labels as possible.
We can think of this entire setup as a low rank decomposition of a weight matrix
π‘Š.
To prevent overfitting, we introduce a regularization term (the sum of the Frobenius norms of π‘ˆ and 𝑉 ), to obtain the following expression,
𝐿(π‘ˆ, 𝑉 ) =
∑︁
𝐢 · (𝑦𝑖𝑗 − π‘₯𝑇𝑖 π‘ˆ 𝑉 𝑇 𝑧𝑗 )2 + |π‘ˆ |2𝐹 + 𝑉 |2𝐹
(𝑖,𝑗)∈𝑇
Here, 𝐢 is a constant used to trade off between the loss function term and the
regularization term – the higher the value of 𝐢, the more the classifier listens to
training examples.
We want to find values of π‘ˆ and 𝑉 that minimize 𝐿(π‘ˆ, 𝑉 ).
To simplify notation, let us represent 𝑉 𝑇 𝑧𝑗 as 𝑣𝑗 . 𝐿(π‘ˆ, 𝑉 ) is now given by the
expression,

$$L(U, V) = \left( \sum_{(i,j) \in T} C \cdot \big(y_{ij} - \text{vec}(U)^T \text{vec}(x_i v_j^T)\big)^2 \right) + \|U\|_F^2 + \|V\|_F^2$$
Consider the partial derivative of 𝐿(π‘ˆ, 𝑉 ) with respect to π‘ˆ, set equal to zero:

$$\frac{\partial L}{\partial U} = 0$$

$$\Rightarrow 2C \left( \sum_{(i,j) \in T} \big(y_{ij} - \text{vec}(U)^T \text{vec}(x_i v_j^T)\big) \cdot \big(-\text{vec}(x_i v_j^T)\big) \right) + 2 \cdot \text{vec}(U) = 0$$

$$\Rightarrow \text{vec}(U) + C \left( \sum_{(i,j) \in T} \text{vec}(x_i v_j^T)\,\text{vec}(x_i v_j^T)^T \right) \text{vec}(U) = C \sum_{(i,j) \in T} y_{ij}\,\text{vec}(x_i v_j^T)$$
Now, let’s define 𝐴 as

$$A = I + C \left( \sum_{(i,j) \in T} \text{vec}(x_i v_j^T)\,\text{vec}(x_i v_j^T)^T \right)$$

and 𝑏 as

$$b = C \left( \sum_{(i,j) \in T} y_{ij}\,\text{vec}(x_i v_j^T) \right)$$
Note that 𝐴 is a matrix of dimension 𝑑1 𝑑 × π‘‘1 𝑑 and 𝑏 is a vector of dimension
𝑑1 𝑑 × 1.
With these definitions, we can see that

$$A \cdot \text{vec}(U) = b \;\;\Rightarrow\;\; \text{vec}(U) = A^{-1} b$$
We can compute the updated value of parameter 𝑉 similarly, with 𝐴′ defined as

$$A' = I + C \left( \sum_{(i,j) \in T} \text{vec}(z_j w_i^T)\,\text{vec}(z_j w_i^T)^T \right)$$
and 𝑏′ defined as

$$b' = C \left( \sum_{(i,j) \in T} y_{ij}\,\text{vec}(z_j w_i^T) \right)$$
(here, again for simplicity of notation, we define 𝑀𝑖 as π‘ˆ 𝑇 π‘₯𝑖 ). 𝐴′ is of dimension
𝑑2 𝑑 × π‘‘2 𝑑 and 𝑏′ is of dimension 𝑑2 𝑑 × 1.
With these definitions, we see that

$$A' \cdot \text{vec}(V) = b' \;\;\Rightarrow\;\; \text{vec}(V) = A'^{-1} b'$$
As with any collaboratively trained model, π‘ˆ and 𝑉 are initialized as random
matrices of the right dimension. Then, to obtain optimal values of π‘ˆ and 𝑉 , we first
fix π‘ˆ and find optimal 𝑉 , and then fix 𝑉 and find optimal π‘ˆ . We repeat the above
two steps until convergence.
Since we can have vastly different numbers of positive and negative examples, we
set different values of 𝐢 for examples with positive and negative labels. We set 𝐢+
to be equal to 1/𝑛− and 𝐢− to be equal to 1/𝑛+. This ensures that if we have fewer
positive examples, mislabeling a positive example is penalized more heavily. This
is a fairly standard technique used in other learning tasks as well to account for a
disparity in the number of examples of each label available in training.
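A minimal sketch of this alternating procedure in NumPy is shown below (the function name, data layout, and hyperparameters are illustrative assumptions, not the thesis implementation); it solves the two linear systems A·vec(π‘ˆ) = 𝑏 and A′·vec(𝑉) = 𝑏′ directly, which is only practical for small, dense feature vectors.

```python
import numpy as np

def fit_cf_classifier(X, Z, pairs, labels, d=3, n_iters=20, seed=0):
    """Alternating minimization for the bilinear classifier x^T U V^T z (sketch).

    X: (n_foods, d1) food features; Z: (n_adulterants, d2) adulterant features.
    pairs: list of (food_index, adulterant_index); labels: +1 / -1 gold labels.
    C is set per example: 1/n_neg for positives and 1/n_pos for negatives.
    """
    d1, d2 = X.shape[1], Z.shape[1]
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((d1, d))
    V = 0.1 * rng.standard_normal((d2, d))

    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    C = [1.0 / n_neg if y > 0 else 1.0 / n_pos for y in labels]

    for _ in range(n_iters):
        # Fix V, solve A vec(U) = b, where A = I + sum C f f^T and f = vec(x_i v_j^T).
        A, b = np.eye(d1 * d), np.zeros(d1 * d)
        for (i, j), y, c in zip(pairs, labels, C):
            f = np.outer(X[i], V.T @ Z[j]).ravel()
            A += c * np.outer(f, f)
            b += c * y * f
        U = np.linalg.solve(A, b).reshape(d1, d)

        # Fix U, solve A' vec(V) = b', where f = vec(z_j w_i^T) and w_i = U^T x_i.
        A, b = np.eye(d2 * d), np.zeros(d2 * d)
        for (i, j), y, c in zip(pairs, labels, C):
            f = np.outer(Z[j], U.T @ X[i]).ravel()
            A += c * np.outer(f, f)
            b += c * y * f
        V = np.linalg.solve(A, b).reshape(d2, d)

    return U, V  # predict co-existence with sign(x^T U V^T z)
```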
2.3
Passive-aggressive ranker
Both approaches we’ve described so far fail to deal with the lack of negative examples
effectively. We now describe an approach that doesn’t make assumptions about pairs
that we haven’t seen so far.
Instead of fundamentally treating this task as a classification problem where we
need to predict a label given some properties about a food and adulterant, we can
think of this task as a ranking problem where we try to rank pairs that we’ve seen
before higher than pairs we haven’t seen so far.
Let 𝐴 = {𝑦1 , 𝑦2 , . . . , π‘¦π‘š } be the set of all adulterants we have ever seen and
𝐹 = {π‘₯1 , π‘₯2 , . . . , π‘₯𝑛 } be the set of all foods we have ever seen. Then, since we
have seen an incomplete set of (food, adulterant) pairs in our training data, we can
represent our training data as,
π‘₯1
{𝑦𝑖11 , 𝑦𝑖12 , . . . , 𝑦𝑖1𝑠1 }
π‘₯2
..
.
{𝑦𝑖21 , 𝑦𝑖22 , . . . , 𝑦𝑖2𝑠2 }
..
.
π‘₯𝑛
{𝑦𝑖𝑛1 , 𝑦𝑖𝑛2 , . . . , 𝑦𝑖𝑛𝑠𝑛 }
where {𝑦𝑖11 , 𝑦𝑖12 , . . . , 𝑦𝑖1𝑠1 } is the set of adulterants that we know co-exist with the
food π‘₯1 , {𝑦𝑖21 , 𝑦𝑖22 , . . . , 𝑦𝑖2𝑠2 } is the set of adulterants that we know co-exist with the
food π‘₯2 , etc. For simplicity of notation, let us denote the set of adulterants that
co-exist with food π‘₯𝑗 as the set 𝑆𝑗 .
We now want to calculate a weight matrix π‘Š such that for all π‘˜ ∈ 𝑆𝑗,

$$x_j^T W y_k \geq \max_{l \notin S_j} x_j^T W y_l + 1$$
The main intuition here is that all adulterants that are known to co-exist with a
particular food should rank (or score) higher than all other adulterants that we have
no information about in the context of its co-existence with the particular food.
Note that we have no common scale across foods – we only make comparisons
between adulterants for a particular food. Also note that this formulation does not
need to make any assumptions about negative examples in the dataset – this ensures
that the model doesn’t see incorrect relationships in the training dataset.
There are two ways of going about solving this system of inequalities. The first
way involves minimizing the norm of the weight matrix, subject to the specified
constraints – similar to the way a maximum-margin classifier is formulated. This,
however, can blow up computationally.
We can also think of the above system of constraint equations in terms of loss
functions. Instead of satisfying the following constraint equation,

$$x_j^T W y_k \geq \max_{l \notin S_j} x_j^T W y_l + 1$$

we can think about minimizing the loss,

$$\text{Loss}\left(x_j^T W y_k - \max_{l \notin S_j} x_j^T W y_l\right)$$

where $\text{Loss}(z) = \max\{1 - z, 0\}$ (the hinge loss function).
We can now use a passive-aggressive like algorithm to update the weight parameters with every encountered training example. If our initial weight vector is π‘Š , then
for every (food, adulterant) pair (π‘₯𝑗 , π‘¦π‘˜ ) a new weight vector π‘Š ′ can be obtained by
minimizing,
$$\text{Loss}\left(x_j^T W y_k - x_j^T W y_m\right) + \frac{\lambda}{2} \cdot \|W' - W\|^2$$

where πœ† is a parameter that controls how close π‘Š′ is to π‘Š, and $m = \operatorname{argmax}_{l \notin S_j} x_j^T W y_l$.
(We can think of the second term as a regularization term that attempts to prevent
overfitting.)
This results in the following update,

$$W' = W + \eta\left(\text{vec}(x_j y_k^T) - \text{vec}(x_j y_m^T)\right)$$

Here, πœ‚ is the learning rate – the smaller the value of πœ‚, the closer π‘Š′ remains to
π‘Š.
We perform such updates to π‘Š for every (food, adulterant) pair that exists in our
training data, and repeat until convergence.
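A minimal sketch of this training loop, assuming the hinge loss and update above (the function name and data layout are illustrative, not the thesis implementation):

```python
import numpy as np

def train_pa_ranker(X, Y, coexist, eta=0.1, n_epochs=10):
    """Passive-aggressive-style ranker sketch.

    X: (n_foods, d1) food features; Y: (n_adulterants, d2) adulterant features.
    coexist: dict mapping food index j -> set of adulterant indices known to co-exist with it.
    Learns W (d1 x d2) so that known adulterants outscore unseen ones for each food.
    """
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_epochs):
        for j, positives in coexist.items():
            negatives = [l for l in range(len(Y)) if l not in positives]
            if not negatives:
                continue
            for k in positives:
                scores = X[j] @ W @ Y.T                      # score every adulterant for food j
                m = max(negatives, key=lambda l: scores[l])  # highest-scoring unseen adulterant
                if scores[k] - scores[m] < 1.0:              # hinge loss is active
                    W += eta * (np.outer(X[j], Y[k]) - np.outer(X[j], Y[m]))
    return W
```

At test time, adulterants for a previously unseen food π‘₯ can then be ranked by the scores π‘₯ᡀπ‘Šπ‘¦.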
2.4
Evaluation
2.4.1
Collaborative filtering classifier
We measure the performance of our collaborative filtering classifier using a metric
called F-measure which is commonly used in information retrieval.
Before defining what we mean by F-measure, we define two other metrics, called
precision and recall.
βˆ™ Precision: In information retrieval, precision is defined as the fraction of retrieved documents that are relevant. In our task, we represent a (food, adulterant) pair as a document. Mathematically, we can express this as,

$$\text{Precision} = \frac{\text{Number of relevant and retrieved documents}}{\text{Number of retrieved documents}}$$

βˆ™ Recall: Recall is defined as the fraction of relevant documents that are retrieved. Mathematically, this can be expressed as,

$$\text{Recall} = \frac{\text{Number of relevant and retrieved documents}}{\text{Number of relevant documents}}$$

βˆ™ F-measure: F-measure is the harmonic mean of precision and recall. That is,

$$\text{F-measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
We present the results of our collaborative filtering model below.
We ran our experiment on a European database of food adulteration incidents.
We extracted all foods and adulterants involved in adulteration incidents from the
database, and then split all such extracted foods 80 − 20 into training and testing
sets. Training and testing foods were then combined with all known adulterants to get
training (food, adulterant) pairs and testing (food, adulterant) pairs. Our training
set contained 280,000 pairs and our testing set contained 68,000 pairs.
We use a food feature vector representation that consists of a binary expansion of
the food’s food category along with a meta feature vector that is a binary expansion
of the origin country and country in which the incident was reported.
The adulterant feature vector consists of a binary expansion of the adulterant’s
adulterant ID.
                                        Average F-measure over 5 runs
Collaborative filtering with 𝑑 = 1                 0.29
Collaborative filtering with 𝑑 = 3                 0.36
Collaborative filtering with 𝑑 = 5                 0.38

Table 2.1: F-measures obtained with collaborative filtering with different 𝑑 values
We do not present results obtained with our SVM classifier because the results
obtained are extremely low, due to the reasons discussed earlier in this chapter.
The results of our collaborative filtering approach, even though somewhat promising, can still be improved. We hypothesize that not being able to deal with negative
examples in a clean way is the main reason why.
2.4.2
Passive-aggressive ranker
We measure the performance of our passive-aggressive ranker using a different metric
called Mean Average Precision (MAP). As with F-measure, the MAP score ranges
between 0.0 and 1.0, with a higher score indicating better ranking performance.
Before defining what we mean by Mean Average Precision, we explain Average
Precision.
Precision @ 𝐾 is defined as the fraction of relevant documents in the top 𝐾
documents retrieved by a query.
Mathematically,

$$\text{Average Precision} = \frac{\sum_{i=1}^{R} \text{Precision @ } K_i}{R}$$
where 𝑅 is the total number of relevant documents, and 𝐾𝑖 is the rank of the 𝑖th
relevant, retrieved document.
Mean Average Precision is the average precision averaged across multiple queries/rankings.
It is a metric for ranking performance that takes into account the percentage of relevant documents retrieved at various cutoff points – it is order-sensitive, in that an ordering with five relevant documents followed by five irrelevant
documents would have a much higher score than an ordering with five irrelevant
documents followed by five relevant documents.
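A small sketch of how these quantities can be computed (illustrative helper functions, not taken from the thesis code):

```python
def average_precision(ranked_items, relevant):
    """Average precision of one ranking: mean of Precision @ K_i over the relevant items."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)      # Precision @ K_i
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevant_sets):
    """MAP: average precision averaged across multiple queries / rankings."""
    return sum(average_precision(r, s) for r, s in zip(rankings, relevant_sets)) / len(rankings)

# Five relevant documents ranked first score much higher than ranked last.
print(average_precision(list("ABCDEFGHIJ"), set("ABCDE")))   # 1.0
print(average_precision(list("FGHIJABCDE"), set("ABCDE")))   # ~0.35
```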
We present the performance of our passive-aggressive ranker below. We use the
same dataset as above, (all foods are split 80 − 20 into training and testing sets). The
number of foods in the training set is 1273, and the number of foods in the testing set
is 337. The number of adulterants is 124. We train our ranker on the training set, and
then rank all adulterants for foods in the test set, given the gold list of adulterants
co-existing with the particular food.
As with our collaborative filtering classifier, our food vector representation consists
of a binary representation of the food’s category along with metadata information
about the incident. Our adulterant vector representation, as before, consists of a
binary representation of the adulterant’s adulterant ID.
To evaluate performance, we rank all adulterants for every food in our test set,
and compute the MAP score given the gold list of adulterants co-existing with a food
in the testing set, extracted from the European dataset.
                                                               Average MAP
Passive-aggressive ranker on training data                         0.84
Passive-aggressive ranker on testing data                          0.59
Passive-aggressive ranker on both training and testing data        0.75

Table 2.2: MAP scores on adulteration dataset
MAP has a relatively simple interpretation – the reciprocal of the MAP score gives
the rank of the returned, relevant document on average. For example, a MAP score
of 0.5 means that on average, returned relevant documents had a rank of 2; a MAP
score of 0.25 means that on average, returned relevant documents had a rank of 4,
etc.
Given this interpretation, our MAP score of around 0.59 indicates relatively good
performance since it means on average, co-existing adulterants ranked in the top 2
among all adulterants for a previously unseen food product.
Chapter 3
Query refinement and relevance
feedback
Query refinement is a problem that is of much interest in the field of Information
Retrieval. Given that people often find it hard to accurately express their search
needs in the form of a short and concise query, popular search engines such as Google
and Bing use query refinement techniques to improve the quality of search results
returned to the user. Spelling correction and query expansion (Google’s "did you
mean" feature) are popular applications of query refinement.
Very simplistically, query refinement can be thought of as a ranking procedure –
given a query, potential query replacements are generated, which can then be sorted
using a scoring function to generate the “best” refined queries.
The scoring function used is heavily dependent on the task at hand – for example, for spelling correction, a good scoring function is the edit distance between the
candidate query and the original query, since common misspellings can often be attributed to typos. Better scoring functions also take into account context to improve
the quality of results – for example, even though “there” and “their” are both valid
words with similar spellings, they obviously cannot be used interchangeably.
The major problem with query refinement is that for it to be effective, large
amounts of data are needed. In practice, both spelling correction and query expansion depend on query logs (extensive records of how users have searched in the
past) for both a good list of possible candidates, as well as accurate enough data to
meaningfully rank the generated list of candidates.

Figure 3-1: Example of query refining
We don’t really have query logs at our disposal in this problem, which means it
is unclear how to use standard query refinement techniques to solve our problem –
in fact, the problem we are trying to solve involves generating queries that would
retrieve better data, so using a technique that requires a lot of data creates a classic
chicken-and-egg problem.
Relevance feedback is a technique frequently used by information retrieval systems
to improve the quality of search results returned to a user, and does not require a large
amount of data. Relevance feedback uses user input to get a better idea of the type
of results the user is looking for. Relevance feedback algorithms usually work well when
all relevant documents can be clustered together, and when it’s relatively straightforward to distinguish between relevant and irrelevant documents for a particular
query.
Since it’s impractical to expect someone to manually annotate all retrieved documents as good or bad in an offline system that does not accept user feedback in
an easy way, it is necessary to devise a method to use relevance feedback in a more
automated way.
Given a classifier capable of determining if a particular document is a news article
reporting an incident of adulteration, we can use relevance feedback on our problem –
start with some initial version of a query, rate each document in the set of documents
retrieved by that query as appropriate or not using the classifier and then reiterate
on the query until a reasonable number of relevant documents are retrieved. We call
this modified relevance feedback approach pseudo-relevance feedback.
Standard classification algorithms, however, often require a fair amount of training
data to generate reasonably correct results, which increases the amount of annotation
required. In this chapter, we describe a transductive classification approach that
attempts to produce a generalizable classifier with minimal training data.
We begin by describing the general transductive learning setup with pseudo-relevance feedback, and then discuss alternate ways of retrieving relevant documents.
3.1
Pseudo-relevance feedback with a transductive
classifier
We first discuss our transductive classification scheme, and then discuss a potential
pseudo-relevance feedback approach.
3.1.1
Transductive learning
In transductive learning, we try to predict the labels of test examples directly by observing training examples. This is in contrast to inductive learning where we attempt
to infer general rules from training examples, and then apply these general rules to
test examples.
Since transductive learning is trying to solve an easier task (by not trying to learn
a generalizable mapping), transductive tasks most often require less training data
(and hence less annotation) than inductive tasks.
In a transductive inference setup [6], we are given 𝑛 points $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n$, with
true labels $y_1, y_2, \ldots, y_n$.

At training time, we are given the true labels of a subset of points $Y \subset \{y_1, y_2, \ldots, y_n\}$
(known as the training points), and we want to predict the labels $y \in \{-1, 1\}$ of all
other points as accurately as possible.
The main difference between this setup and that of a standard inductive learning
task is that we have access to the testing examples during training.
We want to train a transductive learner that is self-consistent – that is, our classifier should not be overly affected by any one training example. If a classifier is overly
influenced by a couple of training examples, then it is less generalizable and more likely
to make mistakes on examples it hasn’t seen before. A metric for self-consistency is
leave-one-out error.
Consider a π‘˜-nearest neighbor classifier that determines the label of an unseen
point π‘₯ based on the labels of the majority of π‘₯’s π‘˜ nearest neighbors.
Define 𝛿𝑖 as,

$$\delta_i = y_i \, \frac{\sum_{j \in KNN(\vec{x}_i)} y_j}{k}$$
If 𝛿𝑖 < 0, then clearly we will make a leave-one-out error on π‘₯𝑖 , since 𝛿𝑖 being less
than 0 implies that a majority of π‘₯𝑖 ’s π‘˜ nearest neighbors have a label opposite to
π‘₯𝑖 ’s true label.
We can modify the standard π‘˜-nearest neighbor classifier presented above to include weights, creating a weighted π‘˜-nearest neighbor classifier. If 𝑀𝑖𝑗 is the similarity
between data points 𝑖 and 𝑗, then if
$$\delta_i = y_i \, \frac{\sum_{j \in KNN(\vec{x}_i)} y_j w_{ij}}{\sum_{m \in KNN(\vec{x}_i)} w_{im}}$$
we will have a leave-one-out error on π‘₯𝑖 if 𝛿𝑖 < 0.
Given that we want to keep leave-one-out error as small as possible, our goal
should be to make 𝛿𝑖 as large as possible (larger 𝛿𝑖 implies lower error). We can
formulate this as the following optimization problem.
$$\max_{\vec{y}} \;\; \sum_{i=1}^{n} \sum_{j \in KNN(\vec{x}_i)} y_i y_j \, \frac{w_{ij}}{\sum_{m \in KNN(\vec{x}_i)} w_{im}}$$

$$\text{s.t. } y_i = 1 \text{ if } i \in Y \text{ and positive}; \quad y_i = -1 \text{ if } i \in Y \text{ and negative}; \quad \forall j,\; y_j \in \{-1, 1\}$$
Let us define $A_{ij}$ as

$$A_{ij} = \frac{w_{ij}}{\sum_{m \in KNN(\vec{x}_i)} w_{im}}$$

Then we see that for our objective to be maximized, we need to maximize

$$\sum_{i=1}^{n} \sum_{j \in KNN(\vec{x}_i)} y_i y_j A_{ij}$$
for all (𝑖, 𝑗) for which 𝑦𝑖 𝑦𝑗 = −1.
We can now think of the π‘₯𝑖 ’s as vertices in a graph, and 𝐴𝑖𝑗 as the weight of the
edge from 𝑖 to 𝑗. Then if 𝐺+ represents the sub-graph with all positive vertices (all
vertices with label of +1), and 𝐺− represents the sub-graph with all negative vertices
(all vertices with label −1), we need to maximize −cut(𝐺+ , 𝐺− ) (the sum of edge
weights between vertices in 𝐺+ and 𝐺− ), or minimize cut(𝐺+ , 𝐺− ).
We can think of this classifier now as a modified clustering task, where we’re
trying to classify unlabeled examples close to positive training examples as positive
and classify unlabeled examples close to negative training examples as negative.
Note that the problem, as currently formulated, closely resembles the classic min-cut problem. However, simply finding a solution to this problem may lead to potentially highly undesirable solutions – for example, a viable partition could involve one
sub-graph having just one vertex, and the other sub-graph having all other vertices.
This is a degenerate solution.
Part of the reason for this problem is the fact that our above formulation doesn’t
take into account the number of edges crossing the cut – the degenerate solution
which includes one vertex on one side, and all other vertices on the other side has a
very small number of edges passing through the cut, which automatically prefers cuts
of smaller weight.
This problem can be overcome by instead trying to minimize $\frac{\text{cut}(G_+, G_-)}{|G_+|\,|G_-|}$ (the
average cut weight) subject to the same constraints as before.
Obtaining an exact solution to this problem is extremely hard. However, there
exist efficient spectral methods that approximate the solution to this problem. We
describe one such method below [4].
Before proceeding, let us define the Laplacian 𝐿 of a graph 𝐺 as the 𝑛 × π‘› symmetric matrix whose (𝑖, 𝑗)th entry is given by,

$$L_{ij} = \begin{cases} \sum_k E_{ik}, & \text{if } i = j \\ -E_{ij}, & \text{if } i \neq j \text{ and edge } (i, j) \text{ exists} \\ 0, & \text{otherwise} \end{cases}$$
We claim that for any arbitrary 𝑛 × 1 vector π‘₯, $x^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2 E_{ij}$.
This is easy to observe from the fact that the Laplacian matrix 𝐿 can be expressed
as $I_G I_G^T$, where $I_G$ is the incidence matrix of the graph 𝐺 and has dimensions $n \times m$.
The column in matrix $I_G$ representing the edge between vertices 𝑖 and 𝑗 has all
zeros except at the 𝑖th and 𝑗th rows, which have values $\sqrt{E_{ij}}$ and $-\sqrt{E_{ij}}$ respectively.
$x^T L x$ can now be expressed as $(I_G^T x)^T (I_G^T x)$.
Observe that $I_G^T x$ is a column vector whose 𝑒th entry is $\sqrt{E_{ij}}(x_i - x_j)$, where 𝑒 is the
edge between vertices 𝑖 and 𝑗. This implies that $(I_G^T x)^T (I_G^T x) = \sum_{(i,j) \in E} (x_i - x_j)^2 E_{ij}$.
If we constrain $z_i$ to only take values $\{\gamma_+, \gamma_-\}$, where $\gamma_+ = \sqrt{\frac{|\{i : z_i < 0\}|}{|\{i : z_i > 0\}|}}$ and
$\gamma_- = -\sqrt{\frac{|\{i : z_i > 0\}|}{|\{i : z_i < 0\}|}}$, we see that $z^T L z$ is equal to

$$\frac{n^2}{|\{i : z_i < 0\}|\,|\{i : z_i > 0\}|} \, \text{cut}(G_+, G_-)$$

provided $z_i = \gamma_+$ for all positive points and $z_i = \gamma_-$ for all negative points.
𝑛 is a constant (total number of examples), which implies that minimizing the
average cut subject to constraints is equivalent to minimizing 𝑧 𝑇 𝐿𝑧 subject to the
constraints.
In addition, observe that from the way 𝑧 was constructed above, $z^T z = n$ and
$z^T \mathbf{1} = 0$.
Using all this information, we can formulate our objective as the following optimization problem (note that we’re still ignoring our original constraint equations),
$$\min_z \; z^T L z \quad \text{s.t. } z^T \mathbf{1} = 0 \;\text{ and }\; z^T z = n$$
We try to ensure that the predictions $\vec{z}$ stay close to the expected predictions $\vec{\gamma}$
(that is, try to ensure that the constraints are satisfied) by adding a loss term to
the formulation above,

$$\min_z \; z^T L z + c(z - \gamma)^T C (z - \gamma) \quad \text{s.t. } z^T \mathbf{1} = 0 \;\text{ and }\; z^T z = n$$
Here, 𝐢 is a matrix that allows us to weight the importance of different labels
differently, and $\vec{\gamma}$ is a vector with the 𝑖th element equal to $\gamma_+$ if the 𝑖th element has
a positive label, $\gamma_-$ if the 𝑖th element has a negative label, and 0 otherwise.
We can find an approximate solution to the above system by performing an eigendecomposition of 𝐿, which is possible since 𝐿 is a symmetric matrix. Let’s express
𝐿 as $U \Sigma U^T$, where $\Sigma$ is a diagonal matrix containing the eigenvalues of 𝐿, and $U$ is
the eigenvector matrix and is unitary. Also, from the way 𝐿 is formulated, we see that
$\vec{\mathbf{1}}$ is an eigenvector of 𝐿, with a corresponding eigenvalue of 0.

Let $V$ be the eigenvector matrix with $\vec{\mathbf{1}}$ removed, and $D$ be the eigenvalue matrix
with 0 removed. Now

$$U \Sigma U^T \approx V D V^T.$$
Substituting $z = Vw$ produces,

$$\min_w \; w^T D w + c(Vw - \gamma)^T C (Vw - \gamma) \quad \text{s.t. } w^T w = n$$
[8] specifies how to obtain the optimal solution to this problem. Given an adjacency matrix 𝐴 of a graph, the following steps can be performed to
obtain label assignments that minimize the total leave-one-out error,

βˆ™ Compute the Laplacian 𝐿 of the matrix 𝐴.

βˆ™ Compute the smallest 2 to 𝑑 + 1 eigenvalues and eigenvectors of 𝐿 – these form
𝐷 and 𝑉 respectively.

βˆ™ Compute $G = (D + cV^T C V)$ and $b = cV^T C \vec{\gamma}$, and find the smallest eigenvalue $\lambda^*$ of the matrix

$$\begin{bmatrix} G & -I \\ -\frac{1}{n} b b^T & G \end{bmatrix}$$

βˆ™ Compute predictions as $z^* = V(G - \lambda^* I)^{-1} b$. These 𝑧’s can be used for
ranking.

βˆ™ An appropriate threshold can be selected to get discretized class labels of +1
and −1.
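Below is a minimal NumPy sketch of these steps, under the assumption that the similarity graph is small enough to handle with dense matrices (names and the handling of the block matrix are illustrative; this is not the thesis implementation):

```python
import numpy as np

def spectral_transduction(A, labels, d=10, c=1.0):
    """Sketch of the spectral procedure described above.

    A: (n, n) symmetric similarity / adjacency matrix.
    labels: length-n array with +1 or -1 for the annotated points and 0 for unlabeled points.
    Returns real-valued predictions z* that can be thresholded or used directly for ranking.
    """
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A                    # graph Laplacian

    eigvals, eigvecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    D = np.diag(eigvals[1:d + 1])                     # 2nd to (d+1)-th smallest eigenvalues
    V = eigvecs[:, 1:d + 1]                           # corresponding eigenvectors (constant vector dropped)

    n_pos, n_neg = np.sum(labels > 0), np.sum(labels < 0)
    gamma = np.zeros(n)
    gamma[labels > 0] = np.sqrt(n_neg / n_pos)        # gamma_plus
    gamma[labels < 0] = -np.sqrt(n_pos / n_neg)       # gamma_minus
    C = np.diag((labels != 0).astype(float))          # weight only the labeled points

    G = D + c * V.T @ C @ V
    b = c * V.T @ C @ gamma
    M = np.block([[G, -np.eye(d)],
                  [-np.outer(b, b) / n, G]])
    lam_star = np.min(np.real(np.linalg.eigvals(M)))  # smallest eigenvalue of the block matrix

    z_star = V @ np.linalg.solve(G - lam_star * np.eye(d), b)
    return z_star
```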
3.1.2
Relevance feedback
The technique described above allows us to easily determine which articles are “positive” and which articles are “negative” given minimal annotation (one or two examples of each class). A pseudo-relevance feedback algorithm can then make use of
these predictions to improve the quality of search results produced.

However, if our classifier isn’t accurate enough, it is possible that we suffer from a
problem known as query drift, where the queries obtained by moving queries based
on classifier output are in fact worse than the initial queries we started off with,
because of incorrect predictions made by the classifier.

<ARTIST> performing Live at <VENUE>
So excited for <ARTIST> today at <VENUE>
<VENUE> is going down tonight with <ARTIST> performing!

Figure 3-2: Example of patterns extracted from tweets describing music events in
New York
In practice, we have observed that there is a fair amount of noise in our classifier
output, leading to sub-optimal results. Making the classifier more robust remains an
area of future work.
3.2
Query template selection
Instead of relying on a classifier’s performance on unseen data, we can try to improve
the retrieval process by drawing conclusions exclusively from the training data.
We use techniques borrowed from a system called Snowball [1] to detect query
templates from text. These extracted query templates can be subsequently used to
create queries that can be fed through a retrieval system to retrieve articles.
In this task, we want to extract generic phrases that are representative of the
contexts in which records are usually found in unstructured text. For example, let’s say we
were trying to extract contexts in which (artist, venue) tuples are found in a corpus of
tweets referring to music events in New York. Some of the templates that we would
like to extract from these tweets are shown in figure 3-2.
We start off with a list of “seed tuples”. We go through the corpus of documents,
and extract all phrases that contain occurrences of all fields in the record or tuple.
These phrases can be converted to generic patterns – every specific occurrence
of a record field is replaced by its field type. For example, “John Lennon performed
in Madison Square Garden today" can be converted to “<ARTIST> performed in
<VENUE> today".
Note that this extracted pattern does not allow any arbitrary word to precede
“performed”, or any arbitrary word to precede “today” – the words in those positions
must be tagged with the tags ARTIST and VENUE respectively. We use a Named-Entity Recognizer (discussed more in the next chapter) to extract tags for every word
in the corpus.
We can now use the patterns just extracted to obtain more tuples. These new
tuples can subsequently be used to generate more patterns – the process continues
until we extract enough records / patterns. The overall system can be understood
better from Figure 3-3.
Figure 3-3: Design of snowball system
We use a vector space model to ensure that patterns similar to each other, but
not completely identical, are not considered to be the same “pattern” by the system.
Once we have a list of possible query templates that capture a number of the
contexts in which the records are found in unstructured text, we can combine our
templates with metadata about the incidents (in this case, foods and adulterants
involved in the adulteration incident) to generate queries that can subsequently be
fed into a vanilla retrieval system to retrieve new articles that can be added to the
dataset.
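A highly simplified sketch of the pattern-generalization and matching steps is shown below (the tag names, helper functions, and the use of plain regular expressions are assumptions for illustration – the actual Snowball-style system additionally checks the NER tags of matched words and uses a vector space model to merge near-identical patterns):

```python
import re

def generalize(sentence, entities):
    """Replace each known entity occurrence with its field type to obtain a pattern.

    entities: dict mapping surface string -> tag, e.g. the fields of a seed tuple.
    """
    pattern = sentence
    for surface, tag in entities.items():
        pattern = re.sub(re.escape(surface), f"<{tag}>", pattern)
    return pattern

def match(pattern, sentence):
    """Turn a pattern into a regex that extracts new (ARTIST, VENUE) tuples (no NER check here)."""
    regex = re.escape(pattern)
    regex = regex.replace(re.escape("<ARTIST>"), r"(?P<ARTIST>.+?)")
    regex = regex.replace(re.escape("<VENUE>"), r"(?P<VENUE>.+?)")
    m = re.search(regex, sentence)
    return m.groupdict() if m else None

pattern = generalize("John Lennon performed in Madison Square Garden today",
                     {"John Lennon": "ARTIST", "Madison Square Garden": "VENUE"})
print(pattern)                                              # <ARTIST> performed in <VENUE> today
print(match(pattern, "The Strokes performed in Webster Hall today"))
```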
3.3
Filtering for adulteration-related articles using a
Language Model
Another approach to retrieving new adulteration-related articles is by filtering through
a stream of Google News articles for adulteration-related articles. This approach is
fundamentally different from the one we’ve discussed so far, where we’re looking for
articles describing specific incidents.
There are a couple of different ways we can try filtering for adulteration-related
articles. One way is to build a standard classifier that attempts to predict if the topic
of the article is “food adulteration”.
Another way is to build a language model [7] that attempts to differentiate between
the language used in news articles that talk about adulteration and the language used
in news articles that don’t talk about adulteration.
Our hypothesis here is that articles that talk about adulteration use a lot more
technical and domain-specific language than articles related to other topics like sports
or business.
We use a standard bigram language model with Kneser-Ney smoothing. One
can easily extend these ideas to higher 𝑛-grams.
Smoothing is a technique used in Language Models to deal with data sparsity. On
many occasions, we don’t see an 𝑛-gram in its entirety in our training set. To overcome
this problem, with smoothing, we back off to a simpler model.
For the sake of simplicity, we discuss Kneser-Ney smoothing in the context of
bigrams. In a standard backoff model, if we didn’t see a bigram in our training set,
we back off to using a standard unigram model. This works well sometimes; however,
it often doesn’t work well since a unigram model completely neglects the context
inherent in a sentence.
Instead, Kneser-Ney attempts to capture how likely a word is to appear in an unseen
context, by counting the number of unique words that precede the given word.
Mathematically, we can write this as,

$$\Pr\nolimits_{CONT}(w_i) \propto |\{w_{i-1} : c(w_{i-1}, w_i) > 0\}|$$

After normalizing the above expression appropriately so that $\Pr_{CONT}$ sums to 1
over all $w_i$, we obtain the following expression for $\Pr_{KN}$,

$$\Pr\nolimits_{KN}(w_i | w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - \delta, 0)}{c(w_{i-1})} + \lambda \, \frac{|\{w_{i-1} : c(w_{i-1}, w_i) > 0\}|}{|\{w_{j-1} : c(w_{j-1}, w_j) > 0\}|}$$

where πœ† is a normalizing constant.
3.3.1
Evaluation
We train our Language model on a subset of our corpus of adulteration-related
articles (200 articles), and then evaluate performance on the remaining subset of
adulteration-related articles along with non-adulteration related articles scraped from
Google News.
We use an average negative log-likelihood threshold to decide whether a news article
talks about adulteration or not. Using this approach, we obtain an F-measure of
0.61. We hypothesize that this approach will do a lot better with a more extensive
training dataset.
Chapter 4
Article clustering and record
extraction
For the purposes of our dataset creation task, the mere retrieval of relevant news
articles is not sufficient, simply because raw articles are unstructured and hence difficult to understand for a computer program. An important goal of this project is
to extract fields from the articles we manage to retrieve. In addition, it is useful to
de-duplicate articles on the basis of the incident they are describing. Later in this
section, we will describe an approach that attempts to accomplish both these tasks
in a single, unified manner.
Before describing our combined approach, we describe some well-known ways of
going about solving the individual problems described above – one, the problem of
extracting records from unstructured text, and two, clustering texts according to their
content.
4.1
Named Entity Recognition using Conditional Random Fields
Conditional Random Fields are a well-established way of extracting records from
text. CRFs are a class of machine learning algorithms that are used for structured
prediction.

Figure 4-1: CRF model – here π‘₯ is the sequence of observed variables and 𝑠 is the
sequence of hidden variables
CRFs aim at utilizing information inherently present in a sequence to make
better classification decisions. It is possible to perform a task such as Part-of-Speech
tagging by trying to predict the part-of-speech tag of each word in a sentence in
isolation. But this approach won’t work well, because there exist a number of words
that function as different parts-of-speech in different contexts – for example, the
word “address” can function as both a noun and a verb, so without the context
in which the word is used, it would be impossible to always predict the right tag
associated with the word.
4.1.1
Mathematical formulation
Mathematically, we can think of this as a task where we try to predict the optimum
label sequence 𝑙 given a sentence 𝑠. In other words, we want to find the label sequence
𝑙 that maximizes the conditional probability Pr(𝑙|𝑠).
Given a sentence 𝑠, we can score a labeling 𝑙 of 𝑠 using a scoring function,
$$\text{score}(l|s) = \sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j f_j(s, i, l_i, l_{i-1}) = \sum_{i=1}^{n} \bar{\lambda} \cdot \bar{f}(s, i, l_i, l_{i-1})$$
Observe from this formulation that a label depends on its preceding label, the
word it corresponds to and the entire sentence the word belongs to.
Here, πœ†π‘— is used to weight the different 𝑓𝑗 differently.
We can now express the probability of a label sequence given a sentence 𝑠 as,
$$\Pr(l|s) = \frac{\exp[\text{score}(l|s)]}{\sum_{l'} \exp[\text{score}(l'|s)]} = \frac{\exp\left[\sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j f_j(s, i, l_i, l_{i-1})\right]}{\sum_{l'} \exp\left[\sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j f_j(s, i, l'_i, l'_{i-1})\right]}$$
For a classic part-of-speech tagging problem, possible 𝑓𝑗 ’s that can be used are
βˆ™ 𝑓1 (𝑠, 𝑖, 𝑙𝑖 , 𝑙𝑖−1 ) = 1 if 𝑙𝑖 = VERB; 0 otherwise.
βˆ™ 𝑓2 (𝑠, 𝑖, 𝑙𝑖 , 𝑙𝑖−1 ) = −1 if 𝑙𝑖 = NOUN and 𝑙𝑖−1 = ADJECTIVE; 0 otherwise.
βˆ™ 𝑓3 (𝑠, 𝑖, 𝑙𝑖 , 𝑙𝑖−1 ) = 1 if 𝑠[𝑖] = THE; 0 otherwise.
βˆ™ 𝑓4 (𝑠, 𝑖, 𝑙𝑖 , 𝑙𝑖−1 ) = tf(𝑠[𝑖]) · idf(𝑠[𝑖]), where tf(𝑀) is the number of times the word
𝑀 occurs in the sentence 𝑠, and idf(𝑀) is the inverse document frequency of the
word 𝑀.
4.1.2
Training
A problem that we skirted in the previous section was how to set the feature weights
πœ†π‘— . It is clear that setting πœ†π‘— to arbitrary values will not work well, since we may be
giving importance to features that are in reality not that important.
It turns out that we can find optimal weight parameters using gradient ascent.
Gradient ascent is an optimization algorithm that attempts to find the local maximum of a multi-variable function 𝐹(π‘₯), making use of the fact that the function
𝐹(π‘₯) at some point 𝑑 increases fastest in the direction of the gradient of 𝐹(π‘₯) at the
point 𝑑.

Essentially, we start off with a randomly assigned initial weight vector, and then
try to move towards more optimal weight vectors by moving in the direction of the
gradient $\frac{\partial \log \Pr(l|s)}{\partial \lambda}$.
For every feature 𝑗 we can compute the gradient of log Pr(𝑙|𝑠) with respect to the
corresponding weight πœ†π‘—,

$$\frac{\partial \log \Pr(l|s)}{\partial \lambda_j} = \sum_{i=1}^{m} f_j(s, i, l_i, l_{i-1}) - \sum_{l'} \Pr(l'|s) \sum_{i=1}^{m} f_j(s, i, l'_i, l'_{i-1})$$
However, note that the number of possible 𝑙′ is exponential in the length of the
sentence, so this update looks intractable. However, observe that

$$\frac{\partial \log \Pr(l|s)}{\partial \lambda_j} = \sum_{i=1}^{m} f_j(s, i, l_i, l_{i-1}) - \sum_{i=1}^{m} \sum_{l'} \Pr(l'|s) f_j(s, i, l'_i, l'_{i-1})$$

$$= \sum_{i=1}^{m} f_j(s, i, l_i, l_{i-1}) - \sum_{i=1}^{m} \sum_{l'_{i-1}, l'_i} \Pr(l'_{i-1}, l'_i \,|\, s)\, f_j(s, i, l'_i, l'_{i-1})$$

Here we marginalized Pr(𝑙′|𝑠) over all labels in the label sequence 𝑙′ that are not
at indices 𝑖−1 and 𝑖. Computing the probability of all pairs of labels $(l'_{i-1}, l'_i)$ requires
$O(k^2)$ time, where π‘˜ is the number of possible labels – this is clearly tractable.
We can now move πœ†π‘— in the direction of this gradient, to obtain a new weight
value πœ†′𝑗 ,
πœ†′𝑗 = πœ†π‘— + 𝛼
πœ• log Pr(𝑙|𝑠)
πœ•πœ†π‘—
Here, as stated before, we are moving the weight vector in the direction of maximum change of the log-probability of a label sequence given a sentence. 𝛼 is the
learning rate. The larger the value of 𝛼, the faster πœ† moves in the direction of the
gradient; if the value of 𝛼 is very high however, the updates will be inaccurate, and
πœ† will move far away from its true optimal value.
¯ until convergence, or for a fixed number of iterations.
We make updates to πœ†
4.1.3
Finding optimal label sequence
Given a trained CRF, we want to find the optimal label sequence of a given sentence
𝑠. One way of doing this is to compute Pr(𝑙′ |𝑠) for every possible label sequence 𝑙′ of
a sentence 𝑠 to figure out which label sequence 𝑙′ is most likely given a sentence 𝑠.
However, for a sentence of length 𝑛, if the total number of labels for each word
is π‘˜, then the total number of labeling sequences for the sentence 𝑠 is π‘˜ 𝑛 , which is
exponential in the length of the sentence.
However, since our formulation only depends on pairs of adjacent labels, we can use
the Viterbi algorithm [2] to cut the complexity down to $O(nk^2)$.
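A sketch of the dynamic program, where `score` stands in for the weighted feature sum $\bar{\lambda} \cdot \bar{f}(s, i, l_i, l_{i-1})$ from the formulation above (the interface is an assumption made for illustration):

```python
import numpy as np

def viterbi(score, n_words, n_labels):
    """Sketch of Viterbi decoding for a pairwise-factored scorer (illustrative).

    score(i, label, prev_label) returns the local score for assigning `label` at position i
    given `prev_label` at position i-1; the best sequence is found in O(n k^2) time
    instead of enumerating all k^n label sequences.
    """
    best = np.zeros((n_words, n_labels))             # best score of any prefix ending in each label
    back = np.zeros((n_words, n_labels), dtype=int)  # argmax pointers for reconstruction

    for lab in range(n_labels):
        best[0, lab] = score(0, lab, None)
    for i in range(1, n_words):
        for lab in range(n_labels):
            candidates = [best[i - 1, prev] + score(i, lab, prev) for prev in range(n_labels)]
            back[i, lab] = int(np.argmax(candidates))
            best[i, lab] = candidates[back[i, lab]]

    # Follow back-pointers from the best final label.
    labels = [int(np.argmax(best[-1]))]
    for i in range(n_words - 1, 0, -1):
        labels.append(int(back[i, labels[-1]]))
    return labels[::-1]
```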
4.1.4
Entity recognition
We can utilize the above framework to recognize entities within an unstructured blob
of text. We can tag all entities that interest us with their respective tags, and all
words encountered in our training corpus that aren’t specific entities can be labeled
with a generic tag – let us call this generic tag β€˜O’.
Now, when we run our trained model on our testing corpus, any contiguous sequence of words tagged with the same tag is treated as a recognized entity of that
particular tag.
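A minimal sketch of this post-processing step, assuming the tokens of a document and their predicted tags are available as parallel lists (the example tokens and tags are illustrative):

```python
def extract_entities(tokens, tags):
    """Group contiguous runs of identically-tagged tokens into entities.

    Tokens tagged 'O' (the generic tag) are skipped; everything else is
    returned as (entity_text, tag) pairs.
    """
    entities, current, current_tag = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == current_tag and tag != "O":
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), current_tag))
            current, current_tag = ([token], tag) if tag != "O" else ([], None)
    if current:
        entities.append((" ".join(current), current_tag))
    return entities

# Example: extract_entities(["milk", "with", "melamine"], ["FOOD", "O", "ADULTER"])
# -> [("milk", "FOOD"), ("melamine", "ADULTER")]
```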
4.1.5 Evaluation
We run Stanford’s NER code [5] on our dataset of adulteration-related news articles,
trying to recognize foods and adulterants. We split our dataset of approximately 150
articles 70 − 30 into training and testing sets.
We obtain the following precision, recall and F-measure for the different fields.
Entity     Precision   Recall   F-measure
ADULTER    0.846       0.653    0.737
FOOD       0.998       0.973    0.985
Total      0.923       0.796    0.854

Table 4.1: Precision, recall and F1 of NER on training data
Entity     Precision   Recall   F-measure
ADULTER    0.522       0.224    0.313
FOOD       0.798       0.326    0.463
Total      0.667       0.279    0.393

Table 4.2: Precision, recall and F1 of NER on testing data
We see that on examples not yet seen, there is still a fair amount of noise in the accuracy of the fields extracted, especially for the adulterant field. As is expected, performance on the training set is significantly better than performance on the testing set.
Slightly more surprising is the difference in performance between foods and adulterants. This can probably be attributed to the fact that the names of adulterants are often extremely technical, which means the number of unseen adulterant tokens in the test dataset is a lot higher than the number of unseen food tokens.
4.2 Clustering
Our dataset consists of a number of articles that describe instances of adulteration. These articles are collected from a number of very different sources, and often describe the same adulteration incident, even though the exact content of the articles may not be completely identical.
Since we can have multiple duplicates for every incident, it makes sense to cast
this as a clustering problem, where ideally, two articles are in the same cluster if
they describe the same incident, and articles in different clusters describe different
incidents.
4.2.1 Clustering using k-means
K-means is one of the more popular clustering algorithms. Like most clustering
algorithms, it is completely unsupervised.
Mathematical formulation
Given 𝑛 points (π‘₯1, π‘₯2, . . . , π‘₯𝑛) where each π‘₯𝑖 is a 𝑑-dimensional vector, we want to assign each π‘₯𝑖 to a set 𝑆𝑗 where 1 ≤ 𝑗 ≤ π‘˜, such that 𝑆1 ∪ 𝑆2 ∪ . . . ∪ π‘†π‘˜−1 ∪ π‘†π‘˜ = {π‘₯1, π‘₯2, . . . , π‘₯𝑛} and the sets 𝑆𝑗 are pairwise disjoint (that is, each π‘₯𝑖 belongs to exactly one cluster).
We can mathematically formulate this task as trying to find the set 𝑆 = {𝑆1 , 𝑆2 , . . . , π‘†π‘˜ }
that minimizes,
π‘˜ ∑︁
∑︁
||π‘₯ − πœ‡π‘– ||2
𝑖=1 π‘₯∈𝑆𝑖
Here, we can think of πœ‡π‘– as the representative element of the cluster 𝑖.
Minimizing the above expression with respect to πœ‡π‘– shows us that πœ‡π‘– must be equal
to the centroid of the data points assigned to the cluster 𝑖.
Description
Every iteration of k-means clustering involves two important steps,
1. Assignment step: Each point π‘₯1, π‘₯2, . . . , π‘₯𝑛 must be assigned to one of π‘˜ clusters. This is done by assigning each point to the cluster with the nearest representative point.

$$S_i^{(t)} = \{x_p : \|x_p - \mu_i^{(t)}\|^2 \le \|x_p - \mu_j^{(t)}\|^2 \ \forall j,\ 1 \le j \le k\}$$
2. Update step: Once every point is assigned to a cluster, we need to update the
centroid of each cluster.
Each new centroid is given by the expression,

$$\mu_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$$
The above two steps are repeated until convergence. Every point π‘₯𝑖 is initially
assigned to a random cluster.
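A minimal numpy sketch of these two steps is shown below; the points are assumed to be rows of a matrix X, and the corner case of a cluster becoming empty is not handled in this sketch.

```python
import numpy as np

def kmeans(X, k, num_iterations=100, seed=0):
    """Plain k-means: alternate the update and assignment steps."""
    rng = np.random.default_rng(seed)
    # Every point is initially assigned to a random cluster.
    assignments = rng.integers(0, k, size=len(X))
    for _ in range(num_iterations):
        # Update step: each centroid is the mean of the points assigned to it
        # (empty clusters are not handled here).
        centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        # Assignment step: assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
    return assignments, centroids
```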
4.2.2 Clustering using HCA
Hierarchical clustering is a method of clustering that tries to build a hierarchy of clusters. It circumvents the problem of initializing data points to clusters by starting with every point as its own cluster of size 1, and then collapsing similar clusters to obtain larger clusters.
Formulation
As with k-means, the first step in this process involves vectorizing the news article.
Once we have a vector representation of each news article, we can run HCA on the
resulting vectors to obtain clusters, such that all vectors in a cluster are similar to
each other.
Figure 4-2: Example clustering of documents using HCA
In Hierarchical Agglomerative Clustering, every article starts off as its own cluster – that is, if we start with 𝑛 articles, we initially have 𝑛 different clusters. In every iteration of the algorithm, we attempt to merge the two closest clusters. Note that we can use different inter-cluster similarity metrics. We describe them in greater detail below,
βˆ™ Single-link clustering: The distance between two clusters is the distance
between their most similar members. This inter-cluster similarity metric’s main
disadvantage is that it disregards the fact that the furthest points in clusters
that are going to be merged may in fact be really far apart.
π‘‘πΆπΏπ‘ˆ 𝑆 (𝐴, 𝐡) = min 𝑑(π‘₯, 𝑦)
π‘₯∈𝐴,𝑦∈𝐡
βˆ™ Complete-link clustering: The distance between two clusters is the distance
between their most dissimilar members.
π‘‘πΆπΏπ‘ˆ 𝑆 (𝐴, 𝐡) = max 𝑑(π‘₯, 𝑦)
π‘₯∈𝐴,𝑦∈𝐡
βˆ™ Average-link clustering: The distance between two clusters is the average of
the distances between all pairs of points in the two clusters. Mathematically,
π‘‘πΆπΏπ‘ˆ 𝑆 (𝐴, 𝐡) =
∑︁ ∑︁
1
𝑑(π‘₯, 𝑦)
|𝐴| · |𝐡| π‘₯∈𝐴 𝑦∈𝐡
52
Pseudocode

HAC(π‘₯1, π‘₯2, . . . , π‘₯𝑛, 𝑑)
    Initialize 𝐢
    for 𝑖 = 1 to 𝑛
        𝐢[𝑖] = π‘₯𝑖
    maxClusterSimilarity = ComputeMaxClusterSimilarity(𝐢)
    while maxClusterSimilarity < 𝑑
        𝑖, 𝑗 = FindMostSimilarClusters(𝐢)
        MergeClusters(𝐢, 𝑖, 𝑗)
        maxClusterSimilarity = ComputeMaxClusterSimilarity(𝐢)

Figure 4-3: Pseudocode of HCA algorithm, with termination inter-cluster similarity threshold 𝑑
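In practice, this kind of threshold-based agglomerative clustering can be run with SciPy; the sketch below assumes the articles have already been turned into dense tf-idf vectors (one row per article), and the threshold value is purely illustrative.

```python
from scipy.cluster.hierarchy import linkage, fcluster

def hca_cluster(X, threshold=0.8):
    """Average-link agglomerative clustering of article vectors X.

    Clusters are merged until the closest pair of remaining clusters is
    further apart (in cosine distance) than `threshold`.
    """
    Z = linkage(X, method="average", metric="cosine")
    # fcluster cuts the resulting dendrogram at the given distance.
    return fcluster(Z, t=threshold, criterion="distance")
```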
4.2.3 Evaluation
We run our two clustering algorithms on two datasets – a set of news articles scraped
from Google news, and our dataset of adulteration-related articles.
We first present results obtained on the Google news corpus. We use tf-idf vector
representations of the text, with and without keyword extraction. We chose 20 distinct incidents scraped from the landing page of news.google.com, and then tried
clustering the 76 obtained articles into 20 clusters.
Clustering algorithm    F-measure with keywords   F-measure w/o keywords
HCA (Complete-link)     0.83                      0.69
HCA (Single-link)       0.67                      0.25
HCA (Average-link)      0.87                      0.63
K-means                 0.84                      0.77

Table 4.3: F-measure of clustering algorithms with and w/o keyword extraction on Google News articles
We next present results obtained on adulteration-related articles. We run our experiments on two subsets of the adulteration article dataset at our disposal – one with 53 articles and another with 116 articles. Both times, we tried to cluster the dataset of articles into 15 clusters.
Clustering algorithm    F-measure with k/w extr.   F-measure w/o k/w extr.
HCA (Complete-link)     0.67                       0.43
HCA (Single-link)       0.62                       0.35
HCA (Average-link)      0.57                       0.47
K-means                 0.13                       0.65

Table 4.4: F-measure of clustering algorithms with and w/o keyword extraction on set of 53 adulteration-related articles
Clustering algorithm    F-measure with k/w extr.   F-measure w/o k/w extr.
HCA (Complete-link)     0.35                       0.19
HCA (Single-link)       0.13                       0.10
HCA (Average-link)      0.27                       0.11

Table 4.5: F-measure of clustering algorithms with and w/o keyword extraction on set of 116 adulteration-related articles
Most of the clustering algorithms seem to perform well on the Google News corpus – this may partially be due to the fact that a number of these articles come from very different domains (sports, business, current affairs), so distinguishing between them using word frequencies works quite well in practice. Adulteration articles, on the other hand, are probably quite similar at the word-token level, which means performance on the adulteration dataset is far worse. More clever feature representations probably exist, but we hypothesize that the best possible performance of these algorithms is not much better than that presented in Table 4.4 and Table 4.5.
4.3 Combining entity recognition and clustering
Both approaches that we described above are imperfect and include a lot of noise in their results – sometimes the NER algorithm extracts incorrect fields from the article, and sometimes the clustering algorithm inserts articles into the wrong cluster.
With the knowledge that articles in the same cluster should have the same entities
and articles with the same entities should belong to the same clusters, we claim that
precision of both clustering and entity recognition can be improved [3].
Before describing our approach in detail, we describe some tools needed to understand it better.
4.3.1 Mathematical tools
Coordinate ascent
Coordinate ascent is a technique used to optimize a multivariable function 𝐹(π‘₯) by maximizing along one dimension at a time.
Given a setting of variables π‘₯π‘˜ at the end of some iteration π‘˜, we can find the optimal setting of variables π‘₯π‘˜+1 at the end of iteration π‘˜ + 1 by computing, for each dimension 𝑖,

$$x^{k+1}_i = \arg\max_{y \in \mathbb{R}} f(x^{k+1}_1, x^{k+1}_2, \ldots, x^{k+1}_{i-1}, y, x^{k}_{i+1}, \ldots, x^{k}_n)$$
We then start off with some arbitrary initial vector π‘₯0 , and subsequently compute
π‘₯1 , π‘₯2 , . . ..
By doing line search, we get 𝐹 (π‘₯0 ) ≤ 𝐹 (π‘₯1 ) ≤ 𝐹 (π‘₯2 ) ≤ . . .. Note that this is
slightly different from gradient ascent where we attempt to modify all parameters in
one go, instead of in a sequential manner.
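A minimal sketch of coordinate ascent on an arbitrary function is shown below; a simple grid search stands in for the one-dimensional maximizer, and any line-search routine could be substituted.

```python
import numpy as np

def coordinate_ascent(f, x0, num_iterations=50, grid=np.linspace(-10, 10, 1001)):
    """Maximize f by optimizing one coordinate at a time.

    Each coordinate update is a grid search over candidate values; this is
    only a stand-in for a proper one-dimensional maximizer.
    """
    x = np.array(x0, dtype=float)
    for _ in range(num_iterations):
        for i in range(len(x)):
            candidates = np.tile(x, (len(grid), 1))
            candidates[:, i] = grid
            values = np.array([f(c) for c in candidates])
            x[i] = grid[values.argmax()]
    return x

# Example: maximizing f(x) = -(x[0] - 1)**2 - (x[1] + 2)**2 converges to (1, -2).
```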
Variational inference
For many models, computing the posterior is computationally intractable. To overcome this problem, we can try to find a distribution over latent variables that approximates the true posterior. We call this approximating distribution π‘ž, and we use it instead of the posterior to make predictions about the data.
The Kullback-Leibler divergence is a measure of the closeness of two distributions. Given two distributions 𝑝 and π‘ž, the KL divergence is defined as,

$$KL(q \,\|\, p) = \mathbb{E}_q\left[\log \frac{q(Z)}{p(Z)}\right]$$
In variational inference, we try to approximate a posterior distribution 𝑝(𝑍|π‘₯)
with a distribution π‘ž(𝑍) that is chosen out of a family of distributions. π‘ž is picked so as to minimize the KL divergence between 𝑝 and π‘ž.
Intuitively, this makes sense, because KL(π‘ž||𝑝) satisfies the following nice properties,

βˆ™ For all points where the value of the distribution π‘ž is high and 𝑝 is high, the contribution towards KL(π‘ž||𝑝) is low.

βˆ™ For all points where the value of the distribution π‘ž is high and 𝑝 is low, the contribution towards KL(π‘ž||𝑝) is high.

βˆ™ For all points where the value of the distribution π‘ž is low, the contribution towards KL(π‘ž||𝑝) is low, since we're computing the expectation of $\log \frac{q(Z)}{p(Z|x)}$ with respect to the distribution π‘ž.
It is important to note that 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) is not symmetric, and that 𝐾𝐿(𝑝(𝑧|π‘₯)||π‘ž(𝑧)) satisfies a number of the nice properties that 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) does. However, 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) is a lot easier to compute, since it involves taking expectations over the distribution π‘ž, which is much easier than taking expectations over the distribution 𝑝(𝑧|π‘₯).
Note that 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) can be written as,

$$\begin{aligned}
KL(q(z) \,\|\, p(z|x)) &= \mathbb{E}_q\left[\log \frac{q(Z)}{p(Z|x)}\right] \\
&= \mathbb{E}_q[\log q(Z)] - \mathbb{E}_q[\log p(Z|x)] \\
&= \mathbb{E}_q[\log q(Z)] - \mathbb{E}_q[\log p(Z, x)] + \log p(x) \\
&= -\left(\mathbb{E}_q[\log p(Z, x)] - \mathbb{E}_q[\log q(Z)]\right) + \log p(x)
\end{aligned}$$

Since 𝑝(π‘₯) does not depend on π‘ž in any way, we can think of minimizing 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) as simply minimizing the expression $\mathbb{E}_q[\log q(Z)] - \mathbb{E}_q[\log p(Z, x)]$.
In mean-field variational inference, we usually factorize the distribution π‘ž(𝑧1, . . . , π‘§π‘š) as the product of independent factors π‘ž(𝑧𝑗), that is,

$$q(z_1, \ldots, z_m) = \prod_{j=1}^{m} q(z_j)$$
With this factorization, we can express 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) as,

$$\begin{aligned}
KL(q(z) \,\|\, p(z|x)) &= \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z|x)] + \text{constant} \\
&= \sum_j \mathbb{E}_j[\log q(z_j)] - \mathbb{E}_q[\log p(z|x)] + \text{constant}
\end{aligned}$$

Here, $\mathbb{E}_j[\log q(z_j)]$ is the expectation of log π‘ž(𝑧𝑗) with respect to the distribution π‘ž(𝑧𝑗).
Further, we will use the chain rule to simplify the distribution 𝑝(𝑧1:π‘š|π‘₯1:𝑛) as,

$$p(z_{1:m} \mid x_{1:n}) = \prod_{j=1}^{m} p(z_j \mid z_{1:(j-1)}, x_{1:n})$$
Using this, we can further simplify our expression for 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) as,

$$KL(q(z) \,\|\, p(z|x)) = \sum_j \mathbb{E}_j[\log q(z_j)] - \sum_j \mathbb{E}_q[\log p(z_j \mid x, z_{1:(j-1)})]$$
We will use coordinate ascent to find optimal values of our variational factors, optimizing one factor at a time while keeping all other factors constant.
We can now split 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) into a sum of functions of π‘§π‘˜. Each of these functions can be expressed in the following form,

$$\mathcal{L}_k = \mathbb{E}_k[\log q(z_k)] - \int q(z_k)\, \mathbb{E}_{-k}[\log p(z_k \mid x, z_{1:(j-1)})]\, dz_k$$

Differentiating $\mathcal{L}_k$ with respect to π‘ž(π‘§π‘˜) and equating to 0 gives us,

$$\frac{d\mathcal{L}_k}{dq(z_k)} = \mathbb{E}_{-k}[\log p(z_k \mid z_{-k}, x)] - \log q(z_k) - 1 = 0$$
$$\Rightarrow q^*(z_k) \propto \exp\{\mathbb{E}_{-k}[\log p(z_k \mid z_{-k}, x)]\}$$
4.3.2 Task setup
News articles talking about the same event, intuitively, should talk about the same
entities. We utilize this intuition to aid our clustering task – articles belonging to the
same cluster should produce entities that are the same.
Before describing the approach in greater detail, we describe the notation used.
Notation
We assume that we have 𝑛 messages to cluster. We define the set of messages as
π‘₯ = {π‘₯1 , π‘₯2 , . . . , π‘₯𝑛 }.
Each record is a collection of fields. We define the set of all records to be 𝑅 – π‘…π‘˜ refers to the π‘˜th record. Each record field is referred to by its label 𝑙 – more precisely, we refer to the 𝑙th field of the π‘˜th record as π‘…π‘˜π‘™. We assume that |𝑅| = π‘š.
Each message π‘₯𝑖 maps to a single record with id 𝐴𝑖 . For all 𝑖 ∈ {1, 2, . . . , 𝑛},
𝐴𝑖 ∈ {1, 2, . . . , π‘š}. 𝐴 refers to the set of all alignments.
In addition, we define 𝑦 as the set of all labeling sequences. 𝑦𝑖 is the label sequence
of the message π‘₯𝑖 .
Formulation
In this approach, we want to find 𝑅, 𝐴 and 𝑦 that maximize Pr(𝑅, 𝐴, 𝑦|π‘₯), which can
be thought of as the probability of records, alignments and labelings given a set of
messages.
We express the probability Pr(𝑅, 𝐴, 𝑦|π‘₯) by the following mathematical expression,

$$\Pr(R, A, y \mid x) = \left(\prod_i \varphi_{SEQ}(x_i, y_i)\right) \left(\prod_l \varphi_{UNQ}(R^l)\right) \left(\prod_{i,l} \varphi_{POP}(x_i, y_i, R^l_{A_i})\right) \left(\prod_i \varphi_{CON}(x_i, y_i, R_{A_i})\right)$$
We explain the meanings of the above potential functions below,
1. πœ‘π‘†πΈπ‘„ : The sequential potential attempts to estimate the likelihood of a label
sequence 𝑦𝑖 given a message π‘₯𝑖 . The potential over a message label sequence
decomposes over pairs of labels as follows,
𝑇
πœ‘π‘†πΈπ‘„ (π‘₯, 𝑦) = exp{πœƒπ‘†πΈπ‘„
𝑓𝑆𝐸𝑄 (π‘₯, 𝑦)}
{οΈƒ
}οΈƒ
∑︁
𝑇
= exp πœƒπ‘†πΈπ‘„
𝑓𝑆𝐸𝑄_𝑇 𝐸𝑅𝑀 (π‘₯, 𝑦 𝑗 , 𝑦 𝑗+1 )
𝑗
Note that this formulation is exactly the same as the one used to compute
optimal label sequences using CRFs.
Here, 𝑓𝑆𝐸𝑄_𝑇 𝐸𝑅𝑀 is the dot product of a weight vector πœ†π‘†πΈπ‘„ and a feature
vector (with features similar to the ones described in section 4.1). The weights
πœ†π‘†πΈπ‘„ are learnt using a standard log likelihood objective function, and represent
59
the only weights in this model that are learnt.
2. πœ‘π‘ˆ 𝑁 𝑄 : In our clustering task, we want to prevent similar records from clustering
into multiple clusters. We do this by trying to prevent the same field of different
records from overlapping textually. Mathematically, we can express this as,
πœ‘π‘ˆ 𝑁 𝑄 (𝑅𝑙 ) =
∏︁
πœ‘π‘ˆ 𝑁 𝑄_𝑇 𝐸𝑅𝑀 (π‘…π‘˜π‘™ , π‘…π‘˜π‘™ ′ )
π‘˜ΜΈ=π‘˜′
where π‘…π‘˜π‘™ and π‘…π‘˜π‘™ ′ are the values of the field 𝑙 for records π‘…π‘˜ and π‘…π‘˜′ . We define
a πœ‘π‘ˆ 𝑁 𝑄_𝑇 𝐸𝑅𝑀 that discourages textual similarity between different field values
in a record,
𝑇
πœ‘π‘ˆ 𝑁 𝑄_𝑇 𝐸𝑅𝑀 (π‘…π‘˜π‘™ , π‘…π‘˜π‘™ ′ ) = exp{−πœƒπ‘†πΌπ‘€
𝑓𝑆𝐼𝑀 (π‘…π‘˜π‘™ , π‘…π‘˜π‘™ ′ )}
Here,
𝑓𝑆𝐼𝑀 (π‘…π‘˜π‘™ , π‘…π‘˜π‘™ ′ ) =
|π‘…π‘˜π‘™ ∩ π‘…π‘˜π‘™ ′ |
max(|π‘…π‘˜π‘™ |, |π‘…π‘˜π‘™ ′ |)
πœƒπ‘†πΌπ‘€ is a weight that can have different values for different 𝑙; we will explain in
our results section how we set πœƒπ‘†πΌπ‘€ in our different tasks.
3. πœ‘π‘ƒ 𝑂𝑃 : This potential attempts to capture similarity between a message sequence
π‘₯, a label sequence 𝑦 and a record field 𝑣. Mathematically,
𝑙
πœ‘π‘ƒ 𝑂𝑃 (π‘₯, 𝑦, 𝑅𝐴
= 𝑣) =
∑︁
max
π‘˜
𝑗
𝑙
πœ‘π‘ƒ 𝑂𝑃 _𝑇 𝐸𝑅𝑀 (π‘₯𝑗 , 𝑦 𝑗 , 𝑅𝐴
= π‘£π‘˜ )
|𝑣|
𝑙
πœ‘π‘ƒ 𝑂𝑃 _𝑇 𝐸𝑅𝑀 (π‘₯, 𝑦, 𝑅𝐴
= 𝑣) can be expressed as the sum of three terms,
βˆ™ 𝐼𝐷𝐹 (π‘₯𝑗 ) · ℐ[π‘₯𝑗 = 𝑣 π‘˜ ]
βˆ™ 𝛼𝐼𝐷𝐹 (π‘₯𝑗 ) · (ℐ[π‘₯𝑗−1 = 𝑣 π‘˜−1 ] + ℐ[π‘₯𝑗+1 = 𝑣 π‘˜+1 ])
βˆ™ ℐ[π‘₯𝑗 = 𝑣 π‘˜ and π‘₯ contains 𝑣]/|𝑣|
The πœ‘π‘ƒ 𝑂𝑃 score will be high if record field candidate 𝑣 of type 𝑙 overlaps textually with words in the sentence π‘₯ that are labeled with tag 𝑙. Note that
this function tolerates minor differences between a candidate record field value
and text within a message, since it looks at similarity at a token level. This is
superior to straight string matching which is not tolerant towards such minor
differences.
4. πœ‘πΆπ‘‚π‘ :
πœ‘πΆπ‘‚π‘ (π‘₯, 𝑦, 𝑅𝐴 ) =
⎧
βŽͺ
⎨1 πœ‘π‘ƒ 𝑂𝑃 (π‘₯, 𝑦, 𝑅𝑙 ) > 0 for all 𝑙
𝐴
βŽͺ
⎩0 otherwise
The πœ‘πΆπ‘‚π‘ score will be high if all fields in the record 𝑅𝐴 overlap textually
with words in the sentence π‘₯ with label sequence 𝑦. The πœ‘πΆπ‘‚π‘ score is a good
indicator of whether a particular record is a good candidate for a given sentence
π‘₯. Just as the πœ‘π‘ƒ 𝑂𝑃 score tries to soft-match a record value with a sentence,
the πœ‘πΆπ‘‚π‘ score tries to soft-match an entire record with a sentence.
We now proceed to approximate the posterior distribution Pr(𝑅, 𝐴, 𝑦|π‘₯) with a distribution 𝑄(𝑅, 𝐴, 𝑦), which can be expressed as,

$$\Pr(R, A, y \mid x) \approx Q(R, A, y) = \left(\prod_{k=1}^{K} \prod_l q(R^l_k)\right) \left(\prod_{i=1}^{n} q(A_i)\, q(y_i)\right)$$
Note that we make a number of independence assumptions for the distribution 𝑄 to make computations easier – the variables 𝑅, 𝐴 and 𝑦 are in fact not independent in the posterior distribution Pr(𝑅, 𝐴, 𝑦|π‘₯).
Optimal variational factors can be found by performing the following updates,
βˆ™ Labeling update:

$$\begin{aligned}
\ln q(y^j = l) &\propto \mathbb{E}_{Q/q(y^j)}\left[\ln \varphi_{SEQ}(x, y, j) + \ln \varphi_{POP}(x, y, R^l_A)\,\varphi_{CON}(x, y, R_A)\right] \\
&= \ln \varphi_{SEQ}(x, y, j) + \mathbb{E}_{Q/q(y^j)}\left[\ln \varphi_{POP}(x, y, R^l_A)\,\varphi_{CON}(x, y, R_A)\right] \\
&= \ln \varphi_{SEQ}(x, y, j) + \sum_{j' \neq j,\, l',\, v,\, z} q(A = z)\, q(y^{j'} = l')\, q(R^l_z = v)\, \ln\left[\varphi_{POP}(x, y, R^l_z = v)\,\varphi_{CON}(x, y, R^l_z = v)\right]
\end{aligned}$$
βˆ™ Alignment update:

$$\begin{aligned}
\ln q(A = z) &\propto \mathbb{E}_{Q/q(A)}\left[\ln \varphi_{SEQ}(x, y) + \ln \varphi_{POP}(x, y, R^l_A)\,\varphi_{CON}(x, y, R_A)\right] \\
&\propto \mathbb{E}_{Q/q(A)}\left[\ln \varphi_{POP}(x, y, R^l_A)\,\varphi_{CON}(x, y, R_A)\right] \\
&= \sum_{j, v, l} q(R^l_z = v)\, q(y^j = l)\, \ln\left[\varphi_{POP}(x, y, R^l_A)\,\varphi_{CON}(x, y, R_A)\right]
\end{aligned}$$
βˆ™ Record update:

$$\begin{aligned}
\ln q(R^l_k = v) &\propto \mathbb{E}_{Q/q(R^l_k)}\left[\sum_{k'} \ln \varphi_{UNQ}(R_{k'}, v) + \sum_i \ln\left[\varphi_{POP}(x_i, y_i, v)\,\varphi_{CON}(x_i, y_i, v)\right]\right] \\
&= \sum_{k \neq k',\, v'} q(R^l_{k'} = v')\, \ln \varphi_{UNQ}(v, v') + \sum_i q(A_i = k) \sum_j q(y^j_i = l)\, \ln\left[\varphi_{POP}(x, y, R^l_z = v, j)\,\varphi_{CON}(x, y, R^l_z = v, j)\right]
\end{aligned}$$
Note that this update is in fact expensive, since we need to iterate through all
possible messages to compute the new optimal π‘ž(π‘…π‘˜π‘™ = 𝑣) value.
We initialize π‘ž(𝑦) factors with marginals obtained from the CRF classifier; the
π‘ž(𝑅) and π‘ž(𝐴) factors are initialized to a uniform distribution with some noise.
We precompute the (π‘₯, 𝑙, 𝑣) tuples that produce non-zero πœ‘π‘†πΈπ‘„ and πœ‘πΆπ‘‚π‘ values. It is worth pointing out that figuring out the list of possible record values for each label 𝑙 is a non-trivial task, and deserves a better explanation. One heuristic method to compute record values for a label 𝑙 is to filter all words that have a probability of 0.1 or greater of being labeled with 𝑙, and then concatenate all such consecutive words together to obtain a full record value.
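A sketch of this heuristic, assuming per-token marginal probabilities from the CRF are available as dictionaries mapping labels to probabilities; the 0.1 threshold matches the value mentioned above.

```python
def candidate_values(tokens, marginals, label, threshold=0.1):
    """Heuristically propose record values for a given label.

    `marginals[i]` is assumed to map each label to the CRF marginal probability
    that token i carries it; consecutive tokens above the threshold are
    concatenated into a single candidate value.
    """
    candidates, current = [], []
    for token, probs in zip(tokens, marginals):
        if probs.get(label, 0.0) >= threshold:
            current.append(token)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates
```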
4.3.3 Datasets
We use a couple of different datasets to evaluate the performance of our system.
Besides using the adulteration-related article dataset, we also use a Twitter event
dataset.
βˆ™ Twitter event data: We use our system to extract entities and cluster tweets
that talk about music concerts that took place in New York. Our goal here is
to have all tweets that talk about the same concert map to the same cluster.
We try to extract the name of the artist and the venue of the concert from the
tweet.
βˆ™ Adulteration article data: We also evaluate our system on the original dataset we intended our system for. Given a set of articles that talk about food adulteration, we try to map news articles that talk about the same adulteration incident to the same cluster.
4.3.4 Evaluation
Here, we want to measure the accuracy of two separate, yet dependent tasks – record extraction and clustering. We present results obtained on both datasets – the Twitter event corpus and the adulteration-related article corpus.
βˆ™ We first show the performance of record extraction on the Twitter event dataset.
We annotate the corpus with a list of gold records. To measure recall, we use
a stable marriage algorithm between the list of distinct records extracted and
the gold list. We run two types of experiments here – one, where we assume
we have access to the gold list of candidates for every type of field, and another
where we determine a list of candidates for every field type from unstructured
text using a heuristic method.
Note that most of the system is unsupervised – the only parameters learnt from training data are the CRF parameters, which are trained on 1900 tweets. Our testing data consists of 240 tweets describing music events in New York. We tried extracting 15 records.
                        Precision   Recall
With gold records       0.61        0.58
Without gold records    0.44        0.42

Table 4.6: Recall and precision of record extraction on Twitter corpus
For comparison, we present the results of running a simple entity recognizer on
the same dataset. Note that in this task, we are trying to compute the precision
and recall of correctly tagging tokens of type “ARTIST” and “VENUE”, rather
than trying to extract entire records, which can be thought of as (artist, venue)
pairs.
Entity    Precision   Recall   F-measure
ARTIST    0.57        0.25     0.35
VENUE     0.85        0.88     0.87
Total     0.75        0.54     0.63

Table 4.7: Precision, recall and F1 of NER on Twitter event data
We now present the performance of our system on the adulteration dataset. We
try extracting records with three fields – food, adulterant and location.
                               Precision   Recall
Adulteration (53 articles)     0.40        0.38
Adulteration (116 articles)    0.47        0.23

Table 4.8: Recall and precision of record extraction on adulteration datasets
βˆ™ We look at clustering performance on two different subsets of our adulteration article dataset – a small one with few incidents (53 articles), and a larger one with more incidents (116 articles). For each dataset, we try classifying articles into 15 different clusters. To cluster adulteration-related articles, we use records consisting of food, adulterant and location of adulteration.
                            Precision   Recall   F-measure
Adulteration (subset)       0.51        0.49     0.50
Adulteration (complete)     0.17        0.39     0.24

Table 4.9: Recall, precision and F-measure of clustering using combined approach on adulteration dataset
There are a couple of reasons for the low performance:

– The performance of our clustering approach is still dependent on the underlying Named Entity Recognizer. Recall from Table 4.2 that predicting β€œADULTER” tags on unseen data is still fairly noisy – the F-measure of just recognizing adulterants is low (0.313).

For example, in an article about the adulteration of scotch, the sentence

At one bar, a mixture that included rubbing alcohol and caramel coloring was sold as scotch.

indicates that β€œscotch” was adulterated with β€œrubbing alcohol” and β€œcaramel coloring”. However, there exist instances where β€œscotch” appears in the training data as an adulterant, so πœ‘π‘†πΈπ‘„(scotch, FOOD) is low, which prevents the record (scotch, rubbing alcohol) from being extracted from that particular article.

With a more robust recognizer (perhaps trained on cleaner and less noisy data), we should be able to obtain superior clustering results.
– Using πœ‘π‘ˆ 𝑁 𝑄 to ensure uniqueness of clusters doesn’t work overly well in
practice. Finding a better way to disambiguate between similar records
and clusters remains an area of future work.
– We compute consistency potentials πœ‘πΆπ‘‚π‘ over 5-sentence stretches instead of over the entire article in the interest of efficiency. This seems reasonable in most cases – all fields of a record should occur in the article relatively close to each other. Finding an optimal value for the β€œparagraph” length may give us better results.
Chapter 5
Contributions and Future Work
5.1 Contributions
The main contributions of this thesis are as follows,
βˆ™ Developing an algorithm that can handle sparse data and an imbalance in the number of class labels to produce reasonable performance on the (food, adulterant) co-existence determination task, by casting the problem as a ranking problem instead of a standard classification problem.
βˆ™ Using existing query refinement techniques to retrieve articles related to adulteration. We presented two high-level approaches – one where we try to find
articles related to a specific incident, using our (food, adulterant) co-existence
detection algorithm, and another where we try to filter out adulteration-related
articles from a steady stream of news articles scraped from a generic news source
like Google News.
βˆ™ Implementing a system that combines record extraction and clustering to aid
in de-duplication. We tested our system on two datasets – a Twitter dataset
containing tweets describing music concerts in New York, and our standard
dataset of adulteration articles. We compared this approach to standard approaches used to solve each of these tasks individually.
5.2 Future Work
The following are major areas of future work,
βˆ™ Improving the quality of retrieval results, by making the retrieval methods described in Chapter 3 more robust and better tuned to the task of retrieving
articles related to adulteration.
βˆ™ Improving the performance of the combined extraction and clustering system described in Chapter 4. Currently, running our system with records containing 4 fields is prohibitively expensive, so scaling up the system is also an area of future work, especially if we want to make the system online. Clustering accuracy can be improved upon as well.
βˆ™ Building inference models that leverage the created dataset to better understand
the factors that go into an adulteration incident.
Bibliography
[1] Eugene Agichtein and Luis Gravano. Snowball: Extracting relations from large plain-text collections. In ACM DL, 2000.

[2] Andrew J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 1967.

[3] Edward Benson, Aria Haghighi, and Regina Barzilay. Event discovery in social media feeds. In ACL, 2011.

[4] Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD, 2001.

[5] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL, 2005.

[6] Thorsten Joachims. Transductive learning via spectral graph partitioning. In ICML, 2003.

[7] Christopher D. Manning, Prabhakar Raghavan, and Hinrich SchΓΌtze. Introduction to Information Retrieval, 2009.

[8] Walter Gander, Gene Golub, and Urs von Matt. A constrained eigenvalue problem, 1989.