Building and processing a dataset containing articles related to food adulteration

Building and processing a dataset containing articles
related to food adulteration
by
Deepak Narayanan
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2015
© Massachusetts Institute of Technology 2015. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science
May 22, 2015
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Regina Barzilay
Professor
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Albert Meyer
Chairman, Masters of Engineering Thesis Committee
Building and processing a dataset containing articles related
to food adulteration
by
Deepak Narayanan
Submitted to the Department of Electrical Engineering and Computer Science
on May 22, 2015, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Computer Science and Engineering
Abstract
In this thesis, I explored the problem of building a dataset containing news articles
related to food adulteration, and curating this dataset in an automated fashion. In particular, we looked at food–adulterant co-existence detection, query reformulation,
entity extraction, and text deduplication. All proposed algorithms were implemented
in Python, and performance was evaluated on multiple datasets. The methods described
in this thesis can be generalized to other applications as well.
Thesis Supervisor: Regina Barzilay
Title: Professor
Acknowledgments
First and foremost, I would like to thank my advisor, Prof. Regina Barzilay, for giving
me my first real opportunity at academic research. Without her feedback, this thesis
would not be possible.
I would also like to thank Prof. Tommi Jaakkola for his insightful comments – a
lot of the initial work in this thesis was a by-product of suggestions made by him.
Rushi Ganmukhi has been a major contributor to the FDA project over the last
year, and a lot of the work in this thesis was done in close collaboration with him.
To the other members of the RBG group – thank you! Attending weekly group
meetings has given me a glimpse into how academic research is done. In particular, Karthik Narasimhan and Tao Lei were always extremely willing to answer my
questions, however dumb.
Last, I would like to thank my mom for always pushing me to become better,
both as a person and as a student. I would not be here without her support and
dedication.
Contents
1 Introduction

2 Food and adulterant co-existence
  2.1 SVM classifier
  2.2 Collaborative filtering classifier
  2.3 Passive-aggressive ranker
  2.4 Evaluation
    2.4.1 Collaborative filtering classifier
    2.4.2 Passive-aggressive ranker

3 Query refinement and relevance feedback
  3.1 Pseudo-relevance feedback with a transductive classifier
    3.1.1 Transductive learning
    3.1.2 Relevance feedback
  3.2 Query template selection
  3.3 Filtering for adulteration-related articles using a Language Model
    3.3.1 Evaluation

4 Article clustering and record extraction
  4.1 Named Entity Recognition using Conditional Random Fields
    4.1.1 Mathematical formulation
    4.1.2 Training
    4.1.3 Finding optimal label sequence
    4.1.4 Entity recognition
    4.1.5 Evaluation
  4.2 Clustering
    4.2.1 Clustering using k-means
    4.2.2 Clustering using HCA
    4.2.3 Evaluation
  4.3 Combining entity recognition and clustering
    4.3.1 Mathematical tools
    4.3.2 Task setup
    4.3.3 Datasets
    4.3.4 Evaluation

5 Contributions and Future Work
  5.1 Contributions
  5.2 Future Work
List of Figures
1-1 Different data sources that can be used to extract information about adulteration
3-1 Example of query refining
3-2 Example of patterns extracted from tweets describing music events in New York
3-3 Design of Snowball system
4-1 CRF model – here π‘₯ is the sequence of observed variables and 𝑠 is the sequence of hidden variables
4-2 Example clustering of documents using HCA
4-3 Pseudocode of HCA algorithm, with termination inter-cluster similarity threshold
List of Tables
2.1 F-measures obtained with collaborative filtering with different 𝑑 values
2.2 MAP scores on adulteration dataset
4.1 Precision, recall and F1 of NER on training data
4.2 Precision, recall and F1 of NER on testing data
4.3 F-measure of clustering algorithms with and without keyword extraction on Google News articles
4.4 F-measure of clustering algorithms with and without keyword extraction on set of 53 adulteration-related articles
4.5 F-measure of clustering algorithms with and without keyword extraction on set of 116 adulteration-related articles
4.6 Recall and precision of record extraction on Twitter corpus
4.7 Precision, recall and F1 of NER on Twitter event data
4.8 Recall and precision of record extraction on adulteration datasets
4.9 Recall, precision and F-measure of clustering using combined approach on adulteration dataset
Chapter 1
Introduction
Over the last couple of years, instances of food adulteration have been on the rise.
Here in the United States, the Food and Drug Administration (FDA) regulates and
enforces laws over food safety. With food-related illnesses becoming increasingly common, the FDA has recently become more interested in identifying and understanding
the risks associated with imported FDA-regulated products.
The reasons for increased risk are often varying and hard to pinpoint. For instance,
a couple of years ago, milk imported from China contained an increased amount of
melamine (a chemical that can be extremely harmful to human beings but often
added to foods to increase their apparent protein content) – this was later attributed
to the increased grain prices in China. In order to make ends meet and cut costs,
farmers added melamine to heavily watered-down milk to keep the protein content of
the diluted milk high.
In addition, since food adulteration is largely decentralized, one of the primary
challenges with understanding the causes of such risks is being able to piece together
a number of disparate data sources. With the unified data, it would be possible
to run inference models to obtain timely prospective information about instances of
increased risk.
The main goals of this thesis are two-fold – one, building a system capable of retrieving documents (in particular, news articles) that talk about specific adulteration
incidents, and two, developing algorithms that are capable of curating this dataset
and making it more usable in an automated way.
Retrieval of relevant news articles
Our approach to retrieving all news articles related to food adulteration largely relies
on the ability to determine if a particular food product and adulterant can co-exist
with each other. To this end, much of the initial work presented in this thesis is concentrated on building a classifier that is capable of determining if a food product and
an adulterant can co-exist. We have information about some cases of adulteration,
but have no information about most (food, adulterant) pairs. Given that similar
food products are more likely to be adulterated by the same adulterants, and symmetrically, similar adulterants are more likely to adulterate the same foods, it seems
possible to build a classifier that predicts such co-existences, using information about
foods and adulterants that is easily available.
One approach that we considered involved using a standard classification algorithm (such as SVM) to classify pairs of foods and adulterants – here each pair has
some feature representation, and we try to classify each such vector as positive or
negative. One inherent problem with this method is that it doesn’t really take into
account interactions between different food and adulterant features (even though some
of these relationships can potentially be exposed by using kernels). It seems much
more natural to solve this problem using collaborative filtering – a technique used by
many recommender systems.
Collaborative filtering is very loosely defined as the process of filtering for information or patterns using techniques that combine multiple viewpoints or data sources.
Netflix extensively uses collaborative filtering for movie recommendations. Most people have not watched all movies, so a good way of predicting a particular person’s
reaction to a movie is to first find people with similar movie tastes, and then use the
reviews of these like-minded people to generate a predicted rating.
The two approaches we’ve described so far face the same inherent problem – for
our particular task there are no negative training examples (because we only have
information about a food product and adulterant co-existing, we don’t have any
information about a food product and adulterant not co-existing) – a naive way of
overcoming this problem is to cast all (or just a subset of) (food, adulterant) pairs not
seen in our data sources as negative in the training step.
However, this seems like a poor way to solve this problem since some of these (food,
adulterant) pairs may in fact co-exist (and with this approach we would mark them
as negative in training). Moreover, our main goal remains trying to make predictions
about unseen food and adulterant pairs – labeling all of these pairs as negative in our
training seems at odds with the goals of our task, and makes it less likely that we
will in fact be able to discover new co-existing (food, adulterant) pairs. Later in this
thesis, we present an approach that attempts to circumvent this problem in a more
natural manner, by looking at the problem from the perspective of ranking.
Once we have a sufficiently robust classifier which can predict co-existences with
reasonable accuracy and precision, the next important step is to develop an automated
process that not only generates queries given food products and adulterants that coexist with each other, but also retrieves documents relevant to the generated query.
We assume here that we have access to a standard information retrieval system like
Google that is capable of retrieving documents relevant to a particular query.
Our query formulation and information retrieval tasks are in fact intertwined,
since the only way we can really evaluate the performance of the queries generated by
the above process is by evaluating the quality of the documents generated by running
a retrieval task with the given query. We talk about some ways of generating suitable
queries given metadata about the incident in Chapter 3 of this thesis.
At the end of chapter 3, we briefly discuss another way of looking at the article
retrieval problem.
Curation of dataset
Creating a dataset of incidents that is easily usable isn’t confined to merely retrieving
documents that describe an adulteration incident. In order for these articles to be
“understandable” by a computer system, some additional steps are imperative – for
example, we often need to extract recognized entities from an unstructured blob of
text, because computer programs understand structured information a lot better than
unstructured information.
Furthermore, extracting well-defined fields from unstructured text allows us to
access subsets of our dataset in much easier ways. For example, extracting the food
and adulterant from every adulteration-related article allows us to look at adulteration
incidents involving “milk” and “melamine” easily. We present results of running the
current state-of-the-art entity recognition system, which makes use of a CRF, on our
dataset of adulteration articles.
In addition, since articles are often extracted from a number of diverse sources,
articles describing the same incident may in fact be lexically very different. Figuring
out that these articles are “duplicates” is not as trivial as comparing two strings.
We model deduplication as a clustering task, where we’re trying to classify articles
related to the same incident into the same cluster, and classify unrelated articles into
different clusters.
One can also think of making this task online – as soon as we retrieve an article,
we can run our duplicate detection algorithm to determine if the new article is a
duplicate of an existing article. This seems especially important when we’re using
our algorithms to detect potential new adulteration incidents in real time.
The majority of chapter 4 describes an approach using variational inference to
combine the above two tasks – record extraction and clustering – to reduce the noise
associated with running each of these tasks individually. Our hope is that articles
talking about the same incident should contain the same entities, and articles with
the same entities should be classified into the same cluster.
Use cases of this dataset
Figure 1-1: Different data sources that can be used to extract information about adulteration

Once we have a database of news articles related to cases of adulteration, we can
move on to building models that attempt to understand the types of events that may
lead to elevated risks. These indicators are most likely to be some complex pattern
at the intersection of multiple data sources (for example, events now could lead to an
increased number of adulteration incidents four months down the road), which is why
it's imperative to first build a reliable retrieval system that is capable of collecting
data from these multiple data sources.
Note that even though the techniques presented in this thesis were tuned for
the specific application of retrieving adulteration-related articles, the overall techniques are easily generalizable and can be applied to other, more generic, information
retrieval tasks.
Chapter 2
Food and adulterant co-existence
Our approach to retrieving articles requires a relatively comprehensive list of coexisting foods and adulterants. Keeping this in mind, the first task we look at is the
task of detecting whether a food and adulterant can co-exist with each other. With a
classifier capable of predicting food-adulterant co-existence with reasonable accuracy,
we can retrieve news articles that describe an incident involving the particular food
and adulterant.
2.1
SVM classifier
Our first approach, which is also the most naive approach, is to build a simple Support
Vector Machine classifier that classifies (food, adulterant) pairs as positive or negative
depending on whether the food and adulterant indeed co-exist or not.
Given a feature vector representation π‘₯ for a food, and a feature vector representation 𝑧 for an adulterant, we construct a feature vector representation for a (food,
adulterant) pair as vec(π‘₯𝑧 𝑇 ), which allows food features and adulterant features to
interact jointly (unlike concatenating π‘₯ and 𝑧 for example) – for example, if π‘₯ has a
feature which takes value 1 if the food is high in protein content and 0 otherwise, and
𝑧 has a feature which takes value 1 if the adulterant is melamine and 0 otherwise,
then the (food, adulterant) pair would have a feature which has value 1 if the food is
high in protein content and the adulterant is melamine – such a representation gives
19
the classifier more information to work with.
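As a small illustration of this construction (a minimal sketch; the specific feature names are hypothetical and not the feature set used in this thesis), the pair representation is just the flattened outer product of the two vectors:

```python
import numpy as np

# Hypothetical food and adulterant feature vectors (binary indicators).
x = np.array([1.0, 0.0, 1.0])   # e.g. [high in protein, dairy, imported]
z = np.array([0.0, 1.0])        # e.g. [heavy metal, melamine]

# Feature vector for the (food, adulterant) pair: vec(x z^T).
# Entry (i, j) is 1 exactly when food feature i and adulterant feature j are both 1,
# so the pair gets a feature that fires only for "high-protein food AND melamine".
pair_features = np.outer(x, z).ravel()
print(pair_features)            # length d1 * d2
```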
Given the feature vector representation of a (food, adulterant) pair, we try classifying different pairs as positive or negative by attempting to find a separator between
the positive and negative pairs seen in the training set. Note that this separator need
not be a straight line and can take different shapes depending on the type of kernel
used.
However, we observed that this approach has the following fundamental problems,
βˆ™ Our training data fundamentally doesn’t have negative examples, because we
can only observe positive examples.
Not observing a (food, adulterant) pair can mean one of two things – one,
the food and adulterant indeed can’t co-exist, or two, the food and adulterant
co-exist but we just don’t have evidence of them co-existing.
In our linear classifier approach, we are forced to have some negative examples
provided to the classifier as part of the training data; otherwise our classifier
will learn to classify every example as positive! We can choose a subset of (food,
adulterant) pairs that we haven’t seen to be “negative examples” – it is possible
with this approach to label examples that are truly positive to be negative,
which would lead the classifier to learn incorrect relationships.
βˆ™ A standard linear classifier is also not able to take full advantage of the similarities between different foods and different adulterants. It intuitively makes sense
that similar foods are adulterated by similar adulterants and similar adulterants
adulterate similar foods, but a standard maximum-margin classifier is not able
to leverage this information effectively.
2.2
Collaborative filtering classifier
Collaborative filtering seems more compatible with the task at hand, since collaborative filtering inherently attempts to leverage similarity in different domains to make
better classification decisions.
Our collaboratively-filtered classifier can be mathematically formulated as finding
a linear classifier of the form π‘₯𝑇 π‘ˆ 𝑉 𝑇 𝑧, where π‘₯ is the feature vector representation
of a food product and 𝑧 is the feature vector representation of an adulterant, and π‘ˆ
and 𝑉 are weight parameter matrices.
Here, we assume that the food vector π‘₯ has dimension 𝑑1 ×1, the adulterant vector
𝑧 has dimension 𝑑2 × 1, and the parameters π‘ˆ and 𝑉 have dimensions 𝑑1 × π‘‘ and 𝑑2 × π‘‘
respectively (𝑑 is an adjustable parameter).
Training this classifier involves finding the optimal parameters π‘ˆ and 𝑉 that
minimize the loss function. In this case, we will use the quadratic loss function,
but other loss functions can be used as well. Mathematically, our task is to find
parameters π‘ˆ and 𝑉 that minimize total loss given as,
𝐿(π‘ˆ, 𝑉 ) =
∑︁
(𝑦𝑖𝑗 − π‘₯𝑇𝑖 π‘ˆ 𝑉 𝑇 𝑧𝑗 )2
(𝑖,𝑗)∈𝑇
Here, 𝑇 is the set of food-adulterant pairs (𝑖, 𝑗) in our training set, and 𝑦𝑖𝑗 is the
gold label of the pair (𝑖, 𝑗) – as with the linear classifier approach described above,
we need to choose a subset of (food, adulterant) pairs to be “negative”. As is evident,
this formulation is trying to move predictions as close to true labels as possible.
We can think of this entire setup as a low rank decomposition of a weight matrix
π‘Š.
To prevent overfitting, we introduce a regularization term (the sum of the Frobenius norms of π‘ˆ and 𝑉 ), to obtain the following expression,
𝐿(π‘ˆ, 𝑉 ) =
∑︁
𝐢 · (𝑦𝑖𝑗 − π‘₯𝑇𝑖 π‘ˆ 𝑉 𝑇 𝑧𝑗 )2 + |π‘ˆ |2𝐹 + 𝑉 |2𝐹
(𝑖,𝑗)∈𝑇
Here, 𝐢 is a constant used to trade off between the loss function term and the
regularization term – the higher the value of 𝐢, the more the classifier listens to
training examples.
We want to find values of π‘ˆ and 𝑉 that minimize 𝐿(π‘ˆ, 𝑉 ).
To simplify notation, let us represent 𝑉 𝑇 𝑧𝑗 as 𝑣𝑗 . 𝐿(π‘ˆ, 𝑉 ) is now given by the
expression,

$$L(U, V) = \left( \sum_{(i,j) \in T} C \cdot \big(y_{ij} - \text{vec}(U)^T \text{vec}(x_i v_j^T)\big)^2 \right) + \|U\|_F^2 + \|V\|_F^2$$
Consider the partial derivative of 𝐿(π‘ˆ, 𝑉 ) with respect to π‘ˆ, set equal to zero:

$$\frac{\partial L}{\partial U} = 0$$

$$\Rightarrow 2C \left( \sum_{(i,j) \in T} \big(y_{ij} - \text{vec}(U)^T \text{vec}(x_i v_j^T)\big) \cdot \big(-\text{vec}(x_i v_j^T)\big) \right) + 2 \cdot \text{vec}(U) = 0$$

$$\Rightarrow \text{vec}(U) + C \left( \sum_{(i,j) \in T} \text{vec}(x_i v_j^T)\,\text{vec}(x_i v_j^T)^T \right) \text{vec}(U) = C \sum_{(i,j) \in T} y_{ij}\,\text{vec}(x_i v_j^T)$$
Now, let’s define 𝐴 as

$$A = I + C \left( \sum_{(i,j) \in T} \text{vec}(x_i v_j^T)\,\text{vec}(x_i v_j^T)^T \right)$$

and 𝑏 as

$$b = C \left( \sum_{(i,j) \in T} y_{ij}\,\text{vec}(x_i v_j^T) \right)$$
Note that 𝐴 is a matrix of dimension 𝑑1 𝑑 × π‘‘1 𝑑 and 𝑏 is a vector of dimension
𝑑1 𝑑 × 1.
With these definitions, we can see that

$$A \cdot \text{vec}(U) = b \;\;\Rightarrow\;\; \text{vec}(U) = A^{-1} b$$
We can compute the updated value of parameter 𝑉 similarly, with 𝐴′ defined as

$$A' = I + C \left( \sum_{(i,j) \in T} \text{vec}(z_j w_i^T)\,\text{vec}(z_j w_i^T)^T \right)$$
and 𝑏′ defined as

$$b' = C \left( \sum_{(i,j) \in T} y_{ij}\,\text{vec}(z_j w_i^T) \right)$$
(here, again for simplicity of notation, we define 𝑀𝑖 as π‘ˆ 𝑇 π‘₯𝑖 ). 𝐴′ is of dimension
𝑑2 𝑑 × π‘‘2 𝑑 and 𝑏′ is of dimension 𝑑2 𝑑 × 1.
With these definitions, we see that

$$A' \cdot \text{vec}(V) = b' \;\;\Rightarrow\;\; \text{vec}(V) = A'^{-1} b'$$
As with any collaboratively trained model, π‘ˆ and 𝑉 are initialized as random
matrices of the right dimension. Then, to obtain optimal values of π‘ˆ and 𝑉 , we first
fix π‘ˆ and find optimal 𝑉 , and then fix 𝑉 and find optimal π‘ˆ . We repeat the above
two steps until convergence.
Since we can have vastly different numbers of positive and negative examples, we
set different values of 𝐢 for examples with positive and negative labels. We set 𝐢+
to be equal to 1/𝑛− and 𝐢− to be equal to 1/𝑛+. This ensures that if we have fewer
positive examples, mislabeling a positive example is penalized more heavily. This
is a fairly standard technique used in other learning tasks as well to account for a
disparity in the number of examples of each label available in training.
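A minimal sketch of this alternating procedure in NumPy is shown below (the function name, data layout, and hyperparameters are illustrative assumptions, not the thesis implementation); it solves the two linear systems A·vec(π‘ˆ) = 𝑏 and A′·vec(𝑉) = 𝑏′ directly, which is only practical for small, dense feature vectors.

```python
import numpy as np

def fit_cf_classifier(X, Z, pairs, labels, d=3, n_iters=20, seed=0):
    """Alternating minimization for the bilinear classifier x^T U V^T z (sketch).

    X: (n_foods, d1) food features; Z: (n_adulterants, d2) adulterant features.
    pairs: list of (food_index, adulterant_index); labels: +1 / -1 gold labels.
    C is set per example: 1/n_neg for positives and 1/n_pos for negatives.
    """
    d1, d2 = X.shape[1], Z.shape[1]
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((d1, d))
    V = 0.1 * rng.standard_normal((d2, d))

    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    C = [1.0 / n_neg if y > 0 else 1.0 / n_pos for y in labels]

    for _ in range(n_iters):
        # Fix V, solve A vec(U) = b, where A = I + sum C f f^T and f = vec(x_i v_j^T).
        A, b = np.eye(d1 * d), np.zeros(d1 * d)
        for (i, j), y, c in zip(pairs, labels, C):
            f = np.outer(X[i], V.T @ Z[j]).ravel()
            A += c * np.outer(f, f)
            b += c * y * f
        U = np.linalg.solve(A, b).reshape(d1, d)

        # Fix U, solve A' vec(V) = b', where f = vec(z_j w_i^T) and w_i = U^T x_i.
        A, b = np.eye(d2 * d), np.zeros(d2 * d)
        for (i, j), y, c in zip(pairs, labels, C):
            f = np.outer(Z[j], U.T @ X[i]).ravel()
            A += c * np.outer(f, f)
            b += c * y * f
        V = np.linalg.solve(A, b).reshape(d2, d)

    return U, V  # predict co-existence with sign(x^T U V^T z)
```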
2.3
Passive-aggressive ranker
Both approaches we’ve described so far fail to deal with the lack of negative examples
effectively. We now describe an approach that doesn’t make assumptions about pairs
that we haven’t seen so far.
Instead of fundamentally treating this task as a classification problem where we
need to predict a label given some properties about a food and adulterant, we can
think of this task as a ranking problem where we try to rank pairs that we’ve seen
before higher than pairs we haven’t seen so far.
Let 𝐴 = {𝑦1 , 𝑦2 , . . . , π‘¦π‘š } be the set of all adulterants we have ever seen and
𝐹 = {π‘₯1 , π‘₯2 , . . . , π‘₯𝑛 } be the set of all foods we have ever seen. Then, since we
have seen an incomplete set of (food, adulterant) pairs in our training data, we can
represent our training data as,
π‘₯1
{𝑦𝑖11 , 𝑦𝑖12 , . . . , 𝑦𝑖1𝑠1 }
π‘₯2
..
.
{𝑦𝑖21 , 𝑦𝑖22 , . . . , 𝑦𝑖2𝑠2 }
..
.
π‘₯𝑛
{𝑦𝑖𝑛1 , 𝑦𝑖𝑛2 , . . . , 𝑦𝑖𝑛𝑠𝑛 }
where {𝑦𝑖11 , 𝑦𝑖12 , . . . , 𝑦𝑖1𝑠1 } is the set of adulterants that we know co-exist with the
food π‘₯1 , {𝑦𝑖21 , 𝑦𝑖22 , . . . , 𝑦𝑖2𝑠2 } is the set of adulterants that we know co-exist with the
food π‘₯2 , etc. For simplicity of notation, let us denote the set of adulterants that
co-exist with food π‘₯𝑗 as the set 𝑆𝑗 .
We now want to calculate a weight matrix π‘Š such that for all π‘˜ ∈ 𝑆𝑗,

$$x_j^T W y_k \geq \max_{l \notin S_j} x_j^T W y_l + 1$$
The main intuition here is that all adulterants that are known to co-exist with a
particular food should rank (or score) higher than all other adulterants that we have
no information about in the context of its co-existence with the particular food.
Note that we have no common scale across foods – we only make comparisons
between adulterants for a particular food. Also note that this formulation does not
need to make any assumptions about negative examples in the dataset – this ensures
that the model doesn’t see incorrect relationships in the training dataset.
There are two ways of going about solving this system of inequalities. The first
way involves minimizing the norm of the weight matrix, subject to the specified
constraints – similar to the way a maximum-margin classifier is formulated. This,
however, can blow up computationally.
We can also think of the above system of constraint equations in terms of loss
functions. Instead of satisfying the following constraint equation,

$$x_j^T W y_k \geq \max_{l \notin S_j} x_j^T W y_l + 1$$

we can think about minimizing the loss,

$$\text{Loss}\left(x_j^T W y_k - \max_{l \notin S_j} x_j^T W y_l\right)$$

where $\text{Loss}(z) = \max\{1 - z, 0\}$ (the hinge loss function).
We can now use a passive-aggressive like algorithm to update the weight parameters with every encountered training example. If our initial weight vector is π‘Š , then
for every (food, adulterant) pair (π‘₯𝑗 , π‘¦π‘˜ ) a new weight vector π‘Š ′ can be obtained by
minimizing,
$$\text{Loss}\left(x_j^T W y_k - x_j^T W y_m\right) + \frac{\lambda}{2} \cdot \|W' - W\|^2$$

where πœ† is a parameter that controls how close π‘Š′ is to π‘Š, and $m = \operatorname{argmax}_{l \notin S_j} x_j^T W y_l$.
(We can think of the second term as a regularization term that attempts to prevent
overfitting.)
This results in the following update,

$$W' = W + \eta\left(\text{vec}(x_j y_k^T) - \text{vec}(x_j y_m^T)\right)$$

Here, πœ‚ is the learning rate – the smaller the value of πœ‚, the closer π‘Š′ remains to
π‘Š.
We perform such updates to π‘Š for every (food, adulterant) pair that exists in our
training data, and repeat until convergence.
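A minimal sketch of this training loop, assuming the hinge loss and update above (the function name and data layout are illustrative, not the thesis implementation):

```python
import numpy as np

def train_pa_ranker(X, Y, coexist, eta=0.1, n_epochs=10):
    """Passive-aggressive-style ranker sketch.

    X: (n_foods, d1) food features; Y: (n_adulterants, d2) adulterant features.
    coexist: dict mapping food index j -> set of adulterant indices known to co-exist with it.
    Learns W (d1 x d2) so that known adulterants outscore unseen ones for each food.
    """
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_epochs):
        for j, positives in coexist.items():
            negatives = [l for l in range(len(Y)) if l not in positives]
            if not negatives:
                continue
            for k in positives:
                scores = X[j] @ W @ Y.T                      # score every adulterant for food j
                m = max(negatives, key=lambda l: scores[l])  # highest-scoring unseen adulterant
                if scores[k] - scores[m] < 1.0:              # hinge loss is active
                    W += eta * (np.outer(X[j], Y[k]) - np.outer(X[j], Y[m]))
    return W
```

At test time, adulterants for a previously unseen food π‘₯ can then be ranked by the scores π‘₯ᡀπ‘Šπ‘¦.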
2.4
Evaluation
2.4.1
Collaborative filtering classifier
We measure the performance of our collaborative filtering classifier using a metric
called F-measure which is commonly used in information retrieval.
Before defining what we mean by F-measure, we define two other metrics, called
precision and recall.
βˆ™ Precision: In information retrieval, precision is defined as the fraction of retrieved documents that are relevant. In our task, we represent a (food, adulterant) pair as a document. Mathematically, we can express this as,

$$\text{Precision} = \frac{\text{Number of relevant and retrieved documents}}{\text{Number of retrieved documents}}$$

βˆ™ Recall: Recall is defined as the fraction of relevant documents that are retrieved. Mathematically, this can be expressed as,

$$\text{Recall} = \frac{\text{Number of relevant and retrieved documents}}{\text{Number of relevant documents}}$$

βˆ™ F-measure: F-measure is the harmonic mean of precision and recall. That is,

$$\text{F-measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
We present the results of our collaborative filtering model below.
We ran our experiment on a European database of food adulteration incidents.
We extracted all foods and adulterants involved in adulteration incidents from the
database, and then split all such extracted foods 80 − 20 into training and testing
sets. Training and testing foods were then combined with all known adulterants to get
training (food, adulterant) pairs and testing (food, adulterant) pairs. Our training
set contained 280,000 pairs and our testing set contained 68,000 pairs.
We use a food feature vector representation that consists of a binary expansion of
the food’s food category along with a meta feature vector that is a binary expansion
of the origin country and country in which the incident was reported.
The adulterant feature vector consists of a binary expansion of the adulterant’s
adulterant ID.
                                        Average F-measure over 5 runs
Collaborative filtering with 𝑑 = 1                 0.29
Collaborative filtering with 𝑑 = 3                 0.36
Collaborative filtering with 𝑑 = 5                 0.38

Table 2.1: F-measures obtained with collaborative filtering with different 𝑑 values
We do not present results obtained with our SVM classifier because the results
obtained are extremely low, due to the reasons discussed earlier in this chapter.
The results of our collaborative filtering approach, even though somewhat promising, can still be improved. We hypothesize that not being able to deal with negative
examples in a clean way is the main reason why.
2.4.2
Passive-aggressive ranker
We measure the performance of our passive-aggressive ranker using a different metric
called Mean Average Precision (MAP). As with F-measure, the MAP score ranges
between 0.0 and 1.0, with a higher score indicating better ranking performance.
Before defining what we mean by Mean Average Precision, we explain Average
Precision.
Precision @ 𝐾 is defined as the fraction of relevant documents in the top 𝐾
documents retrieved by a query.
Mathematically,

$$\text{Average Precision} = \frac{\sum_{i=1}^{R} \text{Precision @ } K_i}{R}$$
where 𝑅 is the total number of relevant documents, and 𝐾𝑖 is the rank of the 𝑖th
relevant, retrieved document.
Mean Average Precision is the average precision averaged across multiple queries/rankings.
It is a metric for ranking performance that takes into account the percentage of relevant documents retrieved at various cutoff points – it is order-sensitive, in that an ordering with five relevant documents followed by five irrelevant
documents would have a much higher score than an ordering with five irrelevant
documents followed by five relevant documents.
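A small sketch of how these quantities can be computed (illustrative helper functions, not taken from the thesis code):

```python
def average_precision(ranked_items, relevant):
    """Average precision of one ranking: mean of Precision @ K_i over the relevant items."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)      # Precision @ K_i
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevant_sets):
    """MAP: average precision averaged across multiple queries / rankings."""
    return sum(average_precision(r, s) for r, s in zip(rankings, relevant_sets)) / len(rankings)

# Five relevant documents ranked first score much higher than ranked last.
print(average_precision(list("ABCDEFGHIJ"), set("ABCDE")))   # 1.0
print(average_precision(list("FGHIJABCDE"), set("ABCDE")))   # ~0.35
```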
We present the performance of our passive-aggressive ranker below. We use the
same dataset as above, (all foods are split 80 − 20 into training and testing sets). The
number of foods in the training set is 1273, and the number of foods in the testing set
is 337. The number of adulterants is 124. We train our ranker on the training set, and
then rank all adulterants for foods in the test set, given the gold list of adulterants
co-existing with the particular food.
As with our collaborative filtering classifier, our food vector representation consists
of a binary representation of the food’s category along with metadata information
about the incident. Our adulterant vector representation, as before, consists of a
binary representation of the adulterant’s adulterant ID.
To evaluate performance, we rank all adulterants for every food in our test set,
and compute the MAP score given the gold list of adulterants co-existing with a food
in the testing set, extracted from the European dataset.
                                                               Average MAP
Passive-aggressive ranker on training data                         0.84
Passive-aggressive ranker on testing data                          0.59
Passive-aggressive ranker on both training and testing data        0.75

Table 2.2: MAP scores on adulteration dataset
MAP has a relatively simple interpretation – the reciprocal of the MAP score gives
the rank of the returned, relevant document on average. For example, a MAP score
of 0.5 means that on average, returned relevant documents had a rank of 2; a MAP
score of 0.25 means that on average, returned relevant documents had a rank of 4,
etc.
Given this interpretation, our MAP score of around 0.59 indicates relatively good
performance since it means on average, co-existing adulterants ranked in the top 2
among all adulterants for a previously unseen food product.
Chapter 3
Query refinement and relevance
feedback
Query refinement is a problem that is of much interest in the field of Information
Retrieval. Given that people often find it hard to accurately express their search
needs in the form of a short and concise query, popular search engines such as Google
and Bing use query refinement techniques to improve the quality of search results
returned to the user. Spelling correction and query expansion (Google’s "did you
mean" feature) are popular applications of query refinement.
Very simplistically, query refinement can be thought of as a ranking procedure –
given a query, potential query replacements are generated, which can then be sorted
using a scoring function to generate the “best” refined queries.
The scoring function used is heavily dependent on the task at hand – for example, for spelling correction, a good scoring function is the edit distance between the
candidate query and the original query, since common misspellings can often be attributed to typos. Better scoring functions also take into account context to improve
the quality of results – for example, even though “there” and “their” are both valid
words with similar spellings, they obviously cannot be used interchangeably.
The major problem with query refinement is that for it to be effective, large
amounts of data are needed. In practice, both spelling correction and query expansion depend on query logs (extensive records of how users have searched in the
past) for both a good list of possible candidates, as well as accurate enough data to
meaningfully rank the generated list of candidates.

Figure 3-1: Example of query refining
We don’t really have query logs at our disposal in this problem, which means it
is unclear how to use standard query refinement techniques to solve our problem –
in fact, the problem we are trying to solve involves generating queries that would
retrieve better data, so using a technique that requires a lot of data creates a classic
chicken-and-egg problem.
Relevance feedback is a technique frequently used by information retrieval systems
to improve the quality of search results returned to a user, and does not require a large
amount of data. Relevance feedback uses user input to get a better idea of the type
of results the user is looking for. Relevance feedback algorithms usually work well when
all relevant documents can be clustered together, and when it’s relatively straightforward to distinguish between relevant and irrelevant documents for a particular
query.
Since it’s impractical to expect someone to manually annotate all retrieved documents as good or bad in an offline system that does not accept user feedback in
an easy way, it is necessary to devise a method to use relevance feedback in a more
automated way.
Given a classifier capable of determining if a particular document is a news article
reporting an incident of adulteration, we can use relevance feedback on our problem –
start with some initial version of a query, rate each document in the set of documents
retrieved by that query as appropriate or not using the classifier and then reiterate
on the query until a reasonable number of relevant documents are retrieved. We call
this modified relevance feedback approach pseudo-relevance feedback.
Standard classification algorithms, however, often require a fair amount of training
data to generate reasonably correct results, which increases the amount of annotation
required. In this chapter, we describe a transductive classification approach that
attempts to produce a generalizable classifier with minimal training data.
We begin by describing the general transductive learning setup with pseudo-relevance feedback, and then discuss alternate ways of retrieving relevant documents.
3.1
Pseudo-relevance feedback with a transductive
classifier
We first discuss our transductive classification scheme, and then discuss a potential
pseudo-relevance feedback approach.
3.1.1
Transductive learning
In transductive learning, we try to predict the labels of test examples directly by observing training examples. This is in contrast to inductive learning where we attempt
to infer general rules from training examples, and then apply these general rules to
test examples.
Since transductive learning is trying to solve an easier task (by not trying to learn
a generalizable mapping), transductive tasks most often require less training data
(and hence less annotation) than inductive tasks.
In a transductive inference setup [6], we are given 𝑛 points $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n$, with
true labels $y_1, y_2, \ldots, y_n$.

At training time, we are given the true labels of a subset of points $Y \subset \{y_1, y_2, \ldots, y_n\}$
(known as the training points), and we want to predict the labels $y \in \{-1, 1\}$ of all
other points as accurately as possible.
The main difference between this setup and that of a standard inductive learning
task is that we have access to the testing examples during training.
We want to train a transductive learner that is self-consistent – that is, our classifier should not be overly affected by any one training example. If a classifier is overly
influenced by a couple of training examples, then it is less generalizable and more likely
to make mistakes on examples it hasn’t seen before. A metric for self-consistency is
leave-one-out error.
Consider a π‘˜-nearest neighbor classifier that determines the label of an unseen
point π‘₯ based on the labels of the majority of π‘₯’s π‘˜ nearest neighbors.
Define 𝛿𝑖 as,

$$\delta_i = y_i \, \frac{\sum_{j \in KNN(\vec{x}_i)} y_j}{k}$$
If 𝛿𝑖 < 0, then clearly we will make a leave-one-out error on π‘₯𝑖 , since 𝛿𝑖 being less
than 0 implies that a majority of π‘₯𝑖 ’s π‘˜ nearest neighbors have a label opposite to
π‘₯𝑖 ’s true label.
We can modify the standard π‘˜-nearest neighbor classifier presented above to include weights, creating a weighted π‘˜-nearest neighbor classifier. If 𝑀𝑖𝑗 is the similarity
between data points 𝑖 and 𝑗, then if
$$\delta_i = y_i \, \frac{\sum_{j \in KNN(\vec{x}_i)} y_j w_{ij}}{\sum_{m \in KNN(\vec{x}_i)} w_{im}}$$
we will have a leave-one-out error on π‘₯𝑖 if 𝛿𝑖 < 0.
Given that we want to keep leave-one-out error as small as possible, our goal
should be to make 𝛿𝑖 as large as possible (larger 𝛿𝑖 implies lower error). We can
formulate this as the following optimization problem.
$$\max_{\vec{y}} \;\; \sum_{i=1}^{n} \sum_{j \in KNN(\vec{x}_i)} y_i y_j \, \frac{w_{ij}}{\sum_{m \in KNN(\vec{x}_i)} w_{im}}$$

$$\text{s.t. } y_i = 1 \text{ if } i \in Y \text{ and positive}; \quad y_i = -1 \text{ if } i \in Y \text{ and negative}; \quad \forall j,\; y_j \in \{-1, 1\}$$
Let us define $A_{ij}$ as

$$A_{ij} = \frac{w_{ij}}{\sum_{m \in KNN(\vec{x}_i)} w_{im}}$$

Then we see that for our objective to be maximized, we need to maximize

$$\sum_{i=1}^{n} \sum_{j \in KNN(\vec{x}_i)} y_i y_j A_{ij}$$
for all (𝑖, 𝑗) for which 𝑦𝑖 𝑦𝑗 = −1.
We can now think of the π‘₯𝑖 ’s as vertices in a graph, and 𝐴𝑖𝑗 as the weight of the
edge from 𝑖 to 𝑗. Then if 𝐺+ represents the sub-graph with all positive vertices (all
vertices with label of +1), and 𝐺− represents the sub-graph with all negative vertices
(all vertices with label −1), we need to maximize −cut(𝐺+ , 𝐺− ) (the sum of edge
weights between vertices in 𝐺+ and 𝐺− ), or minimize cut(𝐺+ , 𝐺− ).
We can think of this classifier now as a modified clustering task, where we’re
trying to classify unlabeled examples close to positive training examples as positive
and classify unlabeled examples close to negative training examples as negative.
Note that the problem, as currently formulated, closely resembles the classic min-cut problem. However, simply finding a solution to this problem may lead to potentially highly undesirable solutions – for example, a viable partition could involve one
sub-graph having just one vertex, and the other sub-graph having all other vertices.
This is a degenerate solution.
Part of the reason for this problem is the fact that our above formulation doesn’t
take into account the number of edges crossing the cut – the degenerate solution
which includes one vertex on one side, and all other vertices on the other side has a
very small number of edges passing through the cut, which automatically prefers cuts
of smaller weight.
This problem can be overcome by instead trying to minimize $\frac{\text{cut}(G_+, G_-)}{|G_+|\,|G_-|}$ (the
average cut weight) subject to the same constraints as before.
Obtaining an exact solution to this problem is extremely hard. However, there
exist efficient spectral methods that approximate the solution to this problem. We
describe one such method below [4].
Before proceeding, let us define the Laplacian 𝐿 of a graph 𝐺 as the 𝑛 × π‘› symmetric matrix whose (𝑖, 𝑗)th entry is given by,

$$L_{ij} = \begin{cases} \sum_k E_{ik}, & \text{if } i = j \\ -E_{ij}, & \text{if } i \neq j \text{ and edge } (i, j) \text{ exists} \\ 0, & \text{otherwise} \end{cases}$$
We claim that for any arbitrary 𝑛 × 1 vector π‘₯, $x^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2 E_{ij}$.
This is easy to observe from the fact that the Laplacian matrix 𝐿 can be expressed
as $I_G I_G^T$, where $I_G$ is the incidence matrix of the graph 𝐺 and has dimensions $n \times m$.
The column in matrix $I_G$ representing the edge between vertices 𝑖 and 𝑗 has all
zeros except at the 𝑖th and 𝑗th rows, which have values $\sqrt{E_{ij}}$ and $-\sqrt{E_{ij}}$ respectively.
$x^T L x$ can now be expressed as $(I_G^T x)^T (I_G^T x)$.
Observe that $I_G^T x$ is a column vector whose 𝑒th entry is $\sqrt{E_{ij}}(x_i - x_j)$, where 𝑒 is the
edge between vertices 𝑖 and 𝑗. This implies that $(I_G^T x)^T (I_G^T x) = \sum_{(i,j) \in E} (x_i - x_j)^2 E_{ij}$.
If we constrain $z_i$ to only take values $\{\gamma_+, \gamma_-\}$, where $\gamma_+ = \sqrt{\frac{|\{i : z_i < 0\}|}{|\{i : z_i > 0\}|}}$ and
$\gamma_- = -\sqrt{\frac{|\{i : z_i > 0\}|}{|\{i : z_i < 0\}|}}$, we see that $z^T L z$ is equal to

$$\frac{n^2}{|\{i : z_i < 0\}|\,|\{i : z_i > 0\}|} \, \text{cut}(G_+, G_-)$$

provided $z_i = \gamma_+$ for all positive points and $z_i = \gamma_-$ for all negative points.
𝑛 is a constant (total number of examples), which implies that minimizing the
average cut subject to constraints is equivalent to minimizing 𝑧 𝑇 𝐿𝑧 subject to the
constraints.
In addition, observe that from the way 𝑧 was constructed above, $z^T z = n$ and
$z^T \mathbf{1} = 0$.
Using all this information, we can formulate our objective as the following optimization problem (note that we’re still ignoring our original constraint equations),
$$\min_z \; z^T L z \quad \text{s.t. } z^T \mathbf{1} = 0 \;\text{ and }\; z^T z = n$$
We try to ensure that the predictions $\vec{z}$ stay close to the expected predictions $\vec{\gamma}$
(that is, try to ensure that the constraints are satisfied) by adding a loss term to
the formulation above,

$$\min_z \; z^T L z + c(z - \gamma)^T C (z - \gamma) \quad \text{s.t. } z^T \mathbf{1} = 0 \;\text{ and }\; z^T z = n$$
Here, 𝐢 is a matrix that allows us to weight the importance of different labels
differently, and $\vec{\gamma}$ is a vector with the 𝑖th element equal to $\gamma_+$ if the 𝑖th element has
a positive label, $\gamma_-$ if the 𝑖th element has a negative label, and 0 otherwise.
We can find an approximate solution to the above system by performing an eigendecomposition of 𝐿, which is possible since 𝐿 is a symmetric matrix. Let’s express
𝐿 as $U \Sigma U^T$, where $\Sigma$ is a diagonal matrix containing the eigenvalues of 𝐿, and $U$ is
the eigenvector matrix and is unitary. Also, from the way 𝐿 is formulated, we see that
$\vec{\mathbf{1}}$ is an eigenvector of 𝐿, with a corresponding eigenvalue of 0.

Let $V$ be the eigenvector matrix with $\vec{\mathbf{1}}$ removed, and $D$ be the eigenvalue matrix
with 0 removed. Now

$$U \Sigma U^T \approx V D V^T.$$
Substituting $z = Vw$ produces,

$$\min_w \; w^T D w + c(Vw - \gamma)^T C (Vw - \gamma) \quad \text{s.t. } w^T w = n$$
[8] specifies how to obtain the optimal solution to this problem. Given an adjacency matrix 𝐴 of a graph, the following steps can be performed to
obtain label assignments that minimize the total leave-one-out error,

βˆ™ Compute the Laplacian 𝐿 of the matrix 𝐴.

βˆ™ Compute the smallest 2 to 𝑑 + 1 eigenvalues and eigenvectors of 𝐿 – these form
𝐷 and 𝑉 respectively.

βˆ™ Compute $G = (D + cV^T C V)$ and $b = cV^T C \vec{\gamma}$, and find the smallest eigenvalue $\lambda^*$ of the matrix

$$\begin{bmatrix} G & -I \\ -\frac{1}{n} b b^T & G \end{bmatrix}$$

βˆ™ Compute predictions as $z^* = V(G - \lambda^* I)^{-1} b$. These 𝑧’s can be used for
ranking.

βˆ™ An appropriate threshold can be selected to get discretized class labels of +1
and −1.
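Below is a minimal NumPy sketch of these steps, under the assumption that the similarity graph is small enough to handle with dense matrices (names and the handling of the block matrix are illustrative; this is not the thesis implementation):

```python
import numpy as np

def spectral_transduction(A, labels, d=10, c=1.0):
    """Sketch of the spectral procedure described above.

    A: (n, n) symmetric similarity / adjacency matrix.
    labels: length-n array with +1 or -1 for the annotated points and 0 for unlabeled points.
    Returns real-valued predictions z* that can be thresholded or used directly for ranking.
    """
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A                    # graph Laplacian

    eigvals, eigvecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    D = np.diag(eigvals[1:d + 1])                     # 2nd to (d+1)-th smallest eigenvalues
    V = eigvecs[:, 1:d + 1]                           # corresponding eigenvectors (constant vector dropped)

    n_pos, n_neg = np.sum(labels > 0), np.sum(labels < 0)
    gamma = np.zeros(n)
    gamma[labels > 0] = np.sqrt(n_neg / n_pos)        # gamma_plus
    gamma[labels < 0] = -np.sqrt(n_pos / n_neg)       # gamma_minus
    C = np.diag((labels != 0).astype(float))          # weight only the labeled points

    G = D + c * V.T @ C @ V
    b = c * V.T @ C @ gamma
    M = np.block([[G, -np.eye(d)],
                  [-np.outer(b, b) / n, G]])
    lam_star = np.min(np.real(np.linalg.eigvals(M)))  # smallest eigenvalue of the block matrix

    z_star = V @ np.linalg.solve(G - lam_star * np.eye(d), b)
    return z_star
```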
3.1.2
Relevance feedback
The technique described above allows us to easily determine which articles are “positive” and which articles are “negative” given minimal annotation (one or two examples of each class). A pseudo-relevance feedback algorithm can then make use of
these predictions to improve the quality of search results produced.

However, if our classifier isn’t accurate enough, it is possible that we suffer from a
problem known as query drift, where the queries obtained by moving queries based
on classifier output are in fact worse than the initial queries we started off with,
because of incorrect predictions made by the classifier.

<ARTIST> performing Live at <VENUE>
So excited for <ARTIST> today at <VENUE>
<VENUE> is going down tonight with <ARTIST> performing!

Figure 3-2: Example of patterns extracted from tweets describing music events in
New York
In practice, we have observed that there is a fair amount of noise in our classifier
output, leading to sub-optimal results. Making the classifier more robust remains an
area of future work.
3.2
Query template selection
Instead of relying on a classifier’s performance on unseen data, we can try to improve
the retrieval process by drawing conclusions exclusively from the training data.
We use techniques borrowed from a system called Snowball [1] to detect query
templates from text. These extracted query templates can be subsequently used to
create queries that can be fed through a retrieval system to retrieve articles.
In this task, we want to extract generic phrases that are representative of the
contexts in which records are usually found in unstructured text. For example, let’s say we
were trying to extract contexts in which (artist, venue) tuples are found in a corpus of
tweets referring to music events in New York. Some of the templates that we would
like to extract from these tweets are shown in figure 3-2.
We start off with a list of “seed tuples”. We go through the corpus of documents,
and extract all phrases that contain occurrences of all fields in the record or tuple.
These phrases can be converted to generic patterns – every specific occurrence
of a record field is replaced by its field type. For example, “John Lennon performed
in Madison Square Garden today" can be converted to “<ARTIST> performed in
<VENUE> today".
Note that this extracted pattern does not allow any arbitrary word to precede
“performed”, or any arbitrary word to precede “today” – the words in those positions
must be tagged with the tags ARTIST and VENUE respectively. We use a Named-Entity Recognizer (discussed more in the next chapter) to extract tags for every word
in the corpus.
We can now use the patterns just extracted to obtain more tuples. These new
tuples can subsequently be used to generate more patterns – the process continues
until we extract enough records / patterns. The overall system can be understood
better from Figure 3-3.
Figure 3-3: Design of snowball system
We use a vector space model to ensure that patterns similar to each other, but
not completely identical, are not considered to be the same “pattern” by the system.
Once we have a list of possible query templates that capture a number of the
contexts in which the records are found in unstructured text, we can combine our
templates with metadata about the incidents (in this case, foods and adulterants
involved in the adulteration incident) to generate queries that can subsequently be
fed into a vanilla retrieval system to retrieve new articles that can be added to the
dataset.
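A highly simplified sketch of the pattern-generalization and matching steps is shown below (the tag names, helper functions, and the use of plain regular expressions are assumptions for illustration – the actual Snowball-style system additionally checks the NER tags of matched words and uses a vector space model to merge near-identical patterns):

```python
import re

def generalize(sentence, entities):
    """Replace each known entity occurrence with its field type to obtain a pattern.

    entities: dict mapping surface string -> tag, e.g. the fields of a seed tuple.
    """
    pattern = sentence
    for surface, tag in entities.items():
        pattern = re.sub(re.escape(surface), f"<{tag}>", pattern)
    return pattern

def match(pattern, sentence):
    """Turn a pattern into a regex that extracts new (ARTIST, VENUE) tuples (no NER check here)."""
    regex = re.escape(pattern)
    regex = regex.replace(re.escape("<ARTIST>"), r"(?P<ARTIST>.+?)")
    regex = regex.replace(re.escape("<VENUE>"), r"(?P<VENUE>.+?)")
    m = re.search(regex, sentence)
    return m.groupdict() if m else None

pattern = generalize("John Lennon performed in Madison Square Garden today",
                     {"John Lennon": "ARTIST", "Madison Square Garden": "VENUE"})
print(pattern)                                              # <ARTIST> performed in <VENUE> today
print(match(pattern, "The Strokes performed in Webster Hall today"))
```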
3.3
Filtering for adulteration-related articles using a
Language Model
Another approach to retrieving new adulteration-related articles is by filtering through
a stream of Google News articles for adulteration-related articles. This approach is
fundamentally different from the one we’ve discussed so far, where we’re looking for
articles describing specific incidents.
There are a couple of different ways we can try filtering for adulteration-related
articles. One way is to build a standard classifier that attempts to predict if the topic
of the article is “food adulteration”.
Another way is to build a language model [7] that attempts to differentiate between
the language used in news articles that talk about adulteration and the language used
in news articles that don’t talk about adulteration.
Our hypothesis here is that articles that talk about adulteration use a lot more
technical and domain-specific language than articles related to other topics like sports
or business.
We use a standard bigram language model with Kneser-Ney smoothing. One
can easily extend these ideas to higher 𝑛-grams.
Smoothing is a technique used in Language Models to deal with data sparsity. On
many occasions, we don’t see an 𝑛-gram in its entirety in our training set. To overcome
this problem, with smoothing, we back off to a simpler model.
For the sake of simplicity, we discuss Kneser-Ney smoothing in the context of
bigrams. In a standard backoff model, if we didn’t see a bigram in our training set,
we back off to using a standard unigram model. This works well sometimes; however,
it often doesn’t work well since a unigram model completely neglects the context
inherent in a sentence.
Instead, Kneser-Ney attempts to capture how likely a word is to appear in an unseen
context, by counting the number of unique words that precede the given word.
Mathematically, we can write this as,

$$\Pr\nolimits_{CONT}(w_i) \propto |\{w_{i-1} : c(w_{i-1}, w_i) > 0\}|$$

After normalizing the above expression appropriately so that $\Pr_{CONT}$ sums to 1
over all $w_i$, we obtain the following expression for $\Pr_{KN}$,

$$\Pr\nolimits_{KN}(w_i | w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - \delta, 0)}{c(w_{i-1})} + \lambda \, \frac{|\{w_{i-1} : c(w_{i-1}, w_i) > 0\}|}{|\{w_{j-1} : c(w_{j-1}, w_j) > 0\}|}$$

where πœ† is a normalizing constant.
3.3.1
Evaluation
We train our Language model on a subset of our corpus of adulteration-related
articles (200 articles), and then evaluate performance on the remaining subset of
adulteration-related articles along with non-adulteration related articles scraped from
Google News.
We use an average negative log-likelihood threshold to decide whether a news article
talks about adulteration or not. Using this approach, we obtain an F-measure of
0.61. We hypothesize that this approach will do a lot better with a more extensive
training dataset.
Chapter 4
Article clustering and record
extraction
For the purposes of our dataset creation task, the mere retrieval of relevant news
articles is not sufficient, simply because raw articles are unstructured and hence difficult to understand for a computer program. An important goal of this project is
to extract fields from the articles we manage to retrieve. In addition, it is useful to
de-duplicate articles on the basis of the incident they are describing. Later in this
section, we will describe an approach that attempts to accomplish both these tasks
in a single, unified manner.
Before describing our combined approach, we describe some well-known ways of
going about solving the individual problems described above – one, the problem of
extracting records from unstructured text, and two, clustering texts according to their
content.
4.1
Named Entity Recognition using Conditional Random Fields
Conditional Random Fields are a well-established way of extracting records from
text. CRFs are a class of machine learning algorithms that are used for structured
prediction.

Figure 4-1: CRF model – here π‘₯ is the sequence of observed variables and 𝑠 is the
sequence of hidden variables
CRFs aim at utilizing information inherently present in a sequence to make
better classification decisions. It is possible to perform a task such as Part-of-Speech
tagging by trying to predict the part-of-speech tag of each word in a sentence in
isolation. But this approach won’t work well, because there exist a number of words
that function as different parts-of-speech in different contexts – for example, the
word “address” can function as both a noun and a verb, so without the context
in which the word is used, it would be impossible to always predict the right tag
associated with the word.
4.1.1
Mathematical formulation
Mathematically, we can think of this as a task where we try to predict the optimum
label sequence 𝑙 given a sentence 𝑠. In other words, we want to find the label sequence
𝑙 that maximizes the conditional probability Pr(𝑙|𝑠).
Given a sentence 𝑠, we can score a labeling 𝑙 of 𝑠 using a scoring function,
$$\text{score}(l|s) = \sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j f_j(s, i, l_i, l_{i-1}) = \sum_{i=1}^{n} \bar{\lambda} \cdot \bar{f}(s, i, l_i, l_{i-1})$$
Observe from this formulation that a label depends on its preceding label, the
word it corresponds to and the entire sentence the word belongs to.
Here, πœ†π‘— is used to weight the different 𝑓𝑗 differently.
We can now express the probability of a label sequence given a sentence 𝑠 as,
$$\Pr(l|s) = \frac{\exp[\text{score}(l|s)]}{\sum_{l'} \exp[\text{score}(l'|s)]} = \frac{\exp\left[\sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j f_j(s, i, l_i, l_{i-1})\right]}{\sum_{l'} \exp\left[\sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j f_j(s, i, l'_i, l'_{i-1})\right]}$$
For a classic part-of-speech tagging problem, possible 𝑓𝑗 ’s that can be used are
βˆ™ 𝑓1 (𝑠, 𝑖, 𝑙𝑖 , 𝑙𝑖−1 ) = 1 if 𝑙𝑖 = VERB; 0 otherwise.
βˆ™ 𝑓2 (𝑠, 𝑖, 𝑙𝑖 , 𝑙𝑖−1 ) = −1 if 𝑙𝑖 = NOUN and 𝑙𝑖−1 = ADJECTIVE; 0 otherwise.
βˆ™ 𝑓3 (𝑠, 𝑖, 𝑙𝑖 , 𝑙𝑖−1 ) = 1 if 𝑠[𝑖] = THE; 0 otherwise.
βˆ™ 𝑓4 (𝑠, 𝑖, 𝑙𝑖 , 𝑙𝑖−1 ) = tf(𝑠[𝑖]) · idf(𝑠[𝑖]), where tf(𝑀) is the number of times the word
𝑀 occurs in the sentence 𝑠, and idf(𝑀) is the inverse document frequency of the
word 𝑀.
4.1.2
Training
A problem that we skirted in the previous section was how to set the feature weights
πœ†π‘— . It is clear that setting πœ†π‘— to arbitrary values will not work well, since we may be
giving importance to features that are in reality not that important.
It turns out that we can find optimal weight parameters using gradient ascent.
Gradient ascent is an optimization algorithm that attempts to find the local maximum of a multi-variable function 𝐹(π‘₯), making use of the fact that the function
𝐹(π‘₯) at some point 𝑑 increases fastest in the direction of the gradient of 𝐹(π‘₯) at the
point 𝑑.

Essentially, we start off with a randomly assigned initial weight vector, and then
try to move towards more optimal weight vectors by moving in the direction of the
gradient $\frac{\partial \log \Pr(l|s)}{\partial \lambda}$.
For every feature 𝑗 we can compute the gradient of log Pr(𝑙|𝑠) with respect to the
corresponding weight πœ†π‘—,

$$\frac{\partial \log \Pr(l|s)}{\partial \lambda_j} = \sum_{i=1}^{m} f_j(s, i, l_i, l_{i-1}) - \sum_{l'} \Pr(l'|s) \sum_{i=1}^{m} f_j(s, i, l'_i, l'_{i-1})$$
However, note that the number of possible 𝑙′ is exponential in the length of the
sentence, so this update looks intractable. However, observe that

$$\frac{\partial \log \Pr(l|s)}{\partial \lambda_j} = \sum_{i=1}^{m} f_j(s, i, l_i, l_{i-1}) - \sum_{i=1}^{m} \sum_{l'} \Pr(l'|s) f_j(s, i, l'_i, l'_{i-1})$$

$$= \sum_{i=1}^{m} f_j(s, i, l_i, l_{i-1}) - \sum_{i=1}^{m} \sum_{l'_{i-1}, l'_i} \Pr(l'_{i-1}, l'_i \,|\, s)\, f_j(s, i, l'_i, l'_{i-1})$$

Here we marginalized Pr(𝑙′|𝑠) over all labels in the label sequence 𝑙′ that are not
at indices 𝑖−1 and 𝑖. Computing the probability of all pairs of labels $(l'_{i-1}, l'_i)$ requires
$O(k^2)$ time, where π‘˜ is the number of possible labels – this is clearly tractable.
We can now move πœ†π‘— in the direction of this gradient, to obtain a new weight
value πœ†′𝑗 ,
πœ†′𝑗 = πœ†π‘— + 𝛼
πœ• log Pr(𝑙|𝑠)
πœ•πœ†π‘—
Here, as stated before, we are moving the weight vector in the direction of maximum change of the log-probability of a label sequence given a sentence. 𝛼 is the
learning rate. The larger the value of 𝛼, the faster πœ† moves in the direction of the
gradient; if the value of 𝛼 is very high however, the updates will be inaccurate, and
πœ† will move far away from its true optimal value.
¯ until convergence, or for a fixed number of iterations.
We make updates to πœ†
4.1.3
Finding optimal label sequence
Given a trained CRF, we want to find the optimal label sequence of a given sentence
𝑠. One way of doing this is to compute Pr(𝑙′ |𝑠) for every possible label sequence 𝑙′ of
a sentence 𝑠 to figure out which label sequence 𝑙′ is most likely given a sentence 𝑠.
However, for a sentence of length 𝑛, if the total number of labels for each word
is π‘˜, then the total number of labeling sequences for the sentence 𝑠 is π‘˜ 𝑛 , which is
exponential in the length of the sentence.
However, since our formulation only depends on pairs of adjacent labels, we can use
the Viterbi algorithm [2] to cut the complexity down to $O(nk^2)$.
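A sketch of the dynamic program, where `score` stands in for the weighted feature sum $\bar{\lambda} \cdot \bar{f}(s, i, l_i, l_{i-1})$ from the formulation above (the interface is an assumption made for illustration):

```python
import numpy as np

def viterbi(score, n_words, n_labels):
    """Sketch of Viterbi decoding for a pairwise-factored scorer (illustrative).

    score(i, label, prev_label) returns the local score for assigning `label` at position i
    given `prev_label` at position i-1; the best sequence is found in O(n k^2) time
    instead of enumerating all k^n label sequences.
    """
    best = np.zeros((n_words, n_labels))             # best score of any prefix ending in each label
    back = np.zeros((n_words, n_labels), dtype=int)  # argmax pointers for reconstruction

    for lab in range(n_labels):
        best[0, lab] = score(0, lab, None)
    for i in range(1, n_words):
        for lab in range(n_labels):
            candidates = [best[i - 1, prev] + score(i, lab, prev) for prev in range(n_labels)]
            back[i, lab] = int(np.argmax(candidates))
            best[i, lab] = candidates[back[i, lab]]

    # Follow back-pointers from the best final label.
    labels = [int(np.argmax(best[-1]))]
    for i in range(n_words - 1, 0, -1):
        labels.append(int(back[i, labels[-1]]))
    return labels[::-1]
```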
4.1.4
Entity recognition
We can utilize the above framework to recognize entities within an unstructured blob
of text. We can tag all entities that interest us with their respective tags, and all
words encountered in our training corpus that aren’t specific entities can be labeled
with a generic tag – let us call this generic tag β€˜O’.
Now, when we run our trained model on our testing corpus, any contiguous sequence of words tagged with the same tag is treated as a recognized entity of that
particular tag.
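A minimal sketch of this post-processing step, assuming the tokens of a document and their predicted tags are available as parallel lists (the example tokens and tags are illustrative):

```python
def extract_entities(tokens, tags):
    """Group contiguous runs of identically-tagged tokens into entities.

    Tokens tagged 'O' (the generic tag) are skipped; everything else is
    returned as (entity_text, tag) pairs.
    """
    entities, current, current_tag = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == current_tag and tag != "O":
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), current_tag))
            current, current_tag = ([token], tag) if tag != "O" else ([], None)
    if current:
        entities.append((" ".join(current), current_tag))
    return entities

# Example: extract_entities(["milk", "with", "melamine"], ["FOOD", "O", "ADULTER"])
# -> [("milk", "FOOD"), ("melamine", "ADULTER")]
```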
4.1.5 Evaluation
We run Stanford’s NER code [5] on our dataset of adulteration-related news articles,
trying to recognize foods and adulterants. We split our dataset of approximately 150
articles 70 − 30 into training and testing sets.
We obtain the following precision, recall and F-measure for the different fields.
Entity     Precision   Recall   F-measure
ADULTER    0.846       0.653    0.737
FOOD       0.998       0.973    0.985
Total      0.923       0.796    0.854

Table 4.1: Precision, recall and F1 of NER on training data
Entity     Precision   Recall   F-measure
ADULTER    0.522       0.224    0.313
FOOD       0.798       0.326    0.463
Total      0.667       0.279    0.393

Table 4.2: Precision, recall and F1 of NER on testing data
We see that on examples not yet seen, there is still a fair amount of noise in the accuracy of the fields extracted, especially for the adulterant field. As is expected, performance on the training set is significantly better than performance on the testing set.
Slightly more surprising is the difference in performance between foods and adulterants. This can probably be attributed to the fact that the names of adulterants are often extremely technical, which means the number of unseen adulterant tokens in the test dataset is a lot higher than the number of unseen food tokens.
4.2 Clustering
Our dataset consists of a number of articles that describe instances of adulteration. These articles are collected from a number of very different sources, and often describe the same adulteration incident, even though the exact content of the articles may not be completely identical.
Since we can have multiple duplicates for every incident, it makes sense to cast
this as a clustering problem, where ideally, two articles are in the same cluster if
they describe the same incident, and articles in different clusters describe different
incidents.
4.2.1 Clustering using k-means
K-means is one of the more popular clustering algorithms. Like most clustering
algorithms, it is completely unsupervised.
Mathematical formulation
Given 𝑛 points (π‘₯1, π‘₯2, . . . , π‘₯𝑛) where each π‘₯𝑖 is a 𝑑-dimensional vector, we want to assign each π‘₯𝑖 to a set 𝑆𝑗 where 1 ≤ 𝑗 ≤ π‘˜, such that 𝑆1 ∪ 𝑆2 ∪ . . . ∪ π‘†π‘˜−1 ∪ π‘†π‘˜ = {π‘₯1, π‘₯2, . . . , π‘₯𝑛} and the sets 𝑆𝑗 are pairwise disjoint (that is, each π‘₯𝑖 belongs to exactly one cluster).
We can mathematically formulate this task as trying to find the set 𝑆 = {𝑆1 , 𝑆2 , . . . , π‘†π‘˜ }
that minimizes,
π‘˜ ∑︁
∑︁
||π‘₯ − πœ‡π‘– ||2
𝑖=1 π‘₯∈𝑆𝑖
Here, we can think of πœ‡π‘– as the representative element of the cluster 𝑖.
Minimizing the above expression with respect to πœ‡π‘– shows us that πœ‡π‘– must be equal
to the centroid of the data points assigned to the cluster 𝑖.
Description
Every iteration of k-means clustering involves two important steps,
1. Assignment step: Each point π‘₯1, π‘₯2, . . . , π‘₯𝑛 must be assigned to one of π‘˜ clusters. This is done by assigning each point to the cluster with the nearest representative point.

$$S_i^{(t)} = \{x_p : \|x_p - \mu_i^{(t)}\|^2 \le \|x_p - \mu_j^{(t)}\|^2 \ \forall j,\ 1 \le j \le k\}$$
2. Update step: Once every point is assigned to a cluster, we need to update the
centroid of each cluster.
Each new centroid is given by the expression,

$$\mu_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$$
The above two steps are repeated until convergence. Every point π‘₯𝑖 is initially
assigned to a random cluster.
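A minimal numpy sketch of these two steps is shown below; the points are assumed to be rows of a matrix X, and the corner case of a cluster becoming empty is not handled in this sketch.

```python
import numpy as np

def kmeans(X, k, num_iterations=100, seed=0):
    """Plain k-means: alternate the update and assignment steps."""
    rng = np.random.default_rng(seed)
    # Every point is initially assigned to a random cluster.
    assignments = rng.integers(0, k, size=len(X))
    for _ in range(num_iterations):
        # Update step: each centroid is the mean of the points assigned to it
        # (empty clusters are not handled here).
        centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        # Assignment step: assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
    return assignments, centroids
```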
4.2.2 Clustering using HCA
Hierarchical clustering is a method of clustering that tries to build a hierarchy of clusters. It circumvents the problem of initializing data points to clusters by starting with every point as its own cluster of size 1, and then collapsing similar clusters to obtain larger clusters.
Formulation
As with k-means, the first step in this process involves vectorizing the news article.
Once we have a vector representation of each news article, we can run HCA on the
resulting vectors to obtain clusters, such that all vectors in a cluster are similar to
each other.
Figure 4-2: Example clustering of documents using HCA
In Hierarchical Agglomerative Clustering, every article starts off as its own cluster – that is, if we start with 𝑛 articles, we initially have 𝑛 different clusters. In every iteration of the algorithm, we attempt to merge the two closest clusters. Note that we can use different inter-cluster similarity metrics. We describe them in greater detail below,
βˆ™ Single-link clustering: The distance between two clusters is the distance
between their most similar members. This inter-cluster similarity metric’s main
disadvantage is that it disregards the fact that the furthest points in clusters
that are going to be merged may in fact be really far apart.
π‘‘πΆπΏπ‘ˆ 𝑆 (𝐴, 𝐡) = min 𝑑(π‘₯, 𝑦)
π‘₯∈𝐴,𝑦∈𝐡
βˆ™ Complete-link clustering: The distance between two clusters is the distance
between their most dissimilar members.
π‘‘πΆπΏπ‘ˆ 𝑆 (𝐴, 𝐡) = max 𝑑(π‘₯, 𝑦)
π‘₯∈𝐴,𝑦∈𝐡
βˆ™ Average-link clustering: The distance between two clusters is the average of
the distances between all pairs of points in the two clusters. Mathematically,
π‘‘πΆπΏπ‘ˆ 𝑆 (𝐴, 𝐡) =
∑︁ ∑︁
1
𝑑(π‘₯, 𝑦)
|𝐴| · |𝐡| π‘₯∈𝐴 𝑦∈𝐡
52
Pseudocode

HAC(π‘₯1, π‘₯2, . . . , π‘₯𝑛, 𝑑)
    Initialize 𝐢
    for 𝑖 = 1 to 𝑛
        𝐢[𝑖] = π‘₯𝑖
    maxClusterSimilarity = ComputeMaxClusterSimilarity(𝐢)
    while maxClusterSimilarity < 𝑑
        𝑖, 𝑗 = FindMostSimilarClusters(𝐢)
        MergeClusters(𝐢, 𝑖, 𝑗)
        maxClusterSimilarity = ComputeMaxClusterSimilarity(𝐢)

Figure 4-3: Pseudocode of HCA algorithm, with termination inter-cluster similarity threshold 𝑑
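In practice, this kind of threshold-based agglomerative clustering can be run with SciPy; the sketch below assumes the articles have already been turned into dense tf-idf vectors (one row per article), and the threshold value is purely illustrative.

```python
from scipy.cluster.hierarchy import linkage, fcluster

def hca_cluster(X, threshold=0.8):
    """Average-link agglomerative clustering of article vectors X.

    Clusters are merged until the closest pair of remaining clusters is
    further apart (in cosine distance) than `threshold`.
    """
    Z = linkage(X, method="average", metric="cosine")
    # fcluster cuts the resulting dendrogram at the given distance.
    return fcluster(Z, t=threshold, criterion="distance")
```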
4.2.3 Evaluation
We run our two clustering algorithms on two datasets – a set of news articles scraped
from Google news, and our dataset of adulteration-related articles.
We first present results obtained on the Google news corpus. We use tf-idf vector
representations of the text, with and without keyword extraction. We chose 20 distinct incidents scraped from the landing page of news.google.com, and then tried
clustering the 76 obtained articles into 20 clusters.
Clustering algorithm    F-measure with keywords   F-measure w/o keywords
HCA (Complete-link)     0.83                      0.69
HCA (Single-link)       0.67                      0.25
HCA (Average-link)      0.87                      0.63
K-means                 0.84                      0.77

Table 4.3: F-measure of clustering algorithms with and w/o keyword extraction on Google News articles
We next present results obtained on adulteration-related articles. We run our experiments on two subsets of the adulteration article dataset at our disposal – one with 53 articles and another with 116 articles. Both times, we tried to cluster the dataset of articles into 15 clusters.
Clustering algorithm    F-measure with k/w extr.   F-measure w/o k/w extr.
HCA (Complete-link)     0.67                       0.43
HCA (Single-link)       0.62                       0.35
HCA (Average-link)      0.57                       0.47
K-means                 0.13                       0.65

Table 4.4: F-measure of clustering algorithms with and w/o keyword extraction on set of 53 adulteration-related articles
Clustering algorithm    F-measure with k/w extr.   F-measure w/o k/w extr.
HCA (Complete-link)     0.35                       0.19
HCA (Single-link)       0.13                       0.10
HCA (Average-link)      0.27                       0.11

Table 4.5: F-measure of clustering algorithms with and w/o keyword extraction on set of 116 adulteration-related articles
Most of the clustering algorithms seem to perform well on the Google News corpus – this may partially be due to the fact that a number of these articles come from very different domains (sports, business, current affairs), so distinguishing between them using word frequencies works quite well in practice. Adulteration articles, on the other hand, are probably quite similar at the word-token level, which means performance on the adulteration dataset is far worse. More clever feature representations probably exist, but we hypothesize that the best possible performance of these algorithms is not much better than that presented in Table 4.4 and Table 4.5.
4.3 Combining entity recognition and clustering
Both approaches that we described above are imperfect and include a lot of noise in their results – sometimes the NER algorithm extracts incorrect fields from the article, and sometimes the clustering algorithm inserts articles into the wrong cluster.
With the knowledge that articles in the same cluster should have the same entities
and articles with the same entities should belong to the same clusters, we claim that
precision of both clustering and entity recognition can be improved [3].
Before describing our approach in detail, we describe some tools needed to understand it better.
4.3.1 Mathematical tools
Coordinate ascent
Coordinate ascent is a technique used to optimize a multivariable function 𝐹(π‘₯) by maximizing along one dimension at a time.
Given a setting of variables π‘₯π‘˜ at the end of some iteration π‘˜, we can find the optimal setting of variables π‘₯π‘˜+1 at the end of iteration π‘˜ + 1 by computing, for each dimension 𝑖,

$$x^{k+1}_i = \arg\max_{y \in \mathbb{R}} f(x^{k+1}_1, x^{k+1}_2, \ldots, x^{k+1}_{i-1}, y, x^{k}_{i+1}, \ldots, x^{k}_n)$$
We then start off with some arbitrary initial vector π‘₯0 , and subsequently compute
π‘₯1 , π‘₯2 , . . ..
By doing line search, we get 𝐹 (π‘₯0 ) ≤ 𝐹 (π‘₯1 ) ≤ 𝐹 (π‘₯2 ) ≤ . . .. Note that this is
slightly different from gradient ascent where we attempt to modify all parameters in
one go, instead of in a sequential manner.
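A minimal sketch of coordinate ascent on an arbitrary function is shown below; a simple grid search stands in for the one-dimensional maximizer, and any line-search routine could be substituted.

```python
import numpy as np

def coordinate_ascent(f, x0, num_iterations=50, grid=np.linspace(-10, 10, 1001)):
    """Maximize f by optimizing one coordinate at a time.

    Each coordinate update is a grid search over candidate values; this is
    only a stand-in for a proper one-dimensional maximizer.
    """
    x = np.array(x0, dtype=float)
    for _ in range(num_iterations):
        for i in range(len(x)):
            candidates = np.tile(x, (len(grid), 1))
            candidates[:, i] = grid
            values = np.array([f(c) for c in candidates])
            x[i] = grid[values.argmax()]
    return x

# Example: maximizing f(x) = -(x[0] - 1)**2 - (x[1] + 2)**2 converges to (1, -2).
```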
Variational inference
For many models, computing the posterior is computationally intractable. To overcome this problem, we can try to find a distribution over latent variables that approximates the true posterior. We call this approximating distribution π‘ž, and we use it instead of the posterior to make predictions about the data.
The Kullback-Leibler divergence is a measure of the closeness of two distributions. Given two distributions 𝑝 and π‘ž, the KL divergence is defined as,

$$KL(q \,\|\, p) = \mathbb{E}_q\left[\log \frac{q(Z)}{p(Z)}\right]$$
In variational inference, we try to approximate a posterior distribution 𝑝(𝑍|π‘₯)
with a distribution π‘ž(𝑍) that is chosen out of a family of distributions. π‘ž is picked so as to minimize the KL divergence between 𝑝 and π‘ž.
Intuitively, this makes sense, because KL(π‘ž||𝑝) satisfies the following nice properties,

βˆ™ For all points where the value of the distribution π‘ž is high and 𝑝 is high, the contribution towards KL(π‘ž||𝑝) is low.

βˆ™ For all points where the value of the distribution π‘ž is high and 𝑝 is low, the contribution towards KL(π‘ž||𝑝) is high.

βˆ™ For all points where the value of the distribution π‘ž is low, the contribution towards KL(π‘ž||𝑝) is low, since we're computing the expectation of $\log \frac{q(Z)}{p(Z|x)}$ with respect to the distribution π‘ž.
It is important to note that 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) is not symmetric, and that 𝐾𝐿(𝑝(𝑧|π‘₯)||π‘ž(𝑧)) satisfies a number of the nice properties that 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) does. However, 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) is a lot easier to compute, since it involves taking expectations over the distribution π‘ž, which is much easier than taking expectations over the distribution 𝑝(𝑧|π‘₯).
Note that 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) can be written as,

$$\begin{aligned}
KL(q(z) \,\|\, p(z|x)) &= \mathbb{E}_q\left[\log \frac{q(Z)}{p(Z|x)}\right] \\
&= \mathbb{E}_q[\log q(Z)] - \mathbb{E}_q[\log p(Z|x)] \\
&= \mathbb{E}_q[\log q(Z)] - \mathbb{E}_q[\log p(Z, x)] + \log p(x) \\
&= -\left(\mathbb{E}_q[\log p(Z, x)] - \mathbb{E}_q[\log q(Z)]\right) + \log p(x)
\end{aligned}$$

Since 𝑝(π‘₯) does not depend on π‘ž in any way, we can think of minimizing 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) as simply minimizing the expression $\mathbb{E}_q[\log q(Z)] - \mathbb{E}_q[\log p(Z, x)]$.
In mean-field variational inference, we usually factorize the distribution π‘ž(𝑧1, . . . , π‘§π‘š) as the product of independent factors π‘ž(𝑧𝑗), that is,

$$q(z_1, \ldots, z_m) = \prod_{j=1}^{m} q(z_j)$$
With this factorization, we can express 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) as,

$$\begin{aligned}
KL(q(z) \,\|\, p(z|x)) &= \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z|x)] + \text{constant} \\
&= \sum_j \mathbb{E}_j[\log q(z_j)] - \mathbb{E}_q[\log p(z|x)] + \text{constant}
\end{aligned}$$

Here, $\mathbb{E}_j[\log q(z_j)]$ is the expectation of log π‘ž(𝑧𝑗) with respect to the distribution π‘ž(𝑧𝑗).
Further, we will use the chain rule to simplify the distribution 𝑝(𝑧1:π‘š|π‘₯1:𝑛) as,

$$p(z_{1:m} \mid x_{1:n}) = \prod_{j=1}^{m} p(z_j \mid z_{1:(j-1)}, x_{1:n})$$
Using this, we can further simplify our expression for 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) as,

$$KL(q(z) \,\|\, p(z|x)) = \sum_j \mathbb{E}_j[\log q(z_j)] - \sum_j \mathbb{E}_q[\log p(z_j \mid x, z_{1:(j-1)})]$$
We will use coordinate ascent to find optimal values of our variational factors, optimizing one factor at a time while keeping all other factors constant.
We can now split 𝐾𝐿(π‘ž(𝑧)||𝑝(𝑧|π‘₯)) into a sum of functions of π‘§π‘˜. Each of these functions can be expressed in the following form,

$$\mathcal{L}_k = \mathbb{E}_k[\log q(z_k)] - \int q(z_k)\, \mathbb{E}_{-k}[\log p(z_k \mid x, z_{1:(j-1)})]\, dz_k$$

Differentiating $\mathcal{L}_k$ with respect to π‘ž(π‘§π‘˜) and equating to 0 gives us,

$$\frac{d\mathcal{L}_k}{dq(z_k)} = \mathbb{E}_{-k}[\log p(z_k \mid z_{-k}, x)] - \log q(z_k) - 1 = 0$$
$$\Rightarrow q^*(z_k) \propto \exp\{\mathbb{E}_{-k}[\log p(z_k \mid z_{-k}, x)]\}$$
4.3.2 Task setup
News articles talking about the same event, intuitively, should talk about the same
entities. We utilize this intuition to aid our clustering task – articles belonging to the
same cluster should produce entities that are the same.
Before describing the approach in greater detail, we describe the notation used.
Notation
We assume that we have 𝑛 messages to cluster. We define the set of messages as
π‘₯ = {π‘₯1 , π‘₯2 , . . . , π‘₯𝑛 }.
Each record is a collection of fields. We define the set of all records to be 𝑅 – π‘…π‘˜ refers to the π‘˜th record. Each record field is referred to by its label 𝑙 – more precisely, we refer to the 𝑙th field of the π‘˜th record as π‘…π‘˜π‘™. We assume that |𝑅| = π‘š.
Each message π‘₯𝑖 maps to a single record with id 𝐴𝑖 . For all 𝑖 ∈ {1, 2, . . . , 𝑛},
𝐴𝑖 ∈ {1, 2, . . . , π‘š}. 𝐴 refers to the set of all alignments.
In addition, we define 𝑦 as the set of all labeling sequences. 𝑦𝑖 is the label sequence
of the message π‘₯𝑖 .
Formulation
In this approach, we want to find 𝑅, 𝐴 and 𝑦 that maximize Pr(𝑅, 𝐴, 𝑦|π‘₯), which can
be thought of as the probability of records, alignments and labelings given a set of
messages.
We express the probability Pr(𝑅, 𝐴, 𝑦|π‘₯) by the following mathematical expression,

$$\Pr(R, A, y \mid x) = \left(\prod_i \varphi_{SEQ}(x_i, y_i)\right) \left(\prod_l \varphi_{UNQ}(R^l)\right) \left(\prod_{i,l} \varphi_{POP}(x_i, y_i, R^l_{A_i})\right) \left(\prod_i \varphi_{CON}(x_i, y_i, R_{A_i})\right)$$
We explain the meanings of the above potential functions below,
1. πœ‘π‘†πΈπ‘„ : The sequential potential attempts to estimate the likelihood of a label
sequence 𝑦𝑖 given a message π‘₯𝑖 . The potential over a message label sequence
decomposes over pairs of labels as follows,
𝑇
πœ‘π‘†πΈπ‘„ (π‘₯, 𝑦) = exp{πœƒπ‘†πΈπ‘„
𝑓𝑆𝐸𝑄 (π‘₯, 𝑦)}
{οΈƒ
}οΈƒ
∑︁
𝑇
= exp πœƒπ‘†πΈπ‘„
𝑓𝑆𝐸𝑄_𝑇 𝐸𝑅𝑀 (π‘₯, 𝑦 𝑗 , 𝑦 𝑗+1 )
𝑗
Note that this formulation is exactly the same as the one used to compute
optimal label sequences using CRFs.
Here, 𝑓𝑆𝐸𝑄_𝑇 𝐸𝑅𝑀 is the dot product of a weight vector πœ†π‘†πΈπ‘„ and a feature
vector (with features similar to the ones described in section 4.1). The weights
πœ†π‘†πΈπ‘„ are learnt using a standard log likelihood objective function, and represent
59
the only weights in this model that are learnt.
2. πœ‘π‘ˆ 𝑁 𝑄 : In our clustering task, we want to prevent similar records from clustering
into multiple clusters. We do this by trying to prevent the same field of different
records from overlapping textually. Mathematically, we can express this as,
πœ‘π‘ˆ 𝑁 𝑄 (𝑅𝑙 ) =
∏︁
πœ‘π‘ˆ 𝑁 𝑄_𝑇 𝐸𝑅𝑀 (π‘…π‘˜π‘™ , π‘…π‘˜π‘™ ′ )
π‘˜ΜΈ=π‘˜′
where π‘…π‘˜π‘™ and π‘…π‘˜π‘™ ′ are the values of the field 𝑙 for records π‘…π‘˜ and π‘…π‘˜′ . We define
a πœ‘π‘ˆ 𝑁 𝑄_𝑇 𝐸𝑅𝑀 that discourages textual similarity between different field values
in a record,
𝑇
πœ‘π‘ˆ 𝑁 𝑄_𝑇 𝐸𝑅𝑀 (π‘…π‘˜π‘™ , π‘…π‘˜π‘™ ′ ) = exp{−πœƒπ‘†πΌπ‘€
𝑓𝑆𝐼𝑀 (π‘…π‘˜π‘™ , π‘…π‘˜π‘™ ′ )}
Here,
𝑓𝑆𝐼𝑀 (π‘…π‘˜π‘™ , π‘…π‘˜π‘™ ′ ) =
|π‘…π‘˜π‘™ ∩ π‘…π‘˜π‘™ ′ |
max(|π‘…π‘˜π‘™ |, |π‘…π‘˜π‘™ ′ |)
πœƒπ‘†πΌπ‘€ is a weight that can have different values for different 𝑙; we will explain in
our results section how we set πœƒπ‘†πΌπ‘€ in our different tasks.
3. πœ‘π‘ƒ 𝑂𝑃 : This potential attempts to capture similarity between a message sequence
π‘₯, a label sequence 𝑦 and a record field 𝑣. Mathematically,
𝑙
πœ‘π‘ƒ 𝑂𝑃 (π‘₯, 𝑦, 𝑅𝐴
= 𝑣) =
∑︁
max
π‘˜
𝑗
𝑙
πœ‘π‘ƒ 𝑂𝑃 _𝑇 𝐸𝑅𝑀 (π‘₯𝑗 , 𝑦 𝑗 , 𝑅𝐴
= π‘£π‘˜ )
|𝑣|
𝑙
πœ‘π‘ƒ 𝑂𝑃 _𝑇 𝐸𝑅𝑀 (π‘₯, 𝑦, 𝑅𝐴
= 𝑣) can be expressed as the sum of three terms,
βˆ™ 𝐼𝐷𝐹 (π‘₯𝑗 ) · ℐ[π‘₯𝑗 = 𝑣 π‘˜ ]
βˆ™ 𝛼𝐼𝐷𝐹 (π‘₯𝑗 ) · (ℐ[π‘₯𝑗−1 = 𝑣 π‘˜−1 ] + ℐ[π‘₯𝑗+1 = 𝑣 π‘˜+1 ])
βˆ™ ℐ[π‘₯𝑗 = 𝑣 π‘˜ and π‘₯ contains 𝑣]/|𝑣|
The πœ‘π‘ƒ 𝑂𝑃 score will be high if record field candidate 𝑣 of type 𝑙 overlaps textually with words in the sentence π‘₯ that are labeled with tag 𝑙. Note that
this function tolerates minor differences between a candidate record field value
and text within a message, since it looks at similarity at a token level. This is
superior to straight string matching which is not tolerant towards such minor
differences.
4. πœ‘πΆπ‘‚π‘ :
πœ‘πΆπ‘‚π‘ (π‘₯, 𝑦, 𝑅𝐴 ) =
⎧
βŽͺ
⎨1 πœ‘π‘ƒ 𝑂𝑃 (π‘₯, 𝑦, 𝑅𝑙 ) > 0 for all 𝑙
𝐴
βŽͺ
⎩0 otherwise
The πœ‘πΆπ‘‚π‘ score will be high if all fields in the record 𝑅𝐴 overlap textually
with words in the sentence π‘₯ with label sequence 𝑦. The πœ‘πΆπ‘‚π‘ score is a good
indicator of whether a particular record is a good candidate for a given sentence
π‘₯. Just as the πœ‘π‘ƒ 𝑂𝑃 score tries to soft-match a record value with a sentence,
the πœ‘πΆπ‘‚π‘ score tries to soft-match an entire record with a sentence.
We now proceed to approximate the posterior distribution Pr(𝑅, 𝐴, 𝑦|π‘₯) with a distribution 𝑄(𝑅, 𝐴, 𝑦), which can be expressed as,

$$\Pr(R, A, y \mid x) \approx Q(R, A, y) = \left(\prod_{k=1}^{K} \prod_l q(R^l_k)\right) \left(\prod_{i=1}^{n} q(A_i)\, q(y_i)\right)$$
Note that we make a number of independence assumptions for the distribution 𝑄 to make computations easier – the variables 𝑅, 𝐴 and 𝑦 are in fact not independent in the posterior distribution Pr(𝑅, 𝐴, 𝑦|π‘₯).
Optimal variational factors can be found by performing the following updates,
βˆ™ Labeling update:

$$\begin{aligned}
\ln q(y^j = l) &\propto \mathbb{E}_{Q/q(y^j)}\left[\ln \varphi_{SEQ}(x, y, j) + \ln \varphi_{POP}(x, y, R^l_A)\,\varphi_{CON}(x, y, R_A)\right] \\
&= \ln \varphi_{SEQ}(x, y, j) + \mathbb{E}_{Q/q(y^j)}\left[\ln \varphi_{POP}(x, y, R^l_A)\,\varphi_{CON}(x, y, R_A)\right] \\
&= \ln \varphi_{SEQ}(x, y, j) + \sum_{j' \neq j,\, l',\, v,\, z} q(A = z)\, q(y^{j'} = l')\, q(R^l_z = v)\, \ln\left[\varphi_{POP}(x, y, R^l_z = v)\,\varphi_{CON}(x, y, R^l_z = v)\right]
\end{aligned}$$
βˆ™ Alignment update:

$$\begin{aligned}
\ln q(A = z) &\propto \mathbb{E}_{Q/q(A)}\left[\ln \varphi_{SEQ}(x, y) + \ln \varphi_{POP}(x, y, R^l_A)\,\varphi_{CON}(x, y, R_A)\right] \\
&\propto \mathbb{E}_{Q/q(A)}\left[\ln \varphi_{POP}(x, y, R^l_A)\,\varphi_{CON}(x, y, R_A)\right] \\
&= \sum_{j, v, l} q(R^l_z = v)\, q(y^j = l)\, \ln\left[\varphi_{POP}(x, y, R^l_A)\,\varphi_{CON}(x, y, R_A)\right]
\end{aligned}$$
βˆ™ Record update:

$$\begin{aligned}
\ln q(R^l_k = v) &\propto \mathbb{E}_{Q/q(R^l_k)}\left[\sum_{k'} \ln \varphi_{UNQ}(R_{k'}, v) + \sum_i \ln\left[\varphi_{POP}(x_i, y_i, v)\,\varphi_{CON}(x_i, y_i, v)\right]\right] \\
&= \sum_{k \neq k',\, v'} q(R^l_{k'} = v')\, \ln \varphi_{UNQ}(v, v') + \sum_i q(A_i = k) \sum_j q(y^j_i = l)\, \ln\left[\varphi_{POP}(x, y, R^l_z = v, j)\,\varphi_{CON}(x, y, R^l_z = v, j)\right]
\end{aligned}$$
Note that this update is in fact expensive, since we need to iterate through all
possible messages to compute the new optimal π‘ž(π‘…π‘˜π‘™ = 𝑣) value.
We initialize π‘ž(𝑦) factors with marginals obtained from the CRF classifier; the
π‘ž(𝑅) and π‘ž(𝐴) factors are initialized to a uniform distribution with some noise.
We precompute the (π‘₯, 𝑙, 𝑣) tuples that produce non-zero πœ‘π‘†πΈπ‘„ and πœ‘πΆπ‘‚π‘ values. It is worth pointing out that figuring out the list of possible record values for each label 𝑙 is a non-trivial task, and deserves a better explanation. One heuristic method to compute record values for a label 𝑙 is to filter all words that have a probability of 0.1 or greater of being labeled with 𝑙, and then concatenate all such consecutive words together to obtain a full record value.
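A sketch of this heuristic, assuming per-token marginal probabilities from the CRF are available as dictionaries mapping labels to probabilities; the 0.1 threshold matches the value mentioned above.

```python
def candidate_values(tokens, marginals, label, threshold=0.1):
    """Heuristically propose record values for a given label.

    `marginals[i]` is assumed to map each label to the CRF marginal probability
    that token i carries it; consecutive tokens above the threshold are
    concatenated into a single candidate value.
    """
    candidates, current = [], []
    for token, probs in zip(tokens, marginals):
        if probs.get(label, 0.0) >= threshold:
            current.append(token)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates
```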
4.3.3 Datasets
We use a couple of different datasets to evaluate the performance of our system.
Besides using the adulteration-related article dataset, we also use a Twitter event
dataset.
βˆ™ Twitter event data: We use our system to extract entities and cluster tweets
that talk about music concerts that took place in New York. Our goal here is
to have all tweets that talk about the same concert map to the same cluster.
We try to extract the name of the artist and the venue of the concert from the
tweet.
βˆ™ Adulteration article data: We also evaluate our system on the original dataset we intended our system for. Given a set of articles that talk about food adulteration, we try to map news articles that talk about the same adulteration incident to the same cluster.
4.3.4 Evaluation
Here, we want to measure the accuracy of two separate, yet dependent tasks – record extraction and clustering. We present results obtained on both datasets – the Twitter event corpus and the adulteration-related article corpus.
βˆ™ We first show the performance of record extraction on the Twitter event dataset.
We annotate the corpus with a list of gold records. To measure recall, we use
a stable marriage algorithm between the list of distinct records extracted and
the gold list. We run two types of experiments here – one, where we assume
we have access to the gold list of candidates for every type of field, and another
where we determine a list of candidates for every field type from unstructured
text using a heuristic method.
Note that most of the system is unsupervised – the only parameters learnt from training data are the CRF parameters, which are trained on 1900 tweets. Our testing data consists of 240 tweets describing music events in New York. We tried extracting 15 records.
                        Precision   Recall
With gold records       0.61        0.58
Without gold records    0.44        0.42

Table 4.6: Recall and precision of record extraction on Twitter corpus
For comparison, we present the results of running a simple entity recognizer on
the same dataset. Note that in this task, we are trying to compute the precision
and recall of correctly tagging tokens of type “ARTIST” and “VENUE”, rather
than trying to extract entire records, which can be thought of as (artist, venue)
pairs.
Entity    Precision   Recall   F-measure
ARTIST    0.57        0.25     0.35
VENUE     0.85        0.88     0.87
Total     0.75        0.54     0.63

Table 4.7: Precision, recall and F1 of NER on Twitter event data
We now present the performance of our system on the adulteration dataset. We
try extracting records with three fields – food, adulterant and location.
                               Precision   Recall
Adulteration (53 articles)     0.40        0.38
Adulteration (116 articles)    0.47        0.23

Table 4.8: Recall and precision of record extraction on adulteration datasets
βˆ™ We look at clustering performance on two different subsets of our adulteration article dataset – a small one with few incidents (53 articles), and a larger one with more incidents (116 articles). For each dataset, we try classifying articles into 15 different clusters. To cluster adulteration-related articles, we use records consisting of food, adulterant and location of adulteration.
                            Precision   Recall   F-measure
Adulteration (subset)       0.51        0.49     0.50
Adulteration (complete)     0.17        0.39     0.24

Table 4.9: Recall, precision and F-measure of clustering using combined approach on adulteration dataset
There are a couple of reasons for the low performance:

– The performance of our clustering approach is still dependent on the underlying Named Entity Recognizer. Recall from Table 4.2 that predicting β€œADULTER” tags on unseen data is still fairly noisy – the F-measure of just recognizing adulterants is low (0.313).

For example, in an article about the adulteration of scotch, the sentence

At one bar, a mixture that included rubbing alcohol and caramel coloring was sold as scotch.

indicates that β€œscotch” was adulterated with β€œrubbing alcohol” and β€œcaramel coloring”. However, there exist instances where β€œscotch” appears in the training data as an adulterant, so πœ‘π‘†πΈπ‘„(scotch, FOOD) is low, which prevents the record (scotch, rubbing alcohol) from being extracted from that particular article.

With a more robust recognizer (perhaps trained on cleaner and less noisy data), we should be able to obtain superior clustering results.
– Using πœ‘π‘ˆ 𝑁 𝑄 to ensure uniqueness of clusters doesn’t work overly well in
practice. Finding a better way to disambiguate between similar records
and clusters remains an area of future work.
– We compute consistency potentials πœ‘πΆπ‘‚π‘ over 5-sentence stretches instead of over the entire article in the interest of efficiency. This seems reasonable in most cases – all fields of a record should occur in the article relatively close to each other. Finding an optimal value for the β€œparagraph” length may give us better results.
Chapter 5
Contributions and Future Work
5.1 Contributions
The main contributions of this thesis are as follows,
βˆ™ Developing an algorithm that can handle sparse data and an imbalance in the number of class labels to produce reasonable performance on the (food, adulterant) co-existence determination task, by casting the problem as a ranking problem instead of a standard classification problem.
βˆ™ Using existing query refinement techniques to retrieve articles related to adulteration. We presented two high-level approaches – one where we try to find
articles related to a specific incident, using our (food, adulterant) co-existence
detection algorithm, and another where we try to filter out adulteration-related
articles from a steady stream of news articles scraped from a generic news source
like Google News.
βˆ™ Implementing a system that combines record extraction and clustering to aid
in de-duplication. We tested our system on two datasets – a Twitter dataset
containing tweets describing music concerts in New York, and our standard
dataset of adulteration articles. We compared this approach to standard approaches used to solve each of these tasks individually.
5.2 Future Work
The following are major areas of future work,
βˆ™ Improving the quality of retrieval results, by making the retrieval methods described in Chapter 3 more robust and better tuned to the task of retrieving
articles related to adulteration.
βˆ™ Improving the performance of the combined extraction and clustering system described in Chapter 4. Currently, running our system with records containing 4 fields is prohibitively expensive, so scaling up the system is also an area of future work, especially if we want to make the system online. Clustering accuracy can be improved upon as well.
βˆ™ Building inference models that leverage the created dataset to better understand
the factors that go into an adulteration incident.
Bibliography
[1] Eugene Agichtein and Luis Gravano. Snowball: Extracting relations from large plain-text collections. In ACM DL, 2000.

[2] Andrew J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 1967.

[3] Edward Benson, Aria Haghighi, and Regina Barzilay. Event discovery in social media feeds. In ACL, 2011.

[4] Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD, 2001.

[5] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL, 2005.

[6] Thorsten Joachims. Transductive learning via spectral graph partitioning. In ICML, 2003.

[7] Christopher D. Manning, Prabhakar Raghavan, and Hinrich SchΓΌtze. Introduction to Information Retrieval, 2009.

[8] Walter Gander, Gene Golub, and Urs von Matt. A constrained eigenvalue problem, 1989.