Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data
Frank Lin and William W. Cohen
School of Computer Science, Carnegie Mellon University
KDD-MLG 2011, 2011-08-21, San Diego, CA, USA

Overview
• Graph-based SSL
• The Problem with Text Data
• Implicit Manifolds
– Two GSSL Methods
– Three Similarity Functions
• A Framework
• Results
• Related Work

Graph-based SSL
• Graph-based semi-supervised learning methods can often be viewed as methods for propagating labels along the edges of a graph.
• Naturally, they are a good fit for network data.
[Figure: a graph with one node labeled class A and one labeled class B; how do we label the rest of the nodes in the graph?]

The Problem with Text Data
• Documents are often represented as feature vectors of words:
[Figure: three example documents and their word-count feature vectors]
Doc 1: "The importance of a Web page is an inherently subjective matter, which depends on the readers…"
Doc 2: "In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…"
Doc 3: "You're not cool just because you have a lot of followers on twitter, get over yourself…"

        cool  web  search  make  over  you
Doc 1:    0    4      8      2     5     3
Doc 2:    0    8      7      4     3     2
Doc 3:    1    0      0      0     1     2

The Problem with Text Data
• Feature vectors are often sparse: mostly zeros, since any document contains only a small fraction of the vocabulary.
• But the similarity matrix is not: mostly non-zero, since any two documents are likely to have a word in common.
[Figure: the sparse document-feature table from the previous slide next to its mostly dense pair-wise similarity matrix (sample entries 27, 125, 23)]

The Problem with Text Data
• A similarity matrix is the input to many GSSL methods:
– O(n^2) time to construct
– O(n^2) space to store
– > O(n^2) time to operate on
• Too expensive! Does not scale up to big datasets!
• How can we apply GSSL methods efficiently?

The Problem with Text Data
• Solutions:
1. Make the matrix sparse: a lot of cool work has gone into how to do this (if you do SSL on large-scale non-network data, there is a good chance you are using this already).
2. Implicit manifolds: this is what we'll talk about!

Implicit Manifolds
• "What do you mean by…?"
• A pair-wise similarity matrix is a manifold under which the data points "reside".
• It is implicit because we never explicitly construct the manifold (the pair-wise similarity matrix), although the results we get are exactly the same as if we did.
• "Sounds too good. What's the catch?"

Implicit Manifolds
• Two requirements for using implicit manifolds on text(-like) data:
1. The GSSL method can be implemented with matrix-vector multiplications.
2. The dense similarity matrix can be decomposed into a product of sparse matrices, so that multiplying it with a vector becomes a series of sparse matrix-vector multiplications.
• As long as these are met, we can obtain exactly the same results without ever constructing a similarity matrix!
• Let's look at some specifics, starting with two GSSL methods.

Two GSSL Methods
• Harmonic functions method (HF)
• MultiRankWalk (MRW)
• If you've never heard of them, you might have heard of one of their cousins:
– MRW's cousins (forward random walks, w/ restart): Partially labeled classification using Markov random walks (Szummer & Jaakkola 2001); Learning with local and global consistency (Zhou et al. 2004); Graph-based SSL as a generative model (He et al. 2007); Ghost edges for classification (Gallagher et al. 2008); and others…
– HF's cousins (backward random walks): Gaussian fields and harmonic functions classifier (Zhu et al. 2003); Weighted-vote relational neighbor classifier (Macskassy & Provost 2007); Weakly-supervised classification via random walks (Talukdar et al. 2008); Propagation via Adsorption (Baluja et al. 2008); Learning on diffusion maps (Lafon & Lee 2006); and others…

Two GSSL Methods
• Let's look at their implicit manifold qualification…
• In both of their iterative implementations, the core computation is a series of matrix-vector multiplications!

Three Similarity Functions
• How about a similarity function we actually use for text data?
• Inner product similarity: simple, good for binary features.
• Cosine similarity: often used for document categorization.
• Bipartite graph walk: can be viewed as a walk on the bipartite document-feature graph, or as feature reweighting by relative frequency; related to TF-IDF term weighting…
• Side note: perhaps all the cool work on matrix factorization could be useful here too…

Putting Them Together
• Example: HF + inner product similarity:
S = F F^T
• The iteration update becomes:
v^{t+1} = D^{-1} (F (F^T v^t))
(parentheses are important: computing F^T v^t first means we never form the dense F F^T), where the diagonal matrix D can be calculated as D(i,i) = d(i) with d = F F^T 1.
• Construction: O(n). Storage: O(n). Operation: O(n).

Putting Them Together
• Example: HF + cosine similarity:
S = N_c F F^T N_c
where N_c is the diagonal cosine normalizing matrix.
• The iteration update becomes:
v^{t+1} = D^{-1} (N_c (F (F^T (N_c v^t))))
where the diagonal matrix D can be calculated as D(i,i) = d(i) with d = N_c F F^T N_c 1.
• Construction: O(n). Storage: O(n). Operation: O(n).
• Compact storage: we don't need to keep a cosine-normalized version of the feature vectors.
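To make the two slides above concrete, here is a minimal sketch in Python of the HF iteration under both implicit manifolds. The use of numpy/scipy, the function names, the per-sweep clamping of seed labels, and the fixed iteration count are illustrative assumptions, not the authors' implementation; the matrix-vector grouping follows the parenthesization on the slides, so the dense S = F F^T (or N_c F F^T N_c) is never built.

```python
import numpy as np
import scipy.sparse as sp

def hf_inner_product(F, labels, iters=100):
    """Harmonic functions with implicit inner-product similarity S = F F^T.

    F: sparse n-by-m document-feature matrix (rows = documents).
    labels: length-n array with +1/-1 on seed documents, 0 elsewhere
    (run one vector per class for the multi-class case).
    S v is computed as F (F^T v), one sparse matrix-vector product at a time.
    Assumes every document has at least one feature, so d > 0.
    """
    n = F.shape[0]
    seeds = labels != 0
    d = np.asarray(F @ (F.T @ np.ones(n))).ravel()      # d = F F^T 1
    v = labels.astype(float)
    for _ in range(iters):
        v = np.asarray(F @ (F.T @ v)).ravel() / d       # v <- D^{-1} (F (F^T v))
        v[seeds] = labels[seeds]                        # keep seed labels clamped
    return v

def hf_cosine(F, labels, iters=100):
    """Same iteration with implicit cosine similarity S = N_c F F^T N_c."""
    n = F.shape[0]
    seeds = labels != 0
    # N_c is diagonal with one over each document's Euclidean norm; storing it
    # as a vector is the "compact storage" point: no normalized copy of F.
    nc = 1.0 / np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
    d = nc * np.asarray(F @ (F.T @ nc)).ravel()         # d = N_c F F^T N_c 1
    v = labels.astype(float)
    for _ in range(iters):
        v = nc * np.asarray(F @ (F.T @ (nc * v))).ravel() / d
        v[seeds] = labels[seeds]
    return v

# Toy usage: four documents over three features; docs 0 and 3 are seeds.
F = sp.csr_matrix([[1., 1., 0.], [1., 0., 1.], [0., 1., 1.], [0., 0., 1.]])
print(hf_inner_product(F, np.array([1, 0, 0, -1])))
print(hf_cosine(F, np.array([1, 0, 0, -1])))
```

Each sweep costs time proportional to the number of non-zeros in F, which for documents of bounded length is the O(n) per-iteration operation cost claimed on the slides.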
A Framework
• So what's the point?
1. Towards a principled way for researchers to apply GSSL methods to text data, and the conditions under which they can do this efficiently.
2. Towards a framework on which researchers can develop and discover new methods (and recognize old ones).
3. Building an SSL tool set: pick one that works for you, according to all the great work that has been done.

A Framework
• "Hmm… I have a large dataset with very few training labels, what should I try?"
• "How about MultiRankWalk with a low restart probability?"
• Choose your SSL method: Harmonic Functions, or MultiRankWalk.

A Framework
• "But the documents in my dataset are kinda long…"
• "Can't go wrong with cosine similarity!"
• …and pick your similarity function: Inner Product, Cosine Similarity, or Bipartite Graph Walk.

Results
• On 2 commonly used document categorization datasets.
• On 2 less common NP classification datasets.
• Goal: to show that these methods do work on large text datasets, with results consistent with what we know about these SSL methods and similarity functions.

Results
• Document categorization
[Figure: document categorization results]

Results
• Noun phrase classification dataset:
[Figure: noun phrase classification results]

Questions?

Additional Information

MRW: RWR for Classification
• We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks.

MRW: Seed Preference
• Obtaining labels for data points is expensive.
• We want to minimize the cost of obtaining labels.
• Observations:
– Some labels are inherently more useful than others.
– Some labels are easier to obtain than others.
• Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for. But are these labels also more useful than others?

Seed Preference
• Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label.
• The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data.
• We consider 3 preferences (sketched below):
– Random
– Link Count: nodes with the highest link counts make the list.
– PageRank: nodes with the highest PageRank scores make the list.
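A minimal sketch of the three seed preferences above, assuming the network is given as a sparse adjacency matrix. The helper name, the power-iteration PageRank with its damping value, and tie-breaking by argsort are all assumptions for illustration, not the authors' code.

```python
import numpy as np
import scipy.sparse as sp

def pick_seeds(A, k, preference="random", damping=0.85, iters=50, rng=None):
    """Return k node indices to send to the labeler, under one of the
    three seed preferences named above.

    A: sparse n-by-n adjacency matrix of the network.
    """
    n = A.shape[0]
    if preference == "random":                      # uniform at random
        rng = rng or np.random.default_rng()
        return rng.choice(n, size=k, replace=False)
    degree = np.asarray(A.sum(axis=1)).ravel()
    if preference == "link_count":                  # highest link counts make the list
        return np.argsort(-degree)[:k]
    if preference == "pagerank":                    # highest PageRank scores make the list
        out = np.where(degree > 0, degree, 1.0)     # guard against sink nodes
        W = sp.diags(1.0 / out) @ A                 # row-stochastic transition matrix
        r = np.full(n, 1.0 / n)
        for _ in range(iters):                      # plain power iteration
            r = (1 - damping) / n + damping * np.asarray(W.T @ r).ravel()
        return np.argsort(-r)[:k]
    raise ValueError(f"unknown preference: {preference}")
```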
MRW: The Question
• What really makes MRW and wvRN different?
• Network-based SSL often boils down to label propagation.
• MRW and wvRN represent two general propagation methods; note that they are called by many names:
– MRW: random walk with restart; regularized random walk; local & global consistency; personalized PageRank.
– wvRN: reverse random walk; random walk with sink nodes; hitting time; harmonic functions on graphs; iterative averaging of neighbors.
• "Great… but we still don't know why their behavior differs on these network datasets!"

MRW: The Question
• It's difficult to answer exactly why MRW does better with a smaller number of seeds.
• But we can gather probable factors from their propagation models (contrasted in the sketch below):
1. MRW is centrality-sensitive; wvRN is centrality-insensitive.
2. MRW has an exponential drop-off (a damping factor); wvRN has no drop-off / damping.
3. In MRW, propagation of different classes is done independently; in wvRN, propagation of different classes interacts.
• Centrality-sensitive: seeds have different scores, and not necessarily the highest.
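The two propagation models can be contrasted in a short sketch. This is not the authors' code: the damping value, iteration counts, and function names are assumptions, and both methods are reduced to their bare update rules (random walk with restart per class for MRW; clamped neighbor averaging for wvRN) to make the three factors above visible.

```python
import numpy as np
import scipy.sparse as sp

def mrw_scores(A, seed_sets, d=0.85, iters=50):
    """MultiRankWalk as random walk with restart: one independent
    ranking per class, restart mass on that class's seeds only."""
    n = A.shape[0]
    deg = np.asarray(A.sum(axis=1)).ravel()
    W = sp.diags(1.0 / np.maximum(deg, 1.0)) @ A        # row-stochastic walk matrix
    scores = []
    for seeds in seed_sets:                             # factor 3: classes do not interact
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)                     # restart distribution on seeds
        r = u.copy()
        for _ in range(iters):
            # factor 2: damping d gives the exponential drop-off away from seeds;
            # factor 1: scores follow walk structure, so seeds need not rank highest
            r = (1 - d) * u + d * np.asarray(W.T @ r).ravel()
        scores.append(r)
    return np.vstack(scores)                            # classify by argmax over classes

def wvrn_scores(A, seed_sets, iters=50):
    """wvRN as iterative averaging of neighbors with clamped seeds:
    centrality-insensitive, no damping, classes share one distribution."""
    n, k = A.shape[0], len(seed_sets)
    deg = np.asarray(A.sum(axis=1)).ravel()
    P = sp.diags(1.0 / np.maximum(deg, 1.0)) @ A
    V = np.full((n, k), 1.0 / k)                        # class distribution per node
    clamp = np.zeros((n, k))
    mask = np.zeros(n, dtype=bool)
    for c, seeds in enumerate(seed_sets):
        clamp[seeds, c] = 1.0
        mask[seeds] = True
    V[mask] = clamp[mask]
    for _ in range(iters):
        V = np.asarray(P @ V)                           # pure neighbor average, no drop-off
        V[mask] = clamp[mask]                           # seeds stay clamped at their labels
    return V.T
```

Factors 1 and 2 show up in MRW's `(1 - d) * u` restart term and the damping `d`; factor 3 shows up in MRW looping over classes separately while wvRN updates one shared, clamped class distribution.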
MRW: The Question
• An example from a political blog dataset: MRW vs. wvRN scores for how politically conservative a blog is (seed labels were underlined on the original slide). Top-ranked blogs under each method:

MRW                                     wvRN
0.020  firstdownpolitics.com            1.000  neoconservatives.blogspot.com
0.019  neoconservatives.blogspot.com    1.000  jmbzine.com
0.017  strangedoctrines.typepad.com     1.000  strangedoctrines.typepad.com
0.017  jmbzine.com                      0.593  millers_time.typepad.com
0.013  presidentboxer.blogspot.com      0.585  rooksrant.com
0.011  decision08.blogspot.com          0.568  purplestates.blogspot.com
0.010  gopandcollege.blogspot.com       0.553  ikilledcheguevara.blogspot.com
0.010  charlineandjamie.com             0.540  restoreamerica.blogspot.com
0.008  marksteyn.com                    0.539  billrice.org
0.007  blackmanforbush.blogspot.com     0.529  kalblog.com
0.007  reggiescorner.blogspot.com       0.517  right-thinking.com
0.007  fearfulsymmetry.blogspot.com     0.517  charlineandjamie.com
0.006  quibbles-n-bits.com              0.514  crankylittleblog.blogspot.com
0.006  undercaffeinated.com             0.510  hasidicgentile.org
0.005  samizdata.net                    0.509  stealthebandwagon.blogspot.com
0.005  pennywit.com                     0.509  carpetblogger.com

• Exponential drop-off: MRW is much less sure about nodes further away from the seeds.
• Classes propagate independently: tom-hanna.org is both very likely a conservative and very likely a liberal blog (good or bad?).
• We still don't really understand it yet.

MRW: Related Work
• MRW is very much related to:
– "Learning with local and global consistency" (Zhou et al. 2004): similar formulation, different view.
– "Web content categorization using link information" (Gyongyi et al. 2006): RWR ranking as features to SVM.
– "Graph-based semi-supervised learning as a generative model" (He et al. 2007): random walk without restart, heuristic stopping.
• Seed preference is related to the field of active learning:
– Active learning chooses which data point to label next based on previous labels; the labeling is interactive.
– Seed preference is a batch labeling method.
• Authoritative seed preference is a good baseline for active learning on network data!

Results
• How much better is MRW using authoritative seed preference?
[Figure: y-axis: MRW F1 score minus wvRN F1 score; x-axis: number of seed labels per class]
• The gap between MRW and wvRN narrows with authoritative seeds, but the differences are still prominent on some datasets with small numbers of seed labels.

A Web-Scale Knowledge Base
• Read the Web (RtW) project: build a never-ending system that learns to extract information from unstructured web pages, resulting in a knowledge base of structured information.

Noun Phrase and Context Data
• As a part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced:
– NP-context co-occurrence data
– NP-NP co-occurrence data
• These datasets can be treated as graphs.

Noun Phrase and Context Data
• NP-context data. Example: "… know that drinking pomegranate juice may not be a bad …" yields the NP "pomegranate juice" with the before-context "know that drinking _" and the after-context "_ may not be a bad".
[Figure: NP-context bipartite graph linking NPs (pomegranate juice, JELL-O, Jagermeister) to contexts ("know that drinking _", "_ may not be a bad", "_ is made from", "_ promotes responsible"), with co-occurrence counts (e.g. 3, 2, 5, 8, 2, 1) as edge weights]

Noun Phrase and Context Data
• NP-NP data. Example: "… French Constitution Council validates burqa ban …" yields the co-occurring NPs "French Constitution Council" and "burqa ban".
• Context can be used for weighting edges or making a more complex graph.
[Figure: NP-NP graph over NPs such as veil, French Constitution Council, burqa ban, hot pants, French Court, Jagermeister, JELL-O]

Noun Phrase Categorization
• We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs (see the sketch at the end of this document).
• Challenges:
– Large, noisy dataset (10m NPs, 8.6m contexts from 500m web pages).
– What's the right function for NP-NP categorical similarity?
– Which learned category assignments should we "promote" to the knowledge base?
– How do we evaluate it?

Noun Phrase Categorization
• Preliminary experiment:
– Small subset of the NP-context data: 88k NPs, 99k contexts.
– Find the category "city".
– Start with a handful of seeds.
– Ground truth set of 2,404 city NPs created using …
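As a concrete reading of "MRW with path folding" on this data, here is a minimal sketch for ranking NPs for a single category such as "city". It assumes the NP-context counts are in a sparse matrix C and that every NP occurs with at least one context; the function name and parameters are illustrative, not the authors' implementation. The folded NP-NP graph C C^T is never materialized: one walk step NP -> context -> NP is computed as C (C^T x), the same implicit-manifold trick as before.

```python
import numpy as np
import scipy.sparse as sp

def mrw_path_folding(C, seed_nps, d=0.85, iters=50):
    """Rank NPs for one category with MRW on the folded NP-context graph.

    C: sparse (num NPs x num contexts) co-occurrence count matrix.
    seed_nps: indices of the handful of seed NPs for the category.
    """
    n = C.shape[0]
    fold_deg = np.asarray(C @ (C.T @ np.ones(n))).ravel()   # row sums of S = C C^T
    u = np.zeros(n)
    u[seed_nps] = 1.0 / len(seed_nps)                       # restart on the seed NPs
    r = u.copy()
    for _ in range(iters):
        # one column-stochastic step S D^{-1} r, computed implicitly as
        # C (C^T (r / fold_deg)); r stays a probability distribution over NPs
        r = (1 - d) * u + d * np.asarray(C @ (C.T @ (r / fold_deg))).ravel()
    return r    # high-scoring NPs are candidates to promote for the category
```

Each iteration touches only the non-zeros of C, so the walk scales to the 10m-NP, 8.6m-context dataset without ever forming the dense NP-NP similarity matrix.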