Feature/Kernel Learning in Semi-supervised Scenarios
Kevin Duh (kevinduh@u.washington.edu)
UW 2007 Workshop on SSL for Language Processing

Agenda
1. Intro: Feature Learning in SSL
   1. Assumptions
   2. General algorithm/setup
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work

Semi-supervised learning (SSL): Three general assumptions
• How can unlabeled data help?
1. Bootstrap assumption:
   – Automatic labels on unlabeled data have sufficient accuracy
   – They can be incorporated into the training set
2. Low Density Separation assumption:
   – A classifier should not cut through a high-density region, because clusters of data likely come from the same class
   – Unlabeled data can help identify high/low density regions
3. Change of Representation assumption

Low Density Assumption
[Figure: labeled (+/-) and unlabeled (o) points; the green line cuts through a high-density region, the black line through a low-density region.]

Change of Representation Assumption
• Basic learning process:
  1. Define a feature representation of the data
  2. Map each sample into a feature vector
  3. Apply a vector classifier learning algorithm
• What if the original feature set is bad?
  – Then learning algorithms will have a hard time
    • Ex: high correlation among features
    • Ex: irrelevant features
  – Different learning algorithms deal with this differently
• Assumption: large amounts of unlabeled data can help us learn a better feature representation

Two-stage learning process
• Stage 1: Feature learning
  1. Define an original feature representation
  2. Using unlabeled data, learn a "better" feature representation
• Stage 2: Supervised training
  1. Map each labeled training sample into the new feature vector
  2. Apply a vector classifier learning algorithm (basic learning process)

[Diagram]
(Basic learning process) Labeled Data → Feature Vector → Classifier
(2-stage learning process) Unlabeled Data → Original Feature Vector → New Feature Vector; Labeled Data → New Feature Vector → Classifier

Philosophical reflection
• Machine learning is not magic
  – Human knowledge is encoded in the feature representation
  – The machine/statistics only learns the relative weight/importance of each feature from data
• Two-stage learning is not magic
  – It merely extends machine/statistical analysis to the feature "engineering" stage
  – Human knowledge is still needed in:
    • The original feature representation
    • Deciding how original features transform into new features (here is the research frontier!)
• Designing features and learning weights on features are really the same thing: it's just a division of labor between human and machine.

Wait, how is this different from …
• Feature selection/extraction?
  – This approach applies traditional feature selection to a larger, unlabeled dataset.
  – In particular, we apply unsupervised feature selection methods (semi-supervised feature selection is an unexplored area)
• Supervised learning?
  – Supervised learning is a critical component
• We can use all we know in feature selection & supervised learning here:
  – These components aren't new: the new thing here is the semi-supervised perspective.
  – So far there is little work on this in language processing (compared to auto-labeling): lots of research opportunities!

Research questions
• Given lots of unlabeled data and an original feature representation, how can we transform it into a better, new feature representation?
1. How to define "better"? (objective)
   – Ideally, "better" = "leads to higher classification accuracy", but this also depends on the downstream classifier algorithm
   – Is there a surrogate metric for "better" that does not require labels?
2. How to model the feature transformation? (modeling)
   – Mostly we use a linear transform: y = Ax
     • x = original feature vector; y = new feature vector; A = transform matrix to learn (see the sketch below)
3. Given the objective and model, how to find A? (algo)
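To make the two-stage process concrete, here is a minimal sketch in Python, assuming scikit-learn is available. PCA here is only a stand-in for stage-1 feature learning (any method that produces a transform y = Ax from unlabeled data would do), and the data is randomly generated purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.poisson(1.0, size=(1000, 50)).astype(float)  # e.g. unigram counts
X_labeled = rng.poisson(1.0, size=(20, 50)).astype(float)
y_labeled = rng.integers(0, 2, size=20)                        # politics vs. sports

# Stage 1: learn the feature transform (the matrix A) from unlabeled data only.
feature_learner = PCA(n_components=10).fit(X_unlabeled)

# Stage 2: map labeled data into the new feature space, then train as usual.
Z_labeled = feature_learner.transform(X_labeled)
clf = LogisticRegression(max_iter=1000).fit(Z_labeled, y_labeled)

# At test time, apply the same transform before classifying.
x_test = rng.poisson(1.0, size=(1, 50)).astype(float)
print(clf.predict(feature_learner.transform(x_test)))
```

The key point of the sketch is only the shape of the pipeline: the transform is fit on unlabeled data, and labeled (and test) samples are pushed through the same transform before supervised training.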
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
   1. Correlated features
   2. Irrelevant features
   3. Feature set is not sufficiently expressive
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work

Running Example: Topic classification
• Binary classification of documents by topic: Politics vs. Sports
• Setup:
  – Data: each document is vectorized so that each element represents a unigram and its count (see the sketch below)
    • Doc1: "The President met the Russian President"
    • Vocab: [The President met Russian talks strategy Czech missile]
    • Vector1: [2 2 1 1 0 0 0 0]
  – Classifier: some supervised training algorithm
• We will discuss 3 cases where the original feature set may be unsuitable for training
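A minimal sketch of this vectorization step, assuming the fixed vocabulary above and a naive lowercase/whitespace tokenizer (a real system would tokenize more carefully):

```python
# Unigram-count vectorization for the running example.
vocab = ["the", "president", "met", "russian", "talks", "strategy", "czech", "missile"]

def vectorize(doc, vocab):
    # Naive tokenization: lowercase and split on whitespace.
    tokens = doc.lower().split()
    return [tokens.count(word) for word in vocab]

doc1 = "The President met the Russian President"
print(vectorize(doc1, vocab))   # [2, 2, 1, 1, 0, 0, 0, 0]
```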
1. Highly correlated features
• Unigrams (features) in "Politics":
  – Republican, conservative, GOP
  – White, House, President
  – Iowa, primary
  – Israel, Hamas, Gaza
• Unigrams in "Sports":
  – game, ballgame
  – strike-out, baseball
  – NBA, Sonics
  – Tiger, Woods, golf
• OTHER EXAMPLES?
• Highly correlated features represent redundant information:
  – A document with "Republican" isn't necessarily more about politics than a document with "Republican" and "conservative."
  – Some classifiers are hurt by highly correlated (redundant) features through "double-counting", e.g. Naïve Bayes and other generative models

2. Irrelevant features
• Unigrams in both "Politics" & "Sports":
  – the, a, but, if, while (stop words)
  – win, lose (topic-neutral words)
  – Bush (politician or sports player)
• Very rare unigrams in either topic:
  – Wazowski, Pitcairn (named entities)
  – Brittany Speers (typos, noisy features)
• OTHER EXAMPLES?
• Irrelevant features unnecessarily expand the "search space". They should get zero weight, but the classifier may not do so.

3. Feature set is not sufficiently expressive
• Original feature set:
  – 3 features: [leaders challenge group]
  – Politics or Sports?
    • "The leaders of the terrorist group issued a challenge"
    • "It's a challenge for the cheer leaders to take a group picture"
  – Different topics have similar (or the same) feature vectors
• A more expressive (discriminative) feature set:
  – 5 features: [leader challenge group cheer terrorist]
  – Bigrams: ["the leaders" "cheer leaders" …]
• Vector-based linear classifiers simply don't work
  – The data is "inseparable" by the classifier
• Non-linear classifiers (e.g. kernel methods, decision trees, neural networks) automatically learn combinations of features
  – But there is still no substitute for a human expert thinking of features

Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work

Dealing with correlated and/or irrelevant features
• Feature clustering methods:
  – Idea: cluster features so that similarly distributed features are mapped together
  – Algorithms:
    • Word clustering (using the Brown algorithm, spectral clustering, k-means, …)
    • Latent Semantic Analysis
• Principal Components Analysis (PCA):
  – Idea: map the original feature vector into directions of high variance
    • Words with similar counts across all documents should be deleted from the feature set
    • Words with largely varying counts *might* be discriminating features

Latent Semantic Analysis (LSA)
• Currently, a document is represented by a set of terms (words).
• LSA: represent a document by a set of "latent semantic" variables, which are coarser-grained than terms.
  – A kind of "feature clustering"
  – Latent semantics could cluster together synonyms or similar words, but in general it may be difficult to say "what" a latent semantic dimension corresponds to.
  – Strengths: data-driven, scalable algorithm (with good results in natural language processing and information retrieval!)

LSA Algorithm
1. Represent the data as a term-document matrix
2. Perform singular value decomposition (SVD)
3. Map documents to new feature vectors using the singular vectors with the largest singular values
• Term-document matrix:
  – t terms, d documents. The matrix is of size (t x d)
  – Each column is a document (represented by how many times each term occurs in that doc)
  – Each row is a term (represented by how many times it occurs in the various documents)

Term-document matrix X (rows = terms, columns = documents):

          doc1  doc2  doc3  doc4  doc5
  term1     5     5     0     0    10
  term2     4     3     0     1     7
  term3     0     0     7     6     3

• Distance between 2 documents: dot product between column vectors
  – d1' * d2 = 5*5 + 4*3 + 0*0 = 37
  – d1' * d3 = 5*0 + 4*0 + 0*7 = 0
  – X' * X is the matrix of document distances
• Distance between 2 terms: dot product between row vectors
  – t1 * t2' = 5*4 + 5*3 + 0*0 + 0*1 + 10*7 = 105
  – t1 * t3' = 5*0 + 5*0 + 0*7 + 0*6 + 10*3 = 30
  – X * X' is the matrix of term distances

Singular Value Decomposition (SVD)
• Eigendecomposition:
  – A is a square (symmetric) matrix
  – Av = ev (v is an eigenvector, e is an eigenvalue)
  – A = VEV'
    • E is the diagonal matrix of eigenvalues
    • V is the matrix of eigenvectors; V*V' = V'*V = I
• SVD:
  – X is a (t x d) rectangular matrix
  – Xd = s·t (s is a singular value, t and d are left/right singular vectors)
  – X = TSD'
    • S is a (t x d) matrix with the singular values on its diagonal and zeros elsewhere
    • T is (t x t), D is (d x d); T*T' = I; D*D' = I

SVD on term-document matrix
• X = T*S*D' = (t x t) * (t x d) * (d x d)

  T  = [ -0.78   0.25   0.56 ]     S = [ 15.34   0      0     0   0 ]
       [ -0.55   0.11  -0.82 ]         [  0      9.10   0     0   0 ]
       [ -0.27  -0.96   0.05 ]         [  0      0      0.85  0   0 ]

  D' = [ -0.40  -0.36  -0.12  -0.14  -0.81 ]
       [  0.18   0.17  -0.73  -0.62   0.04 ]
       [ -0.52   0.42   0.45  -0.56   0.09 ]
       [ -0.36   0.66  -0.37   0.51  -0.15 ]
       [ -0.62  -0.45  -0.30   0.08   0.54 ]

• Low-rank approximation: X ≈ (t x k) * (k x k) * (k x d), e.g. with k = 2:

  T_k = [ -0.78   0.25 ]    S_k = [ 15.34  0    ]    D_k' = [ -0.40  -0.36  -0.12  -0.14  -0.81 ]
        [ -0.55   0.11 ]          [  0     9.10 ]           [  0.18   0.17  -0.73  -0.62   0.04 ]
        [ -0.27  -0.96 ]

• X'*X = (TSD')'(TSD') = D(S'S)D'
• New document vector representation: d'*T_k*inv(S_k) (fold a new document's term-count vector d into the k-dimensional space)
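The decomposition above can be reproduced with a few lines of numpy. This is only a sketch; note that the signs of singular vectors are arbitrary, so they may come out flipped relative to the numbers shown on the slide.

```python
import numpy as np

# Term-document matrix X: t=3 terms (rows), d=5 documents (columns).
X = np.array([[5, 5, 0, 0, 10],
              [4, 3, 0, 1,  7],
              [0, 0, 7, 6,  3]], dtype=float)

# SVD: X = T @ diag(s) @ Dt, singular values s in decreasing order.
T, s, Dt = np.linalg.svd(X, full_matrices=False)
print(np.round(s, 2))            # approximately [15.34, 9.10, 0.85]

# Rank-k (k=2) LSA representation: keep the two largest singular values.
k = 2
T_k, s_k, Dt_k = T[:, :k], s[:k], Dt[:k, :]

# k-dimensional vectors for the 5 training documents (rows of D_k).
doc_vectors = Dt_k.T             # shape (5, 2)

# Folding in a new document d: new representation = d' * T_k * inv(S_k).
d_new = np.array([3, 1, 0], dtype=float)   # counts for [term1 term2 term3]
d_hat = d_new @ T_k @ np.diag(1.0 / s_k)
print(np.round(d_hat, 2))
```

The printed singular values should match the slide's numbers up to rounding, and doc_vectors are the new document features handed to the downstream classifier.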
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
   1. Links between features and kernels
   2. Algo2: Neighborhood kernels
   3. Algo3: Gaussian Mixture Model kernels
5. Related work

Philosophical reflection: Distances
• (Almost) everything in machine learning is about distances:
  – A new sample is labeled positive because it is "closer" in distance to the positive training examples
• How to define distance?
  – Conventional way: dot product of feature vectors
  – That's why the feature representation is important
• What if we directly learn the distance?
  – d(x1,x2): inputs are two samples, output is a number
  – Argument: it doesn't matter what features we define; all that matters is whether positive samples are close in distance to other positive samples, etc.

What is a kernel?
• A kernel is a distance (similarity) function:
  – Dot product in feature space
    • x1 = [a b]; x2 = [c d]; d(x1,x2) = x1'*x2 = ac + bd
  – Polynomial kernel:
    • Map the features into higher-order polynomial terms
    • x1 = [aa bb ab ba]; x2 = [cc dd cd dc]
    • d(x1,x2) = aacc + bbdd + 2abcd
    • Kernel trick: polynomial kernel = (ac + bd)^2
• When we define a kernel, we are implicitly defining a feature space
  – Recall that in SVMs, kernels allow classification in spaces that are non-linear w.r.t. the original features

Kernel Learning in Semi-supervised scenarios
1. Learn a kernel function k(x1,x2) from unlabeled + labeled data
2. Use this kernel when training a supervised classifier on the labeled data

[Diagram]
(2-stage learning process) Unlabeled Data + Kernel → (kernel-learning algo) → New Kernel; Labeled Data + New Kernel → Kernel-based Classifier

Neighborhood kernel (1/3)
• Idea: the distance between x1 and x2 is based on the distances between their neighbors
• Below are samples in feature space
  – Would you say the distances d(x1,x2) and d(x2,x7) are equivalent?
  – Or does it seem like x2 should be closer to x1 than x3?
[Figure: scatter plot of samples x1 through x11 in feature space.]

Neighborhood kernel (2/3)
• Analogy: "I am close to you if you are in my clique of close friends."
  – Neighborhood kernel k(x1,x2): average distance among the neighbors of x1 and x2
• E.g. x1 = [1]; x2 = [6]; x3 = [5]; x4 = [2]
  – Original distance: x2 - x1 = 6 - 1 = 5
  – Neighbor distance = 4
    • The closest neighbor to x1 is x4; the closest neighbor to x2 is x3
    • [(x2-x1) + (x3-x1) + (x2-x4) + (x3-x4)]/4 = [5+4+4+3]/4 = 4

Neighborhood kernel (3/3)
• Mathematical form (averaging a base kernel over the two neighborhoods):
  k_nbhd(x1, x2) = ( Σ_{a in N(x1)} Σ_{b in N(x2)} k_base(a, b) ) / ( |N(x1)| · |N(x2)| ), where N(x) is x together with its nearest neighbors
• Design questions:
  – What's the base distance? What's the distance used for finding close neighbors?
  – How many neighbors to use?
• Important: when do you expect neighborhood kernels to work? What assumptions does this method employ?

Gaussian Mixture Model Kernels
1. Perform a (soft) clustering of your labeled + unlabeled dataset (by fitting a Gaussian mixture model)
[Figure: scatter plot of samples x1 through x11.]
2. The distance between two samples is defined by whether they are in the same cluster:
   K(x,y) = prob[x in cluster1]*prob[y in cluster1]
          + prob[x in cluster2]*prob[y in cluster2]
          + prob[x in cluster3]*prob[y in cluster3] + …
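A minimal sketch of both ideas, assuming numpy and scikit-learn. The neighborhood function reproduces the slide's toy arithmetic using distances, exactly as in the worked example (a true neighborhood kernel would average a base kernel/similarity over the neighborhoods instead); the GMM kernel uses GaussianMixture's posterior cluster probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def neighborhood_distance(i, j, X, n_neighbors=1):
    # Average base distance between the neighborhoods of samples i and j,
    # where a neighborhood is the sample plus its n_neighbors nearest points.
    base = np.abs(X[:, None] - X[None, :])       # pairwise |x_a - x_b| (1-D data)
    Ni = np.argsort(base[i])[: n_neighbors + 1]  # i and its nearest neighbors
    Nj = np.argsort(base[j])[: n_neighbors + 1]
    return base[np.ix_(Ni, Nj)].mean()

# Toy example from the slides: x1=1, x2=6, x3=5, x4=2.
X_toy = np.array([1.0, 6.0, 5.0, 2.0])
print(neighborhood_distance(0, 1, X_toy))        # 4.0, vs. base distance |6-1| = 5

def gmm_kernel(X, n_clusters=3, random_state=0):
    # K(x,y) = sum_c prob[x in cluster c] * prob[y in cluster c],
    # with soft cluster memberships estimated from labeled + unlabeled data.
    gmm = GaussianMixture(n_components=n_clusters, random_state=random_state).fit(X)
    P = gmm.predict_proba(X)                     # shape (n_samples, n_clusters)
    return P @ P.T                               # kernel (Gram) matrix

X_all = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
K = gmm_kernel(X_all)
print(K.shape)                                   # (40, 40)
```

The design choices the slides ask about show up directly as the parameters here: the base distance, the number of neighbors, and the number of mixture components.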
Seeing relationships
• We've presented various methods, but at some level they are all the same…
• Gaussian mixture kernel & neighborhood kernel?
• Gaussian mixture kernel & feature clustering?
• Can you define a feature learning version of neighborhood kernels?
• Can you define a kernel learning version of LSA?

Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work

Related Work: Feature Learning
• Look into the machine learning & signal processing literature for unsupervised feature selection/extraction
• Discussed here:
  – Two-stage learning:
    • C. Oliveira, F. Cozman, and I. Cohen. "Splitting the unsupervised and supervised components of semi-supervised learning." ICML 2005 Workshop on Learning with Partially Classified Training Data, 2005.
  – Latent Semantic Analysis:
    • S. Deerwester et al. "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science (41), 1990.
    • T. Hofmann. "Probabilistic Latent Semantic Analysis." Uncertainty in AI, 1999.
  – Feature clustering:
    • W. Li and A. McCallum. "Semi-supervised sequence modeling with syntactic topic models." AAAI 2005.
    • Z. Niu, D.-H. Ji, and C. L. Tan. "A semi-supervised feature clustering algorithm with application to word sense disambiguation." HLT 2005.
• Semi-supervised feature selection is a new area to explore:
  – Z. Zhao and H. Liu. "Spectral Feature Selection for Supervised and Unsupervised Learning." ICML 2007.

Related Work: Kernel learning
• Beware that many kernel learning algorithms are transductive (they don't apply to samples not originally in your training data):
  – N. Cristianini et al. "On kernel target alignment." NIPS 2002.
  – G. Lanckriet et al. "Learning a kernel matrix with semi-definite programming." ICML 2002.
  – C. Ong, A. Smola, and R. Williamson. "Learning the kernel with hyperkernels." Journal of Machine Learning Research (6), 2005.
• Discussed here:
  – Neighborhood kernel and others:
    • J. Weston et al. "Semi-supervised protein classification using cluster kernels." Bioinformatics 21(15), 2005.
  – Gaussian mixture kernel and KernelBoost:
    • T. Hertz, A. Bar-Hillel, and D. Weinshall. "Learning a Kernel Function for Classification with Small Training Samples." ICML 2006.
• Other keywords: "distance learning", "distance metric learning"
  – E. Xing, M. Jordan, and S. Russell. "Distance metric learning with applications to clustering with side-information." NIPS 2002.

Related work: SSL
• Good survey article:
  – Xiaojin Zhu. "Semi-supervised learning literature survey." University of Wisconsin-Madison CS Technical Report.
• Graph-based SSL methods are popular. Some can be related to kernel learning.
• SSL on sequence models (of particular importance to NLP):
  – J. Lafferty, X. Zhu, and Y. Liu. "Kernel Conditional Random Fields: Representation and Clique Selection." ICML 2004.
  – Y. Altun, D. McAllester, and M. Belkin. "Maximum margin semi-supervised learning for structured variables." NIPS 2005.
  – U. Brefeld and T. Scheffer. "Semi-supervised learning for structured output variables." ICML 2006.
• Other promising methods:
  – Fisher kernels and features from generative models:
    • T. Jaakkola and D. Haussler. "Exploiting generative models in discriminative classifiers." NIPS 1998.
    • A. Holub, M. Welling, and P. Perona. "Exploiting unlabelled data for hybrid object classification." NIPS 2005 Workshop on Inter-class Transfer.
    • A. Fraser and D. Marcu. "Semi-supervised word alignment." ACL 2006.
  – Multi-task learning formulation:
    • R. Ando and T. Zhang. "A High-Performance Semi-Supervised Learning Method for Text Chunking." ACL 2006.
    • J. Blitzer, R. McDonald, and F. Pereira. "Domain adaptation with structural correspondence learning." EMNLP 2006.

Summary
• Feature learning and kernel learning are two sides of the same coin
• Feature/kernel learning in SSL consists of a 2-stage approach:
  – (1) learn a better feature representation/kernel from unlabeled data
  – (2) train a traditional classifier on labeled data
• 3 cases of inadequate feature representation: correlated, irrelevant, not expressive
• Particular algorithms presented:
  – LSA, neighborhood kernels, GMM kernels
• Many related areas, many algorithms, but the underlying assumptions may be very similar