Feature/Kernel Learning in Semi-supervised Scenarios
Kevin Duh (kevinduh@u.washington.edu)
UW 2007 Workshop on SSL for Language Processing

Agenda
1. Intro: Feature Learning in SSL
   1. Assumptions
   2. General algorithm/setup
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work

Semi-supervised learning (SSL): Three general assumptions
• How can unlabeled data help?
1. Bootstrap assumption:
   – Automatic labels on unlabeled data have sufficient accuracy
   – They can be incorporated into the training set
2. Low Density Separation assumption:
   – A classifier should not cut through a high-density region, because clusters of data likely come from the same class
   – Unlabeled data can help identify high/low density regions
3. Change of Representation assumption

Low Density Assumption
[Figure: labeled (+/-) and unlabeled (o) points; the green line cuts through a high-density region, the black line through a low-density region.]

Change of Representation Assumption
• Basic learning process:
  1. Define a feature representation of the data
  2. Map each sample into a feature vector
  3. Apply a vector classifier learning algorithm
• What if the original feature set is bad?
  – Then learning algorithms will have a hard time
    • Ex: high correlation among features
    • Ex: irrelevant features
  – Different learning algorithms deal with this differently
• Assumption: large amounts of unlabeled data can help us learn a better feature representation

Two-stage learning process
• Stage 1: Feature learning
  1. Define an original feature representation
  2. Using unlabeled data, learn a "better" feature representation
• Stage 2: Supervised training
  1. Map each labeled training sample into the new feature vector
  2. Apply a vector classifier learning algorithm (basic learning process)

[Diagram]
(Basic learning process) Labeled Data → Feature Vector → Classifier
(2-stage learning process) Unlabeled Data → Original Feature Vector → New Feature Vector; Labeled Data → New Feature Vector → Classifier

Philosophical reflection
• Machine learning is not magic
  – Human knowledge is encoded in the feature representation
  – The machine/statistics only learns the relative weight/importance of each feature from data
• Two-stage learning is not magic
  – It merely extends machine/statistical analysis to the feature "engineering" stage
  – Human knowledge is still needed in:
    • The original feature representation
    • Deciding how original features transform into new features (here is the research frontier!)
• Designing features and learning weights on features are really the same thing: it's just a division of labor between human and machine.

Wait, how is this different from …
• Feature selection/extraction?
  – This approach applies traditional feature selection to a larger, unlabeled dataset.
  – In particular, we apply unsupervised feature selection methods (semi-supervised feature selection is an unexplored area)
• Supervised learning?
  – Supervised learning is a critical component
• We can use all we know in feature selection & supervised learning here:
  – These components aren't new: the new thing here is the semi-supervised perspective.
  – So far there is little work on this in language processing (compared to auto-labeling): lots of research opportunities!

Research questions
• Given lots of unlabeled data and an original feature representation, how can we transform it into a better, new feature representation?
1. How to define "better"? (objective)
   – Ideally, "better" = "leads to higher classification accuracy", but this also depends on the downstream classifier algorithm
   – Is there a surrogate metric for "better" that does not require labels?
2. How to model the feature transformation? (modeling)
   – Mostly we use a linear transform: y = Ax
     • x = original feature vector; y = new feature vector; A = transform matrix to learn (see the sketch below)
3. Given the objective and model, how to find A? (algo)
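To make the two-stage process concrete, here is a minimal sketch in Python, assuming scikit-learn is available. PCA here is only a stand-in for stage-1 feature learning (any method that produces a transform y = Ax from unlabeled data would do), and the data is randomly generated purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.poisson(1.0, size=(1000, 50)).astype(float)  # e.g. unigram counts
X_labeled = rng.poisson(1.0, size=(20, 50)).astype(float)
y_labeled = rng.integers(0, 2, size=20)                        # politics vs. sports

# Stage 1: learn the feature transform (the matrix A) from unlabeled data only.
feature_learner = PCA(n_components=10).fit(X_unlabeled)

# Stage 2: map labeled data into the new feature space, then train as usual.
Z_labeled = feature_learner.transform(X_labeled)
clf = LogisticRegression(max_iter=1000).fit(Z_labeled, y_labeled)

# At test time, apply the same transform before classifying.
x_test = rng.poisson(1.0, size=(1, 50)).astype(float)
print(clf.predict(feature_learner.transform(x_test)))
```

The key point of the sketch is only the shape of the pipeline: the transform is fit on unlabeled data, and labeled (and test) samples are pushed through the same transform before supervised training.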
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
   1. Correlated features
   2. Irrelevant features
   3. Feature set is not sufficiently expressive
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work

Running Example: Topic classification
• Binary classification of documents by topic: Politics vs. Sports
• Setup:
  – Data: each document is vectorized so that each element represents a unigram and its count (see the sketch below)
    • Doc1: "The President met the Russian President"
    • Vocab: [The President met Russian talks strategy Czech missile]
    • Vector1: [2 2 1 1 0 0 0 0]
  – Classifier: some supervised training algorithm
• We will discuss 3 cases where the original feature set may be unsuitable for training
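A minimal sketch of this vectorization step, assuming the fixed vocabulary above and a naive lowercase/whitespace tokenizer (a real system would tokenize more carefully):

```python
# Unigram-count vectorization for the running example.
vocab = ["the", "president", "met", "russian", "talks", "strategy", "czech", "missile"]

def vectorize(doc, vocab):
    # Naive tokenization: lowercase and split on whitespace.
    tokens = doc.lower().split()
    return [tokens.count(word) for word in vocab]

doc1 = "The President met the Russian President"
print(vectorize(doc1, vocab))   # [2, 2, 1, 1, 0, 0, 0, 0]
```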
1. Highly correlated features
• Unigrams (features) in "Politics":
  – Republican, conservative, GOP
  – White, House, President
  – Iowa, primary
  – Israel, Hamas, Gaza
• Unigrams in "Sports":
  – game, ballgame
  – strike-out, baseball
  – NBA, Sonics
  – Tiger, Woods, golf
• OTHER EXAMPLES?
• Highly correlated features represent redundant information:
  – A document with "Republican" isn't necessarily more about politics than a document with "Republican" and "conservative."
  – Some classifiers are hurt by highly correlated (redundant) features through "double-counting", e.g. Naïve Bayes and other generative models

2. Irrelevant features
• Unigrams in both "Politics" & "Sports":
  – the, a, but, if, while (stop words)
  – win, lose (topic-neutral words)
  – Bush (politician or sports player)
• Very rare unigrams in either topic:
  – Wazowski, Pitcairn (named entities)
  – Brittany Speers (typos, noisy features)
• OTHER EXAMPLES?
• Irrelevant features unnecessarily expand the "search space". They should get zero weight, but the classifier may not do so.

3. Feature set is not sufficiently expressive
• Original feature set:
  – 3 features: [leaders challenge group]
  – Politics or Sports?
    • "The leaders of the terrorist group issued a challenge"
    • "It's a challenge for the cheer leaders to take a group picture"
  – Different topics have similar (or the same) feature vectors
• A more expressive (discriminative) feature set:
  – 5 features: [leader challenge group cheer terrorist]
  – Bigrams: ["the leaders" "cheer leaders" …]
• Vector-based linear classifiers simply don't work
  – The data is "inseparable" by the classifier
• Non-linear classifiers (e.g. kernel methods, decision trees, neural networks) automatically learn combinations of features
  – But there is still no substitute for a human expert thinking of features

Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work

Dealing with correlated and/or irrelevant features
• Feature clustering methods:
  – Idea: cluster features so that similarly distributed features are mapped together
  – Algorithms:
    • Word clustering (using the Brown algorithm, spectral clustering, k-means, …)
    • Latent Semantic Analysis
• Principal Components Analysis (PCA):
  – Idea: map the original feature vector into directions of high variance
    • Words with similar counts across all documents should be deleted from the feature set
    • Words with largely varying counts *might* be discriminating features

Latent Semantic Analysis (LSA)
• Currently, a document is represented by a set of terms (words).
• LSA: represent a document by a set of "latent semantic" variables, which are coarser-grained than terms.
  – A kind of "feature clustering"
  – Latent semantics could cluster together synonyms or similar words, but in general it may be difficult to say "what" a latent semantic dimension corresponds to.
  – Strengths: data-driven, scalable algorithm (with good results in natural language processing and information retrieval!)

LSA Algorithm
1. Represent the data as a term-document matrix
2. Perform singular value decomposition (SVD)
3. Map documents to new feature vectors using the singular vectors with the largest singular values
• Term-document matrix:
  – t terms, d documents. The matrix is of size (t x d)
  – Each column is a document (represented by how many times each term occurs in that doc)
  – Each row is a term (represented by how many times it occurs in the various documents)

Term-document matrix X (rows = terms, columns = documents):

          doc1  doc2  doc3  doc4  doc5
  term1     5     5     0     0    10
  term2     4     3     0     1     7
  term3     0     0     7     6     3

• Distance between 2 documents: dot product between column vectors
  – d1' * d2 = 5*5 + 4*3 + 0*0 = 37
  – d1' * d3 = 5*0 + 4*0 + 0*7 = 0
  – X' * X is the matrix of document distances
• Distance between 2 terms: dot product between row vectors
  – t1 * t2' = 5*4 + 5*3 + 0*0 + 0*1 + 10*7 = 105
  – t1 * t3' = 5*0 + 5*0 + 0*7 + 0*6 + 10*3 = 30
  – X * X' is the matrix of term distances

Singular Value Decomposition (SVD)
• Eigendecomposition:
  – A is a square (symmetric) matrix
  – Av = ev (v is an eigenvector, e is an eigenvalue)
  – A = VEV'
    • E is the diagonal matrix of eigenvalues
    • V is the matrix of eigenvectors; V*V' = V'*V = I
• SVD:
  – X is a (t x d) rectangular matrix
  – Xd = s·t (s is a singular value, t and d are left/right singular vectors)
  – X = TSD'
    • S is a (t x d) matrix with the singular values on its diagonal and zeros elsewhere
    • T is (t x t), D is (d x d); T*T' = I; D*D' = I

SVD on term-document matrix
• X = T*S*D' = (t x t) * (t x d) * (d x d)

  T  = [ -0.78   0.25   0.56 ]     S = [ 15.34   0      0     0   0 ]
       [ -0.55   0.11  -0.82 ]         [  0      9.10   0     0   0 ]
       [ -0.27  -0.96   0.05 ]         [  0      0      0.85  0   0 ]

  D' = [ -0.40  -0.36  -0.12  -0.14  -0.81 ]
       [  0.18   0.17  -0.73  -0.62   0.04 ]
       [ -0.52   0.42   0.45  -0.56   0.09 ]
       [ -0.36   0.66  -0.37   0.51  -0.15 ]
       [ -0.62  -0.45  -0.30   0.08   0.54 ]

• Low-rank approximation: X ≈ (t x k) * (k x k) * (k x d), e.g. with k = 2:

  T_k = [ -0.78   0.25 ]    S_k = [ 15.34  0    ]    D_k' = [ -0.40  -0.36  -0.12  -0.14  -0.81 ]
        [ -0.55   0.11 ]          [  0     9.10 ]           [  0.18   0.17  -0.73  -0.62   0.04 ]
        [ -0.27  -0.96 ]

• X'*X = (TSD')'(TSD') = D(S'S)D'
• New document vector representation: d'*T_k*inv(S_k) (fold a new document's term-count vector d into the k-dimensional space)
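The decomposition above can be reproduced with a few lines of numpy. This is only a sketch; note that the signs of singular vectors are arbitrary, so they may come out flipped relative to the numbers shown on the slide.

```python
import numpy as np

# Term-document matrix X: t=3 terms (rows), d=5 documents (columns).
X = np.array([[5, 5, 0, 0, 10],
              [4, 3, 0, 1,  7],
              [0, 0, 7, 6,  3]], dtype=float)

# SVD: X = T @ diag(s) @ Dt, singular values s in decreasing order.
T, s, Dt = np.linalg.svd(X, full_matrices=False)
print(np.round(s, 2))            # approximately [15.34, 9.10, 0.85]

# Rank-k (k=2) LSA representation: keep the two largest singular values.
k = 2
T_k, s_k, Dt_k = T[:, :k], s[:k], Dt[:k, :]

# k-dimensional vectors for the 5 training documents (rows of D_k).
doc_vectors = Dt_k.T             # shape (5, 2)

# Folding in a new document d: new representation = d' * T_k * inv(S_k).
d_new = np.array([3, 1, 0], dtype=float)   # counts for [term1 term2 term3]
d_hat = d_new @ T_k @ np.diag(1.0 / s_k)
print(np.round(d_hat, 2))
```

The printed singular values should match the slide's numbers up to rounding, and doc_vectors are the new document features handed to the downstream classifier.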
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
   1. Links between features and kernels
   2. Algo2: Neighborhood kernels
   3. Algo3: Gaussian Mixture Model kernels
5. Related work

Philosophical reflection: Distances
• (Almost) everything in machine learning is about distances:
  – A new sample is labeled positive because it is "closer" in distance to the positive training examples
• How to define distance?
  – Conventional way: dot product of feature vectors
  – That's why the feature representation is important
• What if we directly learn the distance?
  – d(x1,x2): inputs are two samples, output is a number
  – Argument: it doesn't matter what features we define; all that matters is whether positive samples are close in distance to other positive samples, etc.

What is a kernel?
• A kernel is a distance (similarity) function:
  – Dot product in feature space
    • x1 = [a b]; x2 = [c d]; d(x1,x2) = x1'*x2 = ac + bd
  – Polynomial kernel:
    • Map the features into higher-order polynomial terms
    • x1 = [aa bb ab ba]; x2 = [cc dd cd dc]
    • d(x1,x2) = aacc + bbdd + 2abcd
    • Kernel trick: polynomial kernel = (ac + bd)^2
• When we define a kernel, we are implicitly defining a feature space
  – Recall that in SVMs, kernels allow classification in spaces that are non-linear w.r.t. the original features

Kernel Learning in Semi-supervised scenarios
1. Learn a kernel function k(x1,x2) from unlabeled + labeled data
2. Use this kernel when training a supervised classifier on the labeled data

[Diagram]
(2-stage learning process) Unlabeled Data + Kernel → (kernel-learning algo) → New Kernel; Labeled Data + New Kernel → Kernel-based Classifier

Neighborhood kernel (1/3)
• Idea: the distance between x1 and x2 is based on the distances between their neighbors
• Below are samples in feature space
  – Would you say the distances d(x1,x2) and d(x2,x7) are equivalent?
  – Or does it seem like x2 should be closer to x1 than x3?
[Figure: scatter plot of samples x1 through x11 in feature space.]

Neighborhood kernel (2/3)
• Analogy: "I am close to you if you are in my clique of close friends."
  – Neighborhood kernel k(x1,x2): average distance among the neighbors of x1 and x2
• E.g. x1 = [1]; x2 = [6]; x3 = [5]; x4 = [2]
  – Original distance: x2 - x1 = 6 - 1 = 5
  – Neighbor distance = 4
    • The closest neighbor to x1 is x4; the closest neighbor to x2 is x3
    • [(x2-x1) + (x3-x1) + (x2-x4) + (x3-x4)]/4 = [5+4+4+3]/4 = 4

Neighborhood kernel (3/3)
• Mathematical form (averaging a base kernel over the two neighborhoods):
  k_nbhd(x1, x2) = ( Σ_{a in N(x1)} Σ_{b in N(x2)} k_base(a, b) ) / ( |N(x1)| · |N(x2)| ), where N(x) is x together with its nearest neighbors
• Design questions:
  – What's the base distance? What's the distance used for finding close neighbors?
  – How many neighbors to use?
• Important: when do you expect neighborhood kernels to work? What assumptions does this method employ?

Gaussian Mixture Model Kernels
1. Perform a (soft) clustering of your labeled + unlabeled dataset (by fitting a Gaussian mixture model)
[Figure: scatter plot of samples x1 through x11.]
2. The distance between two samples is defined by whether they are in the same cluster:
   K(x,y) = prob[x in cluster1]*prob[y in cluster1]
          + prob[x in cluster2]*prob[y in cluster2]
          + prob[x in cluster3]*prob[y in cluster3] + …
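A minimal sketch of both ideas, assuming numpy and scikit-learn. The neighborhood function reproduces the slide's toy arithmetic using distances, exactly as in the worked example (a true neighborhood kernel would average a base kernel/similarity over the neighborhoods instead); the GMM kernel uses GaussianMixture's posterior cluster probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def neighborhood_distance(i, j, X, n_neighbors=1):
    # Average base distance between the neighborhoods of samples i and j,
    # where a neighborhood is the sample plus its n_neighbors nearest points.
    base = np.abs(X[:, None] - X[None, :])       # pairwise |x_a - x_b| (1-D data)
    Ni = np.argsort(base[i])[: n_neighbors + 1]  # i and its nearest neighbors
    Nj = np.argsort(base[j])[: n_neighbors + 1]
    return base[np.ix_(Ni, Nj)].mean()

# Toy example from the slides: x1=1, x2=6, x3=5, x4=2.
X_toy = np.array([1.0, 6.0, 5.0, 2.0])
print(neighborhood_distance(0, 1, X_toy))        # 4.0, vs. base distance |6-1| = 5

def gmm_kernel(X, n_clusters=3, random_state=0):
    # K(x,y) = sum_c prob[x in cluster c] * prob[y in cluster c],
    # with soft cluster memberships estimated from labeled + unlabeled data.
    gmm = GaussianMixture(n_components=n_clusters, random_state=random_state).fit(X)
    P = gmm.predict_proba(X)                     # shape (n_samples, n_clusters)
    return P @ P.T                               # kernel (Gram) matrix

X_all = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
K = gmm_kernel(X_all)
print(K.shape)                                   # (40, 40)
```

The design choices the slides ask about show up directly as the parameters here: the base distance, the number of neighbors, and the number of mixture components.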
Seeing relationships
• We've presented various methods, but at some level they are all the same…
• Gaussian mixture kernel & neighborhood kernel?
• Gaussian mixture kernel & feature clustering?
• Can you define a feature learning version of neighborhood kernels?
• Can you define a kernel learning version of LSA?

Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work

Related Work: Feature Learning
• Look into the machine learning & signal processing literature for unsupervised feature selection/extraction
• Discussed here:
  – Two-stage learning:
    • C. Oliveira, F. Cozman, and I. Cohen. "Splitting the unsupervised and supervised components of semi-supervised learning." ICML 2005 Workshop on Learning with Partially Classified Training Data, 2005.
  – Latent Semantic Analysis:
    • S. Deerwester et al. "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science (41), 1990.
    • T. Hofmann. "Probabilistic Latent Semantic Analysis." Uncertainty in AI, 1999.
  – Feature clustering:
    • W. Li and A. McCallum. "Semi-supervised sequence modeling with syntactic topic models." AAAI 2005.
    • Z. Niu, D.-H. Ji, and C. L. Tan. "A semi-supervised feature clustering algorithm with application to word sense disambiguation." HLT 2005.
• Semi-supervised feature selection is a new area to explore:
  – Z. Zhao and H. Liu. "Spectral Feature Selection for Supervised and Unsupervised Learning." ICML 2007.

Related Work: Kernel learning
• Beware that many kernel learning algorithms are transductive (they don't apply to samples not originally in your training data):
  – N. Cristianini et al. "On kernel target alignment." NIPS 2002.
  – G. Lanckriet et al. "Learning a kernel matrix with semi-definite programming." ICML 2002.
  – C. Ong, A. Smola, and R. Williamson. "Learning the kernel with hyperkernels." Journal of Machine Learning Research (6), 2005.
• Discussed here:
  – Neighborhood kernel and others:
    • J. Weston et al. "Semi-supervised protein classification using cluster kernels." Bioinformatics 21(15), 2005.
  – Gaussian mixture kernel and KernelBoost:
    • T. Hertz, A. Bar-Hillel, and D. Weinshall. "Learning a Kernel Function for Classification with Small Training Samples." ICML 2006.
• Other keywords: "distance learning", "distance metric learning"
  – E. Xing, M. Jordan, and S. Russell. "Distance metric learning with applications to clustering with side-information." NIPS 2002.

Related work: SSL
• Good survey article:
  – Xiaojin Zhu. "Semi-supervised learning literature survey." University of Wisconsin-Madison CS Technical Report.
• Graph-based SSL methods are popular. Some can be related to kernel learning.
• SSL on sequence models (of particular importance to NLP):
  – J. Lafferty, X. Zhu, and Y. Liu. "Kernel Conditional Random Fields: Representation and Clique Selection." ICML 2004.
  – Y. Altun, D. McAllester, and M. Belkin. "Maximum margin semi-supervised learning for structured variables." NIPS 2005.
  – U. Brefeld and T. Scheffer. "Semi-supervised learning for structured output variables." ICML 2006.
• Other promising methods:
  – Fisher kernels and features from generative models:
    • T. Jaakkola and D. Haussler. "Exploiting generative models in discriminative classifiers." NIPS 1998.
    • A. Holub, M. Welling, and P. Perona. "Exploiting unlabelled data for hybrid object classification." NIPS 2005 Workshop on Inter-class Transfer.
    • A. Fraser and D. Marcu. "Semi-supervised word alignment." ACL 2006.
  – Multi-task learning formulation:
    • R. Ando and T. Zhang. "A High-Performance Semi-Supervised Learning Method for Text Chunking." ACL 2006.
    • J. Blitzer, R. McDonald, and F. Pereira. "Domain adaptation with structural correspondence learning." EMNLP 2006.

Summary
• Feature learning and kernel learning are two sides of the same coin
• Feature/kernel learning in SSL consists of a 2-stage approach:
  – (1) learn a better feature representation/kernel from unlabeled data
  – (2) train a traditional classifier on labeled data
• 3 cases of inadequate feature representation: correlated, irrelevant, not expressive
• Particular algorithms presented:
  – LSA, neighborhood kernels, GMM kernels
• Many related areas, many algorithms, but the underlying assumptions may be very similar