Information Retrieval & Data Mining: A Linear Algebraic Perspective
Petros Drineas
Rensselaer Polytechnic Institute, Computer Science Department
To access my web page: drineas

Modern data
Facts
Computers make it easy to collect and store data. Costs of storage are very low and are dropping very fast (most laptops have a storage capacity of more than 100 GB).
When it comes to storing data, the current policy typically is "store everything in case it is needed later" instead of deciding what could be deleted.
Data Mining
Extract useful information from the massive amount of available data.

About the tutorial
Tools
Introduce matrix algorithms and matrix decompositions for data mining and information retrieval applications.
Goal
Learn a model for the underlying "physical" system generating the dataset.
Math is necessary to design and analyze principled algorithmic techniques to mine the massive datasets that have become ubiquitous in scientific research.

Why linear (or multilinear) algebra?
Data are represented by matrices
Numerous modern datasets are in matrix form.
Data are represented by tensors
Data in the form of tensors (multi-mode arrays) have become very common in the data mining and information retrieval literature over the last few years.
Linear algebra (and numerical analysis) provide the fundamental mathematical and algorithmic tools to deal with matrix and tensor computations.
(This tutorial will focus on matrices; pointers to some tensor decompositions will be provided.)

Why matrix decompositions?
Matrix decompositions (e.g., SVD, QR, SDD, CX and CUR, NMF, MMMF, etc.)
• They use the relationships between the available data in order to identify components of the underlying physical system generating the data.
• Some assumptions on the relationships between the underlying components are necessary.
• Very active area of research; some matrix decompositions are more than a century old, whereas others are very recent.

Overview
• Datasets in the form of matrices (and tensors)
• Matrix Decompositions
  Singular Value Decomposition (SVD)
  Column-based Decompositions (CX, interpolative decomposition)
  CUR-type decompositions
  Non-negative matrix factorization
  Semi-Discrete Decomposition (SDD)
  Maximum-Margin Matrix Factorization (MMMF)
  Tensor decompositions
• Regression
  Coreset constructions
  Fast algorithms for least-squares regression

Datasets in the form of matrices
We are given m objects and n features describing the objects. (Each object has n numeric values describing it.)
Dataset
An m-by-n matrix A; Aij shows the "importance" of feature j for object i.
Every row of A represents an object.
Goal
We seek to understand the structure of the data, e.g., the underlying process generating the data.

Market basket matrices
Common representation for association rule mining.
m customers by n products (e.g., milk, bread, wine, etc.)
Aij = quantity of the j-th product purchased by the i-th customer
Data mining tasks
- Find association rules, e.g., customers who buy product x buy product y with probability 89%.
- Such rules are used to make item display decisions, advertising decisions, etc.

Social networks (e-mail graph)
Represents the email communications between groups of users.
n users by n users
Aij = number of emails exchanged between users i and j during a certain time period
Data mining tasks
- Cluster the users
- Identify "dense" networks of users (dense subgraphs)

Document-term matrices
A collection of documents is represented by an m-by-n matrix (bag-of-words model).
m documents by n terms (words)
Aij = frequency of the j-th term in the i-th document
Data mining tasks
- Cluster or classify documents
- Find "nearest neighbors"
- Feature selection: find a subset of terms that (accurately) clusters or classifies documents.

Recommendation systems
The m-by-n matrix A represents m customers and n products.
Aij = utility of the j-th product to the i-th customer
Data mining task
Given a few samples from A, recommend high-utility products to customers.

Biology: microarray data
Microarray Data (Nielsen et al., Lancet, 2002)
Rows: genes (≈ 5,500)
Columns: 46 soft-tissue tumour specimens (different types of cancer, e.g., LIPO, LEIO, GIST, MFH, etc.)
Task: pick a subset of genes (if it exists) that suffices in order to identify the "cancer type" of a patient.

Human genetics
Single Nucleotide Polymorphisms (SNPs): the most common type of genetic variation in the genome across different individuals. They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).
[Example individuals-by-SNPs genotype table: each row is an individual, each column a SNP; entries are genotypes such as AG, CT, GG, TT, …]

Matrices including hundreds of individuals and more than 300,000 SNPs are publicly available.
Task: split the individuals into different clusters depending on their ancestry, and find a small subset of genetic markers that are "ancestry informative".

Tensors: recommendation systems
Economics:
• Utility is an ordinal and not a cardinal concept.
• Compare products; don't assign utility values.
Recommendation Model Revisited:
• Every customer has an n-by-n matrix (whose entries are +1 or -1) representing pairwise product comparisons.
• There are m such matrices (one per customer), forming an n-by-n-by-m 3-mode tensor A.

Tensors: hyperspectral images
Spectrally resolved images (ca. 500 × 500 pixels at 128 frequencies) may be viewed as a tensor.
Task: identify and analyze regions of significance in the images.
Overview (continued): next, the Singular Value Decomposition (SVD).

The Singular Value Decomposition (SVD)
Matrix rows: points (vectors) in a Euclidean space. E.g., given 2 objects (x and d), each described with respect to two features, we get a 2-by-2 matrix.
Recall: data matrices have m rows (one for each object) and n columns (one for each feature).
Two objects are "close" if the angle between their corresponding vectors is small.

SVD, intuition
Let the blue circles represent m data points in a 2-D Euclidean space. Then, the SVD of the m-by-2 matrix of the data will return:
1st (right) singular vector: direction of maximal variance.
2nd (right) singular vector: direction of maximal variance, after removing the projection of the data along the first singular vector.

Singular values
σ1: measures how much of the data variance is explained by the first singular vector.
σ2: measures how much of the data variance is explained by the second singular vector.

SVD: formal definition
A = U Σ V^T, where ρ is the rank of A.
U (V): orthogonal matrix containing the left (right) singular vectors of A.
Σ: diagonal matrix containing the singular values of A. Let σ1 ≥ σ2 ≥ … ≥ σρ be the entries of Σ.
Exact computation of the SVD takes O(min{mn², m²n}) time. The top k left/right singular vectors/values can be computed faster using Lanczos/Arnoldi methods.

Rank-k approximations via the SVD
A = U Σ V^T (objects by features): the top k singular directions capture the "significant" part of the data, and the remaining directions are treated as noise.
Rank-k approximation: Ak = Uk Σk Vk^T.
Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A.
Σk: diagonal matrix containing the top k singular values of A.

PCA and SVD
Principal Components Analysis (PCA) essentially amounts to the computation of the Singular Value Decomposition (SVD) of a covariance matrix.
SVD is the algorithmic tool behind MultiDimensional Scaling (MDS) and Factor Analysis.
SVD is "the Rolls-Royce and the Swiss Army Knife of Numerical Linear Algebra."* (*Dianne O'Leary, MMDS '06)

Ak as an optimization problem
Ak minimizes the Frobenius norm error ||A − Y X||_F over all m-by-k matrices Y and k-by-n matrices X.
Given Y, it is easy to find X from standard least squares. However, the fact that we can find the optimal Y is intriguing!
Optimal Y = Uk, optimal X = Uk^T A.
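As a concrete illustration of the rank-k approximation Ak = Uk Σk Vk^T described above, here is a minimal numpy sketch (the matrix A and the rank k are placeholders for whatever dataset is at hand):

```python
import numpy as np

def rank_k_approximation(A, k):
    """Best rank-k approximation of A (in Frobenius and spectral norm) via the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # A_k = U_k S_k V_k^T

# Toy example: a noisy, nearly rank-2 object-by-feature matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 20)) \
    + 0.01 * rng.standard_normal((100, 20))
A2 = rank_k_approximation(A, 2)
print(np.linalg.norm(A - A2) / np.linalg.norm(A))      # small relative residual
```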
LSI: Ak for document-term matrices (Berry, Dumais, and O'Brien '92)
Latent Semantic Indexing (LSI)
Replace A by Ak; apply clustering/classification algorithms on Ak.
(A is the m-documents-by-n-terms matrix; Aij = frequency of the j-th term in the i-th document.)
Pros
- Less storage for small k: O(km + kn) vs. O(mn).
- Improved performance: documents are represented in a "concept" space.
Cons
- Ak destroys sparsity.
- Interpretation is difficult.
- Choosing a good k is tough.

Ak and k-means clustering (Drineas, Frieze, Kannan, Vempala, and Vinay '99)
k-means clustering
A standard objective function that measures cluster quality. (Often also denotes an iterative algorithm that attempts to optimize the k-means objective function.)
k-means objective
Input: a set of m points in R^n, and a positive integer k.
Output: a partition of the m points into k clusters.
Partition the m points into k clusters in order to minimize the sum of the squared Euclidean distances from each point to its cluster centroid. (E.g., split the input points into 5 clusters; the cluster centroid is the "average" of all the points in the cluster.)

k-means: a matrix formulation
Let A be the m-by-n matrix representing m points in R^n. Then, we seek to minimize ||A − X X^T A||_F² over "cluster membership" matrices X.
X is a special m-points-by-k-clusters "cluster membership" matrix: Xij denotes whether the i-th point belongs to the j-th cluster.
• Columns of X are normalized to have unit length. (We divide each column by the square root of the number of points in the cluster.)
• Every row of X has at most one non-zero element. (Each point belongs to at most one cluster.)
• X is an orthogonal matrix, i.e., X^T X = I.

SVD and k-means
If we only require that X is an orthogonal matrix and remove the condition on the number of non-zero entries per row of X, then ||A − X X^T A||_F² is easy to minimize!
The solution is X = Uk.

Using SVD to solve k-means
• We can get a 2-approximation algorithm for k-means. (Drineas, Frieze, Kannan, Vempala, and Vinay '99, '04)
• We can get heuristic schemes to assign points to clusters (see the sketch below). (Zha, He, Ding, Simon, and Gu '01)
• There exist PTAS (based on random projections) for the k-means problem. (Ostrovsky and Rabani '00, '02)
• Deeper connections between SVD and clustering in Kannan, Vempala, and Vetta '00, '04.
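A minimal sketch of the SVD-based heuristic mentioned above: project the m points onto the top-k singular subspace and run Lloyd's k-means there (assuming scikit-learn is available; A and k are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans   # assumes scikit-learn is installed

def svd_then_kmeans(A, k):
    """Cluster the rows of A by first projecting them onto the top-k singular
    subspace (the relaxed solution X = U_k), then running k-means in that space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    reduced = U[:, :k] * s[:k]        # coordinates of each point along the top-k directions
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
```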
Ak and Kleinberg's HITS algorithm (Kleinberg '98, '99)
Hypertext Induced Topic Selection (HITS)
A link analysis algorithm that rates Web pages for their authority and hub scores. These values can be used to rank Web search results.
Authority score: an estimate of the value of the content of the page. Authority: a page that is pointed to by many pages with high hub scores.
Hub score: an estimate of the value of the links from this page to other pages. Hub: a page pointing to many pages that are good authorities.
Recursive definition; notice that each node has two scores.
Phase 1: Given a query term (e.g., "jaguar"), find all pages containing the query term (root set). Expand the resulting graph by one move forward and backward (base set).
Phase 2: Let A be the adjacency matrix of the (directed) graph of the base set. Let h, a ∈ R^n be the vectors of hub (authority) scores. Then,
h = A a and a = A^T h, so h = A A^T h and a = A^T A a.
Thus, the top left (right) singular vector of A corresponds to hub (authority) scores.
What about the rest? They provide a natural way to extract additional densely linked collections of hubs and authorities from the base set. See the "jaguar" example in Kleinberg '99.
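A small sketch of Phase 2: compute hub and authority scores by power iteration on A A^T and A^T A, i.e., the top left/right singular vectors of the base-set adjacency matrix (the toy graph below is illustrative only):

```python
import numpy as np

def hits_scores(A, iters=100):
    """Hub and authority scores for the adjacency matrix A of the base set.
    Iterating a = A^T h, h = A a (with normalization) converges to the top
    left/right singular vectors of A, i.e., the HITS scores."""
    h = np.ones(A.shape[0])
    a = np.ones(A.shape[1])
    for _ in range(iters):
        a = A.T @ h
        a /= np.linalg.norm(a)
        h = A @ a
        h /= np.linalg.norm(h)
    return h, a

# Toy directed graph: page 0 links to pages 1 and 2; page 3 links to page 1.
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
hubs, auths = hits_scores(A)
print(hubs.round(2), auths.round(2))   # page 0 is the best hub, page 1 the best authority
```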
SVD example: microarray data
Microarray Data (Nielsen et al., Lancet, 2002)
Columns: genes (≈ 5,500)
Rows: 32 patients, three different cancer types (GIST, LEIO, SynSarc)
Applying k-means with k=3 in the space of the top principal components results in 3 misclassifications; applying k-means with k=3 but retaining 4 PCs results in one misclassification.
Can we find actual genes (as opposed to eigengenes) that achieve similar results?

SVD example: ancestry-informative SNPs
Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome across different individuals. They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).
There are ≈ 10 million SNPs in the human genome, so the individuals-by-SNPs table could have ~10 million columns.
Each individual carries two copies of a chromosome (father, mother). Focus at a specific locus and assay the observed nucleotide bases (alleles); at a SNP, exactly two alternate alleles appear (e.g., C and T).
An individual could be:
- Heterozygotic (in our study, CT = TC): encode as 0
- Homozygotic at the first allele, e.g., CC: encode as +1
- Homozygotic at the second allele, e.g., TT: encode as -1
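A minimal sketch of the 0/±1 encoding just described, turning a table of genotype strings into the individuals-by-SNPs matrix used in the analyses that follow (the choice of "first" allele per SNP is arbitrary in this sketch):

```python
import numpy as np

def encode_genotypes(genotypes):
    """Encode a table of genotype strings (e.g. 'CT', 'CC', 'TT') as a -1/0/+1 matrix.
    Heterozygotes -> 0; homozygotes -> +1 or -1 depending on which allele appears."""
    genotypes = np.asarray(genotypes, dtype=object)
    m, n = genotypes.shape
    A = np.zeros((m, n))
    for j in range(n):
        first_allele = genotypes[0, j][0]          # reference allele for this SNP (arbitrary)
        for i in range(m):
            g = genotypes[i, j]
            if g[0] != g[1]:
                A[i, j] = 0.0                      # heterozygotic, e.g. CT
            elif g[0] == first_allele:
                A[i, j] = +1.0                     # homozygotic in the first allele
            else:
                A[i, j] = -1.0                     # homozygotic in the second allele
    return A

# Toy example with two individuals and three SNPs.
print(encode_genotypes([["CT", "GG", "AA"],
                        ["CC", "GT", "AA"]]))
```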
(a) Why are SNPs really important?
Association studies: locating causative genes for common complex disorders (e.g., diabetes, heart disease, etc.) is based on identifying associations between affection status and known SNPs.
No prior knowledge about the function of the gene(s) or the etiology of the disorder is necessary.
The subsequent investigation of candidate genes that are in physical proximity to the associated SNPs is the first step towards understanding the etiological "pathway" of a disorder and designing a drug.

(b) Why are SNPs really important?
Among different populations (e.g., European, Asian, African, etc.), different patterns of SNP allele frequencies or SNP correlations are often observed.
Understanding such differences is crucial in order to develop the "next generation" of drugs that will be "population specific" (eventually "genome specific") and not just "disease specific".

The HapMap project
• Mapping the whole genome sequence of a single individual is very expensive.
• Mapping all the SNPs is also quite expensive, but the costs are dropping fast.
HapMap project (~$130,000,000 funding from NIH and other sources): map approx. 4 million SNPs for 270 individuals from 4 different populations (YRI, CEU, CHB, JPT), in order to create a "genetic map" to be used by researchers.
Also, funding from pharmaceutical companies, NSF, the Department of Justice*, etc.
*Is it possible to identify the ethnicity of a suspect from his DNA?

CHB and JPT
Let A be the 90 × 2.7 million matrix of the CHB and JPT populations in HapMap.
• Run SVD (PCA) on A, keep the top two (left) singular vectors, and plot the results.
• Run a (naïve, e.g., k-means) clustering algorithm to split the data points into two clusters.
Paschou, Ziv, Burchard, Mahoney, and Drineas, to appear in PLOS Genetics '07 (data from E. Ziv and E. Burchard, UCSF)
Paschou, Mahoney, Javed, Kidd, Pakstis, Gu, Kidd, and Drineas, Genome Research '07 (data from K. Kidd, Yale University)

EigenSNPs can not be assayed…
Not altogether satisfactory: the (top two left) singular vectors are linear combinations of all SNPs and, of course, can not be assayed!
Can we find actual SNPs that capture the information in the (top two left) singular vectors? (E.g., spanning the same subspace…) We will get back to this later.
Overview (continued): next, Column-based Decompositions (CX, interpolative decomposition).

CX decomposition
Constrain the m-by-k factor Y from the Ak optimization problem to contain exactly k columns of A; notation: replace Y by C(olumns). We therefore seek to minimize ||A − C X||_F.
Easy to prove that the optimal X = C^+ A. (C^+ is the Moore-Penrose pseudoinverse of C.)
Also called "interpolative approximation" (some extra conditions on the elements of X are required).
Why? If A is an object-feature matrix, then selecting "representative" columns is equivalent to selecting "representative" features. This leads to easier interpretability; compare to eigenfeatures, which are linear combinations of all features.

Column Subset Selection problem (CSS)
Given an m-by-n matrix A, find k columns of A forming an m-by-k matrix C that minimizes ||A − C C^+ A|| (in the Frobenius or spectral norm) over all O(n^k) choices for C.
C^+: pseudoinverse of C, easily computed via the SVD of C. (If C = U Σ V^T, then C^+ = V Σ^{-1} U^T.)
P_C = C C^+ is the projector matrix onto the subspace spanned by the columns of C; P_C A is the projection of A onto that subspace.
The spectral or 2-norm of an m-by-n matrix X is ||X||_2 = max_{y≠0} ||Xy||_2 / ||y||_2.
Complexity of the problem? O(n^k mn) trivially works; the problem is NP-hard if k grows as a function of n. (NP-hardness in Civril & Magdon-Ismail '07)

A lower bound for the CSS problem
For any m-by-k matrix C consisting of at most k columns of A:
||A − P_C A||_{2,F} ≥ ||A − Ak||_{2,F}
Remarks:
1. This holds in both the spectral and the Frobenius norm.
2. This is a, potentially, weak lower bound.
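A tiny sketch of evaluating a candidate column subset for the CSS problem: form C from the chosen columns, set X = C^+ A (the optimal X), and measure the residual norm:

```python
import numpy as np

def css_error(A, cols, norm_ord='fro'):
    """Residual of the CX decomposition for a given set of column indices:
    ||A - C C^+ A|| with C = A[:, cols] and the optimal X = C^+ A.
    Use norm_ord='fro' for the Frobenius norm or norm_ord=2 for the spectral norm."""
    C = A[:, cols]
    X = np.linalg.pinv(C) @ A          # optimal X for the chosen columns
    return np.linalg.norm(A - C @ X, norm_ord)
```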
Prior work: numerical linear algebra
Numerical Linear Algebra algorithms for CSS
1. Deterministic, typically greedy approaches.
2. Deep connection with the Rank Revealing QR factorization.
3. Strongest results so far (spectral norm): in O(mn²) time, ||A − P_C A||_2 ≤ p(k,n) ||A − Ak||_2 for some function p(k,n).
4. Strongest results so far (Frobenius norm): require O(n^k) time.
Work on improving p(k,n): 1965 – today.

Prior work: theoretical computer science
Theoretical Computer Science algorithms for CSS
1. Randomized approaches, with some failure probability.
2. More than k columns are picked, e.g., O(poly(k)) columns.
3. Very strong bounds for the Frobenius norm in low polynomial time.
4. Not many spectral norm bounds…

The strongest Frobenius norm bound
Given an m-by-n matrix A, there exists an O(mn²) algorithm that picks at most O(k log k / ε²) columns of A such that, with probability at least 1 − 10^{-20},
||A − P_C A||_F ≤ (1 + ε) ||A − Ak||_F.

The CX algorithm
Input: m-by-n matrix A, 0 < ε < 1 (the desired accuracy)
Output: C, the matrix consisting of the selected columns
CX algorithm
• Compute probabilities pj summing to 1
• Let c = O(k log k / ε²)
• For each j = 1, 2, …, n, pick the j-th column of A with probability min{1, c·pj}
• Let C be the matrix consisting of the chosen columns (C has, in expectation, at most c columns)

Subspace sampling (Frobenius norm)
Vk: orthogonal matrix containing the top k right singular vectors of A.
Σk: diagonal matrix containing the top k singular values of A.
Remark: the rows of Vk^T are orthonormal vectors, but its columns (Vk^T)^(j) are not.
Subspace sampling in O(mn²) time: pj is proportional to the squared length of the j-th column of Vk^T, normalized so that the pj sum up to 1.

Prior work in TCS
Drineas, Mahoney, and Muthukrishnan 2005
• O(mn²) time, O(k²/ε²) columns
Drineas, Mahoney, and Muthukrishnan 2006
• O(mn²) time, O(k log k/ε²) columns
Deshpande and Vempala 2006
• O(mnk²) time and O(k² log k/ε²) columns
• They also prove the existence of k columns of A forming a matrix C with a Frobenius norm error within a small factor of ||A − Ak||_F; compare to the prior best existence result.

Open problems
Design:
• Faster algorithms (see below)
• Algorithms that achieve better approximation guarantees (a hybrid approach)

Prior work spanning NLA and TCS
Woolfe, Liberty, Rokhlin, and Tygert 2007 (also Martinsson, Rokhlin, and Tygert 2006)
• O(mn log k) time, k columns, same spectral norm bounds as prior work
• Beautiful application of the Fast Johnson-Lindenstrauss transform of Ailon-Chazelle

A hybrid approach (Boutsidis, Mahoney, and Drineas '07)
Given an m-by-n matrix A (assume m ≥ n for simplicity):
• (Randomized phase) Run a randomized algorithm to pick c = O(k log k) columns.
• (Deterministic phase) Run a deterministic algorithm on the above columns* to pick exactly k columns of A and form an m-by-k matrix C.
* Not so simple…
Our algorithm runs in O(mn²) time and its guarantees hold with probability at least 1 − 10^{-20}.

Comparison: Frobenius norm
1. We provide an efficient algorithmic result.
2. We guarantee a Frobenius norm bound that is at most (k log k)^{1/2} worse than the best known existential result.

Comparison: spectral norm
1. Our running time is comparable with NLA algorithms for this problem.
2. Our spectral norm bound grows as a function of (n − k)^{1/4} instead of (n − k)^{1/2}!
3. Do notice that with respect to k our bound is k^{1/4} log^{1/2} k worse than previous work.
4. To the best of our knowledge, our result is the first asymptotic improvement of the work of Gu & Eisenstat 1996.
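A minimal numpy sketch of the subspace-sampling step used by the CX algorithm and by the randomized phase described next: sample each column with probability proportional to the squared length of the corresponding column of Vk^T, keeping about c columns in expectation:

```python
import numpy as np

def subspace_sample_columns(A, k, c, rng=np.random.default_rng(0)):
    """Return indices of columns sampled with subspace-sampling probabilities
    p_j = ||(V_k^T)^(j)||_2^2 / k; column j is kept with probability min(1, c * p_j)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(Vt[:k, :] ** 2, axis=0)        # squared column lengths of V_k^T (sum to k)
    p = lev / k                                 # probabilities summing to 1
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * p)
    return np.flatnonzero(keep)
```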
Randomized phase: c = O(k log k) columns
• Compute probabilities pj summing to 1
• For each j = 1, 2, …, n, pick the j-th column of A with probability min{1, c·pj}
• Let C be the matrix consisting of the chosen columns (C has, in expectation, at most c columns)
Subspace sampling: Vk is the orthogonal matrix containing the top k right singular vectors of A, and Σk the diagonal matrix containing the top k singular values of A. The probabilities can be computed in O(mn²) time and are normalized so that they sum up to 1.
Remark: we need more elaborate subspace-sampling probabilities than previous work.

Deterministic phase: k columns
• Let S1 be the set of indices of the columns selected by the randomized phase.
• Let (Vk^T)_{S1} denote the set of columns of Vk^T with indices in S1. (An extra technicality is that the columns of (Vk^T)_{S1} must be rescaled…)
• Run a deterministic NLA algorithm on (Vk^T)_{S1} to select exactly k columns. (Any algorithm with p(k,n) = k^{1/2}(n − k)^{1/2} will do.)
• Let S2 be the set of indices of the selected columns (the cardinality of S2 is exactly k).
• Return A_{S2} (the columns of A corresponding to indices in S2) as the final output.

Back to SNPs: CHB and JPT
Let A be the 90 × 2.7 million matrix of the CHB and JPT populations in HapMap.
Can we find actual SNPs that capture the information in the top two left singular vectors?
Results:
Number of SNPs    Misclassifications
40 (c = 400)      6
50 (c = 500)      5
60 (c = 600)      3
70 (c = 700)      1
• Essentially as good as the best existing metric (informativeness).
• However, our metric is unsupervised! (Informativeness is supervised: it essentially identifies SNPs that are correlated with population membership, given such membership information.)
• The fact that we can select ancestry-informative SNPs in an unsupervised manner based on PCA is novel, and seems interesting.

Overview (continued): next, CUR-type decompositions.

CUR-type decompositions
A ≈ C U R, where C consists of a few (O(1)) columns of A, R consists of a few (O(1)) rows of A, and U is a carefully chosen small matrix.
Goal: make (some norm of) A − CUR small.
For any matrix A, we can find C, U and R such that the norm of A − CUR is almost equal to the norm of A − Ak. This might lead to a better understanding of the data.

Theorem: relative error CUR (Drineas, Mahoney, & Muthukrishnan '06, '07)
For any k, O(mn²) time suffices to construct C, U, and R such that
||A − C U R||_F ≤ (1 + ε) ||A − Ak||_F
holds with probability at least 1 − δ, by picking
O(k log k log(1/δ) / ε²) columns, and
O(k log²k log(1/δ) / ε⁶) rows.

From SVD to CUR
Exploit the structural properties of CUR to analyze data (A: m objects by n features). A CUR-type decomposition needs O(min{mn², m²n}) time.
Instead of reifying the Principal Components:
• Use PCA (a.k.a. SVD) to find how many Principal Components are needed to "explain" the data.
• Run CUR and pick columns/rows instead of eigen-columns and eigen-rows!
• Assign meaning to actual columns/rows of the matrix! Much more intuitive! Sparse!
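A small sketch of assembling a CUR approximation once column and row indices have been chosen (e.g., by sampling schemes like the ones above); U = C^+ A R^+ is one common choice, and it minimizes ||A − CUR||_F for the given C and R:

```python
import numpy as np

def cur_decomposition(A, col_idx, row_idx):
    """Form a CUR approximation from chosen column indices and row indices:
    C and R are actual columns/rows of A, and U = C^+ A R^+."""
    C = A[:, col_idx]
    R = A[row_idx, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R
```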
CUR decompositions: a summary
• G.W. Stewart (Num. Math. '99, TR '04): C and R chosen via variants of the QR algorithm; U minimizes ||A − CUR||_F. No a priori bounds; solid experimental performance.
• Goreinov, Tyrtyshnikov, and Zamarashkin (LAA '97, Cont. Math. '01): C: columns that span maximal volume; R: rows that span maximal volume; U: W^+. Existential result; error bounds depend on ||W^+||_2; spectral norm bounds!
• Williams and Seeger (NIPS '01): C and R chosen uniformly at random; U: W^+. Experimental evaluation; A is assumed PSD; connections to the Nystrom method.
• Drineas, Kannan, and Mahoney (SODA '03, '04): C chosen w.r.t. column lengths; R w.r.t. row lengths; U computable in linear/constant time. Randomized algorithm; provable, a priori, bounds; explicit dependency on A − Ak.
• Drineas, Mahoney, and Muthu ('05, '06): C depends on the singular vectors of A; R depends on the singular vectors of C; U: (almost) W^+. (1 + ε) approximation to A − Ak; computable in SVDk(A) time.

Data applications of CUR
CMD factorization (Sun, Xie, Zhang, and Faloutsos '07; best paper award in the SIAM Conference on Data Mining '07)
A CUR-type decomposition that avoids the duplicate rows/columns that might appear in some earlier versions of CUR-type decompositions. Many interesting applications to large network datasets, DBLP, etc.; extensions to tensors.
Fast computation of Fourier Integral Operators (Demanet, Candes, and Ying '06)
Application to seismology imaging data (petabytes of data can be generated). The problem boils down to solving integral equations, i.e., matrix equations after discretization. CUR-type structures appear; uniform sampling seems to work well in practice.

Overview (continued): next, Non-negative matrix factorization.

Decompositions that respect the data
Non-negative matrix factorization (Lee and Seung '00)
Assume that the Aij are non-negative for all i, j. Constrain Y and X (in A ≈ Y X) to have only non-negative entries as well.
This should respect the structure of the data better than Ak = Uk Σk Vk^T, which introduces a lot of (difficult to interpret) negative entries.

The Non-negative Matrix Factorization
It has been extensively applied to:
1. Image mining (Lee and Seung '00)
2. The Enron email collection (Berry and Brown '05)
3. Other text mining tasks (Berry and Plemmons '04)
Algorithms for NMF:
1. Multiplicative update rules (Lee and Seung '00, Hoyer '02)
2. Gradient descent (Hoyer '04, Berry and Plemmons '04)
3. Alternating least squares (dating back to Paatero '94)

Algorithmic challenges for NMF
1. NMF (as stated above) is convex given Y or X, but not if both are unknown.
2. No unique solution: many matrices Y and X minimize the error.
3. Other optimization objectives could be chosen (e.g., spectral norm, etc.).
4. NMF becomes harder if sparsity constraints are included (e.g., X has a small number of non-zeros).
5. For the multiplicative update rules there exists some theory proving that they converge to a fixed point; this might be a local optimum or a saddle point.
6. Little theory is known for the other algorithms.
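A minimal sketch of the Lee-Seung multiplicative update rules mentioned above for the Frobenius-norm NMF objective, writing the two non-negative factors as Y (m-by-k) and X (k-by-n); there are no convergence checks, so treat it as illustrative only:

```python
import numpy as np

def nmf_multiplicative(A, k, iters=200, rng=np.random.default_rng(0)):
    """Multiplicative updates for A ~= Y X with Y, X >= 0 (Frobenius objective).
    Each update keeps the factors non-negative."""
    m, n = A.shape
    Y = rng.random((m, k)) + 1e-3
    X = rng.random((k, n)) + 1e-3
    eps = 1e-12                                   # guard against division by zero
    for _ in range(iters):
        X *= (Y.T @ A) / (Y.T @ Y @ X + eps)
        Y *= (A @ X.T) / (Y @ X @ X.T + eps)
    return Y, X
```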
Overview (continued): next, the Semi-Discrete Decomposition (SDD).

SemiDiscrete Decomposition (SDD)
A ≈ Xk Dk Yk^T
Dk: diagonal matrix.
Xk, Yk: all entries are in {-1, 0, +1}.
SDD identifies regions of the matrix that have homogeneous density; it looks for blocks of similar-height towers and similar-depth holes ("bump hunting").
Applications include image compression and text mining.
O'Leary and Peleg '83, Kolda and O'Leary '98, '00, O'Leary and Roth '06
The figures are from D. Skillicorn's book on data mining with matrix decompositions.

Overview (continued): next, Maximum-Margin Matrix Factorization (MMMF).

Collaborative Filtering and MMMF
User ratings for movies. Goal: predict the unrated movies (?).
Maximum Margin Matrix Factorization (MMMF)
A novel, semi-definite programming based matrix decomposition that seems to perform very well on real data, including the Netflix challenge.
Srebro, Rennie, and Jaakkola '04, Rennie and Srebro '05
Some pictures are from Srebro's presentation in NIPS '04.

A linear factor model
The ratings matrix Y is modeled as the product of two factors: the users' biases for different movie attributes times the movies' attribute profiles.
(Possible) solution to collaborative filtering: fit a rank (exactly) k matrix X to Y. If Y is fully observed, X is the best rank-k approximation to Y.
Azar, Fiat, Karlin, McSherry, and Saia '01; Drineas, Kerenidis, and Raghavan '02

Imputing the missing entries via SVD (Achlioptas and McSherry '01, '06)
Reconstruction Algorithm
• Compute the SVD of the matrix, filling in the missing entries with zeros.
• Some rescaling prior to computing the SVD is necessary, e.g., multiply by 1/(fraction of observed entries).
• Keep the resulting top k principal components.
Under assumptions on the "quality" of the observed entries, reconstruction accuracy bounds may be proven. The error bounds scale with the Frobenius norm of the matrix.
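A minimal sketch of the Achlioptas-McSherry style reconstruction just described: zero-fill the missing entries, rescale by 1/(fraction observed), and keep the top-k principal components (`mask` is a boolean array marking the observed entries):

```python
import numpy as np

def impute_via_svd(A_observed, mask, k):
    """Zero-fill missing entries, rescale by 1/(fraction observed), keep the top-k SVD."""
    frac = mask.mean()                                  # fraction of observed entries
    A_filled = np.where(mask, A_observed, 0.0) / frac   # rescaled zero-filled matrix
    U, s, Vt = np.linalg.svd(A_filled, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k reconstruction
```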
A convex formulation
MMMF
• Focus on ±1 rankings (for simplicity).
• Fit a prediction matrix X = U V^T to the observations.
Objectives (convex!)
• Minimize the total number of mismatches between the observed data and the predicted data.
• Keep the trace norm of X small.

MMMF and SDP
This may be formulated as a semi-definite program, and thus may be solved efficiently.

Bounding the factor contribution
Instead of a hard rank constraint (non-convex), a softer constraint is introduced: the total number of contributing factors (number of columns/rows in U/V^T) is unbounded, but their total contribution is bounded.

Overview (continued): next, tensor decompositions.

Tensors
Tensors appear both in Math and CS.
• Connections to complexity theory (i.e., matrix multiplication complexity)
• Dataset applications (i.e., Independent Component Analysis, higher order statistics, etc.)
Also, many practical applications, e.g., medical imaging, hyperspectral imaging, video, psychology, chemometrics, etc.
However, there does not exist a definition of tensor rank (and an associated tensor SVD) with the nice properties found in the matrix case.

Tensor rank
A definition of tensor rank: given a tensor, find the minimum number of rank-one tensors (outer products) into which it can be decomposed.
This agrees with the matrix rank for d = 2, and is related to computing bilinear forms and algebraic complexity theory. BUT:
- only weak bounds are known,
- tensor rank depends on the underlying ring of scalars,
- computing it is NP-hard,
- successive rank-one approximations are no good.

Tensor decompositions
Many tensor decompositions "matricize" the tensor:
1. PARAFAC, Tucker, Higher-Order SVD, DEDICOM, etc.
2. Most are computed via iterative algorithms (e.g., alternating least squares).
Given a tensor, "unfold" it to create the "unfolded" (matricized) matrix; see the sketch below.

Useful links on tensor decompositions
• Workshop on Algorithms for Modern Massive Data Sets (MMDS) '06: http://www.stanford.edu/group/mmds/ (check the tutorial by Lek-Heng Lim on tensor decompositions).
• Tutorial by Faloutsos, Kolda, and Sun in the SIAM Data Mining Conference '07.
• Tammy Kolda's web page: http://csmr.ca.sandia.gov/~tgkolda/
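A tiny sketch of the "unfolding" (matricization) step that many of these decompositions rely on, for a 3-mode numpy array:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization of a 3-mode tensor: move `mode` to the front and
    flatten the remaining two modes into columns."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

# Toy example: an n1 x n2 x n3 tensor and the shapes of its three unfoldings.
T = np.arange(2 * 3 * 4).reshape(2, 3, 4)
print(unfold(T, 0).shape, unfold(T, 1).shape, unfold(T, 2).shape)  # (2, 12) (3, 8) (4, 6)
```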
Overview (continued): finally, regression — coreset constructions and fast algorithms for least-squares regression.

Problem definition and motivation
In many applications (e.g., statistical data analysis and scientific computation), one has n observations. Model y(t) (unknown) as a linear combination of d basis functions; A is an n x d "design matrix" (n >> d). In matrix-vector notation, the model is b ≈ Ax.

Least-norm approximation problems
Recall the linear measurement model b ≈ Ax. In order to estimate x, solve min_x ||b − Ax||.

Application: data analysis in science
• First application: Astronomy. Predicting the orbit of the asteroid Ceres (in 1801!). Gauss (1809); see also Legendre (1805) and Adrain (1808). First application of "least squares optimization", and it runs in O(nd²) time!
• Data analysis: fit parameters of a biological, chemical, economical, physical (astronomical), social, internet, etc. model to experimental data.

Norms of common interest
Let the residual be r = b − Ax.
Least-squares approximation: minimize ||r||_2.
Chebyshev or mini-max approximation: minimize ||r||_∞.
Sum of absolute residuals approximation: minimize ||r||_1.
Lp norms and their unit balls: recall the Lp norm ||x||_p = (Σ_i |x_i|^p)^{1/p} for p ≥ 1; standard inequality relationships hold between the different Lp norms.

Lp regression problems
We are interested in over-constrained Lp regression problems, n >> d. Typically, there is no x such that Ax = b; we want to find the "best" x such that Ax ≈ b.
Lp regression problems are convex programs (or better). There exist poly-time algorithms; we want to solve them faster.

Exact solution to L2 regression
Cholesky decomposition: if A is full rank and well-conditioned, decompose A^T A = R^T R, where R is upper triangular, and solve the normal equations R^T R x = A^T b.
QR decomposition: slower but numerically stable, especially if A is rank-deficient. Write A = QR, and solve Rx = Q^T b.
Singular Value Decomposition: most expensive, but best if A is very ill-conditioned. Write A = U Σ V^T, in which case x_OPT = A^+ b = V Σ^{-1} U^T b (A^+ is the pseudoinverse of A); A x_OPT is the projection of b onto the subspace spanned by the columns of A.
Complexity is O(nd²), but the constant factors differ.

Questions …
Approximation algorithms: can we approximately solve Lp regression faster than "exact" methods?
Core-sets (or induced sub-problems): can we find a small set of constraints such that solving the Lp regression on those constraints gives an approximation to the original problem?

Randomized algorithms for Lp regression
Alg. 1: p = 2, sampling (core-set), (1+ε)-approx, O(nd²) — Drineas, Mahoney, and Muthu '06, '07
Alg. 2: p = 2, projection (no core-set), (1+ε)-approx, O(nd log d) — Sarlos '06; Drineas, Mahoney, Muthu, and Sarlos '07
Alg. 3: p ∈ [1,∞), sampling (core-set), (1+ε)-approx, O(nd⁵) + o("exact") — DasGupta, Drineas, Harb, Kumar, & Mahoney '07
Note: Clarkson '05 gets a (1+ε)-approximation for L1 regression in O*(d^{3.5}/ε⁴) time. He preprocessed [A, b] to make it "well-rounded" or "well-conditioned" and then sampled.

Algorithm 1: Sampling for L2 regression
Algorithm
1. Fix a set of probabilities pi, i = 1…n, summing up to 1.
2. Pick the i-th row of A and the i-th element of b with probability min{1, r·pi}, and rescale both by (1/min{1, r·pi})^{1/2}.
3. Solve the induced problem (the sampled rows of A and the sampled "rows" of b, with the scaling accounting for undersampling).
Note: in expectation, at most r rows of A and r elements of b are kept.

Our results for p = 2
If the pi satisfy a condition (below), then with probability at least 1 − δ the solution x̃ of the sampled problem satisfies
||A x̃ − b||_2 ≤ (1 + ε) ||A x_OPT − b||_2
(a bound on ||x_OPT − x̃||_2 involving κ(A), the condition number of A, also holds). The sampling complexity is r = O(d log d / ε²).
Notation: U_(i) is the i-th row of U; ρ is the rank of A; U is the orthogonal matrix containing the left singular vectors of A.

Condition on the probabilities
The condition that the pi must satisfy is, for some β ∈ (0, 1]:
pi ≥ β ||U_(i)||_2² / ρ
(i.e., the probabilities are at least a β fraction of the normalized squared lengths of the rows of the matrix of left singular vectors of A).
Notes:
• O(nd²) time suffices (to compute the probabilities and to construct a core-set); a sketch follows.
• Important question: is O(nd²) necessary? Can we compute the pi's, or construct a core-set, faster?
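A minimal sketch of Algorithm 1 with probabilities proportional to the squared row lengths of U (so the condition above holds with β = 1, assuming A has full column rank); A, b and the sample-size parameter r are placeholders:

```python
import numpy as np

def sampled_l2_regression(A, b, r, rng=np.random.default_rng(0)):
    """Keep row i with probability min(1, r*p_i), where p_i is proportional to the
    squared length of the i-th row of U (the left singular vectors of A), rescale the
    kept rows, and solve the induced small least-squares problem."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    p = np.sum(U ** 2, axis=1) / U.shape[1]        # row leverage scores / d, sum to 1
    q = np.minimum(1.0, r * p)
    keep = rng.random(A.shape[0]) < q
    scale = 1.0 / np.sqrt(q[keep])                 # rescale to account for undersampling
    x_tilde, *_ = np.linalg.lstsq(scale[:, None] * A[keep], scale * b[keep], rcond=None)
    return x_tilde
```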
The Johnson-Lindenstrauss lemma
Results for J-L:
• Johnson & Lindenstrauss '84: project to a random subspace
• Frankl & Maehara '88: random orthogonal matrix
• DasGupta & Gupta '99: matrix with entries from N(0,1), normalized
• Indyk & Motwani '98: matrix with entries from N(0,1)
• Achlioptas '03: matrix with entries in {-1, 0, +1}
• Alon '03: optimal dependency on n, and almost optimal dependency on ε

Fast J-L transform (Ailon & Chazelle '06)
Multiplication of the vectors by PHD is "fast", since:
• Du is O(d), since D is diagonal;
• HDu is O(d log d), using Fast Fourier Transform-style algorithms;
• PHDu is O(poly(log n)), since P has on average O(poly(log n)) non-zeros per row.

O(nd log d) L2 regression
Fact 1: since H_n (the n-by-n Hadamard matrix) and D_n (an n-by-n diagonal matrix with ±1 on the diagonal, chosen uniformly at random) are orthogonal matrices,
||H_n D_n A x − H_n D_n b||_2 = ||Ax − b||_2.
Thus, we can work with H_n D_n A x − H_n D_n b. Let's use our sampling approach…
Fact 2: using a Chernoff-type argument, we can prove that the lengths of all the rows of the left singular vectors of H_n D_n A are, with probability at least .9, roughly uniform (uniformly small).
DONE! We can perform uniform sampling in order to keep r = O(d log d / ε²) rows of H_n D_n A; our L2 regression theorem guarantees the accuracy of the approximation.
Running time is O(nd log d), since we can use the fast Hadamard-Walsh transform to multiply H_n and D_n A.
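A small, explicitly materialized sketch of the preconditioning idea above (a hypothetical helper, not the O(nd log d) implementation: it builds the full Hadamard matrix, which requires n to be a power of 2, instead of applying a fast transform):

```python
import numpy as np
from scipy.linalg import hadamard   # n must be a power of 2 for this simple sketch

def srht_least_squares(A, b, r, rng=np.random.default_rng(0)):
    """Randomize with D (random signs), mix with the (orthonormalized) Hadamard
    matrix H, sample r rows uniformly, rescale, and solve the small problem."""
    n, d = A.shape
    D = rng.choice([-1.0, 1.0], size=n)            # random +/-1 diagonal
    H = hadamard(n) / np.sqrt(n)                   # orthonormal Hadamard matrix
    HA = H @ (D[:, None] * A)                      # H D A
    Hb = H @ (D * b)                               # H D b
    idx = rng.choice(n, size=r, replace=False)     # uniform row sampling
    scale = np.sqrt(n / r)                         # rescale for the subsampling
    x_tilde, *_ = np.linalg.lstsq(scale * HA[idx], scale * Hb[idx], rcond=None)
    return x_tilde
```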
Open problem: sparse approximations
Sparse approximations and L2 regression (Natarajan '95, Tropp '04, '06)
In the sparse approximation problem, we are given a d-by-n matrix A forming a redundant dictionary for R^d and a target vector b ∈ R^d, and we seek the sparsest x satisfying a bound on ||Ax − b||_2. In words, we seek a sparse, bounded-error representation of b in terms of the vectors in the dictionary.
This is (sort of) under-constrained least squares regression. Can we use the aforementioned ideas to get better and/or faster approximation algorithms for the sparse approximation problem?

Application: feature selection for RLSC
Regularized Least Squares Classification (RLSC)
Given a term-document matrix A and a class label for each document, find x_opt minimizing the regularized least-squares objective (squared loss on the labels plus a ridge penalty). Here c is the vector of labels; for simplicity assume two classes, thus c_i = ±1.
Given a new document-vector q, its classification is determined by the sign of its (regularized least-squares) predicted label.

Feature selection for RLSC
Is it possible to select a small number of actual features (terms) and apply RLSC only on the selected terms without a huge loss in accuracy?
This is a well-studied problem; supervised algorithms (they employ the class label vector c) exist.
We applied our L2 regression sampling scheme to select terms; it is unsupervised!

A smaller RLSC problem
TechTC data from the ODP (Gabrilovich and Markovitch '04): 100 term-document matrices; average size ≈ 20,000 terms and ≈ 150 documents.
In prior work, feature selection was performed using a supervised metric called information gain (IG), an entropic measure of correlation with class labels.
Conclusion of the experiments: our unsupervised technique had (on average) comparable performance to IG.

Conclusions and future directions
Linear algebraic techniques (e.g., matrix decompositions and regression) are fundamental in data mining and information retrieval.
Randomized algorithms for linear algebra computations contribute novel results and ideas, both from a theoretical as well as an applied perspective.
Important directions
• Faster algorithms
• More accurate algorithms
• Matching lower bounds
• Implementations and widely disseminated software
• Technology transfer to other scientific disciplines