Information Retrieval & Data Mining:
A Linear Algebraic Perspective
Petros Drineas
Rensselaer Polytechnic Institute
Computer Science Department
To access my web page:
drineas
Modern data
Facts
Computers make it easy to collect and store data.
Costs of storage are very low and are dropping very fast.
(most laptops have a storage capacity of more than 100 GB …)
When it comes to storing data, the typical policy is "store everything in case it is needed later"
instead of deciding what could be deleted.
Data mining
Data Mining
Extract useful information from the massive amount of available data.
About the tutorial
Tools
Introduce matrix algorithms and matrix decompositions for data mining and
information retrieval applications.
Goal
Learn a model for the underlying “physical” system generating the dataset.
data – algorithms – mathematics
Math is necessary to design and analyze principled algorithmic techniques to mine the massive
datasets that have become ubiquitous in scientific research.
Why linear (or multilinear) algebra?
Data are represented by matrices
Numerous modern datasets are in matrix form.
Data are represented by tensors
Data in the form of tensors (multi-mode arrays) have become very common in the data mining
and information retrieval literature in recent years.
Linear algebra (and numerical analysis) provide the fundamental mathematical
and algorithmic tools to deal with matrix and tensor computations.
(This tutorial will focus on matrices; pointers to some tensor decompositions will be provided.)
Why matrix decompositions?
Matrix decompositions
(e.g., SVD, QR, SDD, CX and CUR, NMF, MMMF, etc.)
• They use the relationships between the available data in order to identify
components of the underlying physical system generating the data.
• Some assumptions on the relationships between the underlying components
are necessary.
• Very active area of research; some matrix decompositions are more than
one century old, whereas others are very recent.
Overview
• Datasets in the form of matrices (and tensors)
• Matrix Decompositions
Singular Value Decomposition (SVD)
Column-based Decompositions (CX, interpolative decomposition)
CUR-type decompositions
Non-negative matrix factorization
Semi-Discrete Decomposition (SDD)
Maximum-Margin Matrix Factorization (MMMF)
Tensor decompositions
• Regression
Coreset constructions
Fast algorithms for least-squares regression
Datasets in the form of matrices
We are given m objects and n features describing the objects.
(Each object has n numeric values describing it.)
Dataset
An m-by-n matrix A, Aij shows the “importance” of feature j for object i.
Every row of A represents an object.
Goal
We seek to understand the structure of the data, e.g., the underlying
process generating the data.
Market basket matrices
Common representation for association rule mining.
A is an m-customers-by-n-products matrix (products: e.g., milk, bread, wine, etc.);
Aij = quantity of the j-th product purchased by the i-th customer.
Data mining tasks
- Find association rules, e.g., customers who buy product x buy product y with probability 89%.
- Such rules are used to make item display decisions, advertising decisions, etc.
Social networks (e-mail graph)
Represents the email communications between groups of users.
A is an n-users-by-n-users matrix;
Aij = number of emails exchanged between users i and j during a certain time period.
Data mining tasks
- Cluster the users.
- Identify "dense" networks of users (dense subgraphs).
Document-term matrices
A collection of documents is represented by an m-by-n matrix (bag-of-words model).
A is an m-documents-by-n-terms (words) matrix;
Aij = frequency of the j-th term in the i-th document.
Data mining tasks
- Cluster or classify documents.
- Find "nearest neighbors".
- Feature selection: find a subset of terms that (accurately) clusters or classifies documents.
Recommendation systems
The m-by-n matrix A represents m customers and n products.
A is an m-customers-by-n-products matrix;
Aij = utility of the j-th product to the i-th customer.
Data mining task
- Given a few samples from A, recommend high-utility products to customers.
Biology: microarray data
Microarray Data (Nielsen et al., Lancet, 2002)
Rows: genes (≈ 5,500)
Columns: 46 soft-tissue tumour specimens (different types of cancer, e.g., LIPO, LEIO, GIST, MFH, etc.)
Tasks:
Pick a subset of genes (if it exists) that suffices in order to identify the "cancer type" of a patient.
Human genetics
Single Nucleotide Polymorphisms: the most common type of genetic variation in the
genome across different individuals.
They are known locations at the human genome where two alternate nucleotide bases
(alleles) are observed (out of A, C, G, T).
SNPs
individuals
… AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA …
Matrices including hundreds of individuals and more than 300,000 SNPs are publicly available.
Task: split the individuals into different clusters depending on their ancestry, and
find a small subset of genetic markers that are "ancestry informative".
Tensors: recommendation systems
Economics:
• Utility is an ordinal, not a cardinal, concept.
• Compare products; don't assign utility values.
Recommendation Model Revisited:
• Every one of the m customers has an n-products-by-n-products matrix (whose entries are +1/-1)
representing pairwise product comparisons.
• There are m such matrices, forming an n-by-n-by-m 3-mode tensor A.
Tensors: hyperspectral images
Spectrally resolved images may be viewed as a tensor:
ca. 500 pixels by ca. 500 pixels by 128 frequencies.
Task:
Identify and analyze regions of significance in the images.
Overview
x • Datasets in the form of matrices (and tensors)
• Matrix Decompositions
Singular Value Decomposition (SVD)
Column-based Decompositions (CX, interpolative decomposition)
CUR-type decompositions
Non-negative matrix factorization
Semi-Discrete Decomposition (SDD)
Maximum-Margin Matrix Factorization (MMMF)
Tensor decompositions
• Regression
Coreset constructions
Fast algorithms for least-squares regression
The Singular Value Decomposition (SVD)
Matrix rows: points (vectors) in a Euclidean space. E.g., given 2 objects (x & d), each described
with respect to two features, we get a 2-by-2 matrix.
Recall: data matrices have m rows (one for each object) and n columns (one for each feature).
Two objects are "close" if the angle between their corresponding vectors is small.
[Figure: objects x and d drawn as vectors in the (feature 1, feature 2) plane; θ(d,x) is the angle between them.]
SVD, intuition
Let the blue circles represent m data points in a 2-D Euclidean space.
Then, the SVD of the m-by-2 matrix of the data will return:
1st (right) singular vector: direction of maximal variance.
2nd (right) singular vector: direction of maximal variance, after removing the projection of the
data along the first singular vector.
[Figure: 2-D scatter plot with the 1st and 2nd (right) singular vector directions.]
Singular values
σ1: measures how much of the data variance is explained by the first singular vector.
σ2: measures how much of the data variance is explained by the second singular vector.
[Figure: the same scatter plot, with σ1 and σ2 shown along the two singular vector directions.]
SVD: formal definition
A = U S V^T, where ρ = rank(A).
U (V): orthogonal m-by-ρ (n-by-ρ) matrix containing the left (right) singular vectors of A.
S: ρ-by-ρ diagonal matrix containing the singular values of A.
Let σ1 ≥ σ2 ≥ … ≥ σρ > 0 be the entries of S.
Exact computation of the SVD takes O(min{mn^2, m^2 n}) time.
The top k left/right singular vectors/values can be computed faster using
Lanczos/Arnoldi methods.
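As a concrete illustration (not part of the original slides), a minimal Python/NumPy sketch of computing the full SVD and only the top-k singular triplets; the random matrix and the value of k are purely illustrative.

```python
import numpy as np
from scipy.sparse.linalg import svds   # Lanczos-type solver for a few singular triplets

A = np.random.rand(1000, 50)           # m-by-n data matrix (m objects, n features)

# Full SVD: A = U @ diag(s) @ Vt, singular values returned in non-increasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Only the top-k singular vectors/values, without computing the full SVD.
k = 5
Uk, sk, Vkt = svds(A, k=k)             # note: svds returns singular values in ascending order
order = np.argsort(sk)[::-1]
Uk, sk, Vkt = Uk[:, order], sk[order], Vkt[order, :]
```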
Rank-k approximations via the SVD
A = U S V^T (rows: objects, columns: features). The leading part of the spectrum captures the
significant structure of the data; the remainder is noise.
Rank-k approximations (Ak): Ak = Uk Sk Vk^T
Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A.
Sk: diagonal matrix containing the top k singular values of A.
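A small sketch (using the notation above; the matrix and k are illustrative) of building Ak from the top-k singular triplets and checking its Frobenius-norm error against the discarded singular values.

```python
import numpy as np

A = np.random.rand(200, 80)
k = 10
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # Ak = Uk Sk Vk^T, the best rank-k approximation

# The Frobenius-norm error equals the energy in the discarded singular values.
assert np.isclose(np.linalg.norm(A - Ak, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
```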
Principal Components Analysis (PCA) essentially
amounts to the computation of the Singular Value
Decomposition (SVD) of a covariance matrix.
SVD is the algorithmic tool behind MultiDimensional
Scaling (MDS) and Factor Analysis.
SVD is “the Rolls-Royce and the Swiss Army Knife of
Numerical Linear Algebra.”*
*Dianne O’Leary, MMDS ’06
Ak as an optimization problem
Find an m-by-k matrix B and a k-by-n matrix X that minimize ||A - BX||_F (Frobenius norm).
Given B, it is easy to find X from standard least squares.
However, the fact that we can find the optimal B is intriguing!
Optimal B = Uk, optimal X = Uk^T A.
LSI: Ak for document-term matrices
(Berry, Dumais, and O'Brien ’92)
Latent Semantic Indexing (LSI)
Replace A by Ak; apply clustering/classification algorithms on Ak.
(A is the m-documents-by-n-terms matrix; Aij = frequency of the j-th term in the i-th document.)
Pros
- Less storage for small k: O(km+kn) vs. O(mn).
- Improved performance: documents are represented in a "concept" space.
Cons
- Ak destroys sparsity.
- Interpretation is difficult.
- Choosing a good k is tough.
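A hedged LSI sketch (the toy term frequencies are illustrative, not a real corpus): replace A by its rank-k factors and compare documents in the k-dimensional "concept" space.

```python
import numpy as np

A = np.array([[2., 0., 1., 0.],        # rows: documents, columns: terms (toy counts)
              [1., 1., 0., 0.],
              [0., 0., 3., 1.],
              [0., 1., 2., 2.]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = U[:, :k] * s[:k]              # each document as a k-dimensional "concept" vector

# Cosine similarity between documents 0 and 1 in the concept space.
u, v = docs_k[0], docs_k[1]
print(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```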
Ak and k-means clustering
(Drineas, Frieze, Kannan, Vempala, and Vinay ’99)
k-means clustering
A standard objective function that measures cluster quality.
(Often denotes an iterative algorithm that attempts to optimize the k-means objective function.)
k-means objective
Input: set of m points in Rn, positive integer k
Output: a partition of the m points into k clusters
Partition the m points into k clusters in order to minimize the sum of the squared
Euclidean distances from each point to its cluster centroid.
k-means, cont'd
We seek to split the input points into 5 clusters.
The cluster centroid is the "average" of all the points in the cluster.
k-means: a matrix formulation
Let A be the m-by-n matrix representing m points in R^n. Then, we seek to minimize
||A - XX^T A||_F^2, where X is a special m-by-k "cluster membership" matrix: Xij is non-zero
if and only if the i-th point belongs to the j-th cluster.
(Rows of X correspond to points, columns to clusters.)
• Columns of X are normalized to have unit length.
(We divide each column by the square root of the number of points in the cluster.)
• Every row of X has at most one non-zero element.
(Each point belongs to at most one cluster.)
• X has orthonormal columns, i.e., X^T X = I.
SVD and k-means
If we only require that X has orthonormal columns and remove the condition on the number of
non-zero entries per row of X, then minimizing ||A - XX^T A||_F^2 is easy!
The solution is X = Uk.
Using SVD to solve k-means
• We can get a 2-approximation algorithm for k-means.
(Drineas, Frieze, Kannan, Vempala, and Vinay ’99, ’04)
• We can get heuristic schemes to assign points to clusters.
(Zha, He, Ding, Simon, and Gu ’01)
• There exist PTAS (based on random projections) for the k-means problem.
(Ostrovsky and Rabani ’00, ’02)
• Deeper connections between SVD and clustering in Kannan, Vempala, and Vetta ’00, ’04.
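A minimal sketch of the relaxation above (synthetic points; scikit-learn's KMeans is an assumption, any k-means routine would do): project the points onto the top-k singular directions and cluster there.

```python
import numpy as np
from sklearn.cluster import KMeans     # assumption: scikit-learn is available

rng = np.random.default_rng(0)
A = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 20)) for c in (0.0, 3.0, 6.0)])
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
coords = U[:, :k] * s[:k]              # = A @ Vk: points projected on the top-k right singular vectors

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
print(labels[:10])
```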
Ak and Kleinberg’s HITS algorithm
(Kleinberg ’98, ’99)
Hypertext Induced Topic Selection (HITS)
A link analysis algorithm that rates Web pages for their authority and hub scores.
Authority score: an estimate of the value of the content of the page.
Hub score: an estimate of the value of the links from this page to other pages.
These values can be used to rank Web search results.
Authority: a page that is pointed to by many pages with high hub scores.
Hub: a page pointing to many pages that are good authorities.
Recursive definition; notice that each node has two scores.
Ak and Kleinberg’s HITS algorithm
Phase 1:
Given a query term (e.g., “jaguar”), find all pages containing the query term (root set).
Expand the resulting graph by one move forward and backward (base set).
Ak and Kleinberg’s HITS algorithm
Phase 2:
Let A be the adjacency matrix of the (directed) graph of the base set.
Let h, a ∈ R^n be the vectors of hub (authority) scores. Then,
h = Aa and a = A^T h, so
h = AA^T h and a = A^T A a.
Thus, the top left (right) singular vector of A corresponds to
hub (authority) scores.
What about the rest?
They provide a natural way to extract additional densely
linked collections of hubs and authorities from the base set.
See the “jaguar” example in Kleinberg ’99.
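A small sketch of the HITS iteration on a toy directed graph (the adjacency matrix is illustrative); with the normalization below, h and a converge to the top left and right singular vectors of A.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],            # A[i, j] = 1 iff page i links to page j
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

h = np.ones(A.shape[0])                # hub scores
a = np.ones(A.shape[1])                # authority scores
for _ in range(100):
    a = A.T @ h                        # a = A^T h
    h = A @ a                          # h = A a
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

print("hubs:", np.round(h, 3), "authorities:", np.round(a, 3))
```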
SVD example: microarray data
Microarray Data (Nielsen et al., Lancet, 2002)
Columns: genes (≈ 5,500)
Rows: 32 patients, three different cancer types (GIST, LEIO, SynSarc)
SVD example: microarray data
Applying k-means with k=3 in this 3-D space (the space spanned by the top three principal
components) results in 3 misclassifications.
Applying k-means with k=3 but retaining 4 PCs results in one misclassification.
Can we find actual genes (as opposed to eigengenes) that achieve similar results?
SVD example: ancestry-informative SNPs
Single Nucleotide Polymorphisms: the most common type of genetic variation in the
genome across different individuals.
They are known locations at the human genome where two alternate nucleotide bases
(alleles) are observed (out of A, C, G, T).
SNPs
individuals
… AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA …
There are ≈ 10 million SNPs in the human genome, so this table could have ~10 million columns.
Two copies of a chromosome (one from the father, one from the mother): focus at a specific locus
and assay the observed nucleotide bases (alleles).
SNP: exactly two alternate alleles appear.
An individual could be:
- Heterozygotic (e.g., CT; in our study, CT = TC)  →  encode as 0
- Homozygotic at the first allele (e.g., CC)  →  encode as +1
- Homozygotic at the second allele (e.g., TT)  →  encode as -1
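A tiny sketch of this encoding (the genotype strings and the choice of "first" and "second" allele are illustrative, not taken from the HapMap data):

```python
def encode_genotype(genotype, first_allele, second_allele):
    """Encode one SNP genotype as +1 / 0 / -1 as described above."""
    if genotype == first_allele * 2:       # homozygotic at the first allele, e.g., "CC"
        return +1
    if genotype == second_allele * 2:      # homozygotic at the second allele, e.g., "TT"
        return -1
    return 0                               # heterozygotic, e.g., "CT" or "TC"

for g in ("CC", "TT", "CT", "TC"):
    print(g, encode_genotype(g, "C", "T"))
```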
(a) Why are SNPs really important?
Association studies:
Locating causative genes for common complex disorders (e.g., diabetes, heart
disease, etc.) is based on identifying association between affection status and
known SNPs.
No prior knowledge about the function of the gene(s) or the etiology of the
disorder is necessary.
The subsequent investigation of candidate genes that are in physical proximity
with the associated SNPs is the first step towards understanding the etiological
“pathway” of a disorder and designing a drug.
(b) Why are SNPs really important?
Among different populations (e.g., European, Asian, African, etc.), different patterns of SNP
allele frequencies or SNP correlations are often observed.
Understanding such differences is crucial in order to develop the “next
generation” of drugs that will be “population specific” (eventually “genome
specific”) and not just “disease specific”.
The HapMap project
• Mapping the whole genome sequence of a single individual is very expensive.
• Mapping all the SNPs is also quite expensive, but the costs are dropping fast.
HapMap project (~$130,000,000 funding from NIH and other sources):
Map approx. 4 million SNPs for 270 individuals from 4 different populations (YRI,
CEU, CHB, JPT), in order to create a “genetic map” to be used by researchers.
Also, funding from pharmaceutical companies, NSF, the Department of Justice*, etc.
*Is it possible to identify the ethnicity of a suspect from his DNA?
CHB and JPT
Let A be the 90-by-2.7 million matrix of the CHB and JPT populations in HapMap.
• Run SVD (PCA) on A, keep the two (left) singular vectors, and plot the results.
• Run a (naïve, e.g., k-means) clustering algorithm to split the data points in two
clusters.
Paschou, Ziv, Burchard, Mahoney, and Drineas, to appear in PLOS Genetics ’07
(data from E. Ziv and E. Burchard, UCSF)
Paschou, Mahoney, Javed, Kidd, Pakstis, Gu, Kidd, and Drineas, Genome Research ’07
(data from K. Kidd, Yale University)
EigenSNPs cannot be assayed…
Not altogether satisfactory: the (top two left) singular vectors are linear combinations of all
SNPs, and – of course – cannot be assayed!
Can we find actual SNPs that capture the information in the (top two left)
singular vectors?
(E.g., spanning the same subspace …)
Will get back to this later …
Overview
x • Datasets in the form of matrices (and tensors)
• Matrix Decompositions
x Singular Value Decomposition (SVD)
Column-based Decompositions (CX, interpolative decomposition)
CUR-type decompositions
Non-negative matrix factorization
Semi-Discrete Decomposition (SDD)
Maximum-Margin Matrix Factorization (MMMF)
Tensor decompositions
• Regression
Coreset constructions
Fast algorithms for least-squares regression
CX decomposition
Constrain B (the m-by-k factor from the optimization problem above) to contain exactly k columns of A.
Notation: replace B by C(olumns).
Easy to prove that the optimal X = C+A. (C+ is the Moore-Penrose pseudoinverse of C.)
Also called "interpolative approximation".
(Some extra conditions on the elements of X are required…)
CX decomposition
Why?
If A is an object-feature matrix, then selecting “representative” columns is
equivalent to selecting “representative” features.
This leads to easier interpretability; compare to eigenfeatures, which are
linear combinations of all features.
Column Subset Selection problem (CSS)
Given an m-by-n matrix A, find k columns of A forming an m-by-k matrix C that minimizes the
error ||A - CC+A|| (in the Frobenius or spectral norm) over all O(n^k) choices for C.
C+: the pseudoinverse of C, easily computed via the SVD of C.
(If C = U S V^T, then C+ = V S^{-1} U^T.)
P_C = CC+ is the projection matrix onto the subspace spanned by the columns of C.
Complexity of the problem? O(n^k · mn) time trivially works; the problem is NP-hard if k grows as a
function of n. (NP-hardness shown in Civril & Magdon-Ismail '07.)
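A short sketch (random C, illustrative sizes) of the pseudoinverse and the projector P_C used above:

```python
import numpy as np

C = np.random.rand(100, 8)                         # m-by-k, assumed full column rank here
U, s, Vt = np.linalg.svd(C, full_matrices=False)
C_pinv = Vt.T @ np.diag(1.0 / s) @ U.T             # C+ = V S^{-1} U^T
P_C = C @ C_pinv                                   # projector onto the span of the columns of C

assert np.allclose(C_pinv, np.linalg.pinv(C))
assert np.allclose(P_C @ C, C)                     # the projector fixes the columns of C
```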
Spectral norm
Given an m-by-n matrix A, find k columns of A forming an m-by-k matrix C such that
||A - P_C A||_2 is minimized over all O(n^k) possible choices for C.
Remarks:
1. P_C A is the projection of A onto the subspace spanned by the columns of C.
2. The spectral (or 2-) norm of an m-by-n matrix X is ||X||_2 = max_{y ≠ 0} ||Xy||_2 / ||y||_2.
A lower bound for the CSS problem
For any m-by-k matrix C consisting of at most k columns of A,
||A - P_C A||_2 ≥ ||A - Ak||_2.
Remarks:
1. This is also true if we replace the spectral norm by the Frobenius norm.
2. This is a – potentially – weak lower bound.
Prior work: numerical linear algebra
Numerical Linear Algebra algorithms for CSS
1. Deterministic, typically greedy approaches.
2. Deep connection with the Rank Revealing QR factorization.
3. Strongest results so far (spectral norm): in O(mn^2) time, bounds of the form
||A - P_C A||_2 ≤ p(k,n) ||A - Ak||_2, for some function p(k,n).
4. Strongest results so far (Frobenius norm): in O(n^k) time.
Work on improving p(k,n): 1965 – today.
Prior work: theoretical computer science
Theoretical Computer Science algorithms for CSS
1. Randomized approaches, with some failure probability.
2. More than k columns are picked, e.g., O(poly(k)) columns.
3. Very strong bounds for the Frobenius norm in low polynomial time.
4. Not many spectral norm bounds…
The strongest Frobenius norm bound
Given an m-by-n matrix A, there exists an O(mn^2) algorithm that picks
at most O(k log k / ε^2) columns of A
such that, with probability at least 1 - 10^{-20},
||A - P_C A||_F ≤ (1 + ε) ||A - Ak||_F.
The CX algorithm
Input:
m-by-n matrix A,
0 < ε < 1, the desired accuracy
Output:
C, the matrix consisting of the selected columns
CX algorithm
• Compute probabilities pj summing to 1
• Let c = O(k log k / ε^2)
• For each j = 1,2,…,n, pick the j-th column of A with probability min{1, c·pj}
• Let C be the matrix consisting of the chosen columns
(C has – in expectation – at most c columns)
Subspace sampling (Frobenius norm)
Vk: orthogonal matrix containing the top k right singular vectors of A.
Sk: diagonal matrix containing the top k singular values of A.
Remark: The rows of Vk^T are orthonormal vectors, but its columns (Vk^T)^{(j)} are not.
Subspace sampling in O(mn^2) time:
pj ∝ ||(Vk^T)^{(j)}||_2^2, normalized so that the pj sum up to 1.
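A hedged sketch of this column-sampling step (random A, illustrative k and c; the constant in c does not match the theorem): probabilities proportional to the squared column norms of Vk^T, each column kept independently with probability min{1, c·pj}.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 500, 200, 5
A = rng.normal(size=(m, n))

_, _, Vt = np.linalg.svd(A, full_matrices=False)
Vkt = Vt[:k, :]                                    # top-k right singular vectors, k-by-n
p = np.sum(Vkt ** 2, axis=0)                       # squared lengths of the columns of Vk^T
p /= p.sum()                                       # normalize so the pj sum up to 1

c = 60                                             # stands in for O(k log k / eps^2)
keep = rng.random(n) < np.minimum(1.0, c * p)      # keep column j with probability min{1, c*pj}
C = A[:, keep]

err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 'fro')
print(C.shape[1], "columns kept, ||A - CC+A||_F =", round(err, 2))
```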
Prior work in TCS
Drineas, Mahoney, and Muthukrishnan 2005
• O(mn^2) time, O(k^2/ε^2) columns
Drineas, Mahoney, and Muthukrishnan 2006
• O(mn^2) time, O(k log k/ε^2) columns
Deshpande and Vempala 2006
• O(mnk^2) time and O(k^2 log k/ε^2) columns
• They also prove the existence of k columns of A forming a matrix C, such that
||A - P_C A||_F ≤ (k+1)^{1/2} ||A - Ak||_F.
• Compare to the prior best existence result.
Open problems
Design:
• Faster algorithms (next slide)
• Algorithms that achieve better approximation guarantees (a hybrid approach)
Prior work spanning NLA and TCS
Woolfe, Liberty, Rokhlin, and Tygert 2007
(also Martinsson, Rokhlin, and Tygert 2006)
• O(mn log k) time, k columns, same spectral norm bounds as prior work
• Beautiful application of the Fast Johnson-Lindenstrauss transform of Ailon-Chazelle
A hybrid approach
(Boutsidis, Mahoney, and Drineas ’07)
Given an m-by-n matrix A (assume m ≥ n for simplicity):
• (Randomized phase) Run a randomized algorithm to pick c = O(k log k) columns.
• (Deterministic phase) Run a deterministic algorithm on the above columns* to pick
exactly k columns of A and form an m-by-k matrix C.
* Not so simple …
Comparison: Frobenius norm
Our algorithm runs in O(mn^2) time and satisfies the bound, with probability at least 1 - 10^{-20}.
1. We provide an efficient algorithmic result.
2. We guarantee a Frobenius norm bound that is at most (k log k)^{1/2} worse than the best known
existential result.
Comparison: spectral norm
Our algorithm runs in O(mn^2) time and satisfies the bound, with probability at least 1 - 10^{-20}.
1. Our running time is comparable with NLA algorithms for this problem.
2. Our spectral norm bound grows as a function of (n-k)^{1/4} instead of (n-k)^{1/2}!
3. Do notice that with respect to k our bound is k^{1/4} log^{1/2} k worse than previous work.
4. To the best of our knowledge, our result is the first asymptotic improvement of the work of
Gu & Eisenstat 1996.
Randomized phase: c = O(k log k) columns
• Compute probabilities pj summing to 1
• For each j = 1,2,…,n, pick the j-th column of A with probability min{1, c·pj}
• Let C be the matrix consisting of the chosen columns
(C has – in expectation – at most c columns)
Subspace sampling
Vk: orthogonal matrix containing the top k right singular vectors of A.
Sk: diagonal matrix containing the top k singular values of A.
Remark: We need more elaborate subspace sampling probabilities than previous work
(computable in O(mn^2) time, normalized so that the pj sum up to 1).
Deterministic phase: k columns
Deterministic phase
• Let S1 be the set of indices of the columns selected by the randomized phase.
• Let (Vk^T)_{S1} denote the set of columns of Vk^T with indices in S1.
(An extra technicality is that the columns of (Vk^T)_{S1} must be rescaled …)
• Run a deterministic NLA algorithm on (Vk^T)_{S1} to select exactly k columns.
(Any algorithm with p(k,n) = k^{1/2}(n-k)^{1/2} will do.)
• Let S2 be the set of indices of the selected columns (the cardinality of S2 is exactly k).
• Return A_{S2} (the columns of A corresponding to indices in S2) as the final output.
Back to SNPs: CHB and JPT
Let A be the 90-by-2.7 million matrix of the CHB and JPT populations in HapMap.
Can we find actual SNPs that
capture the information in the
top two left singular vectors?
Results
Number of SNPs   Misclassifications
40 (c = 400)     6
50 (c = 500)     5
60 (c = 600)     3
70 (c = 700)     1
• Essentially as good as the best existing metric (informativeness).
• However, our metric is unsupervised!
(Informativeness is supervised: it essentially identifies SNPs that are correlated with
population membership, given such membership information.)
• The fact that we can select ancestry-informative SNPs in an unsupervised manner based on
PCA is novel, and seems interesting.
Overview
x • Datasets in the form of matrices (and tensors)
• Matrix Decompositions
x Singular Value Decomposition (SVD)
x Column-based Decompositions (CX, interpolative decomposition)
CUR-type decompositions
Non-negative matrix factorization
Semi-Discrete Decomposition (SDD)
Maximum-Margin Matrix Factorization (MMMF)
Tensor decompositions
• Regression
Coreset constructions
Fast algorithms for least-squares regression
CUR-type decompositions
A ≈ C U R, where C consists of O(1) columns of A, R consists of O(1) rows of A, and U is a
carefully chosen small matrix.
Goal: make (some norm of) A - CUR small.
For any matrix A, we can find C, U, and R such that the norm of A - CUR is almost equal to the
norm of A - Ak.
This might lead to a better understanding of the data.
Theorem: relative error CUR
(Drineas, Mahoney, & Muthukrishnan '06, '07)
For any k, O(mn^2) time suffices to construct C, U, and R s.t.
||A - CUR||_F ≤ (1 + ε) ||A - Ak||_F
holds with probability at least 1 - δ, by picking
O(k log k log(1/δ) / ε^2) columns, and
O(k log^2 k log(1/δ) / ε^6) rows.
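A hedged CUR sketch (uniform sampling for simplicity; the theorem above relies on more careful, subspace-based sampling probabilities, so this only illustrates the shape of the decomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(300, 200))
c = r = 40                                          # number of sampled columns / rows

cols = rng.choice(A.shape[1], size=c, replace=False)
rows = rng.choice(A.shape[0], size=r, replace=False)
C, R = A[:, cols], A[rows, :]

U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)       # one "carefully chosen" middle matrix
print(np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro'))
```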
From SVD to CUR
Exploit structural properties of CUR to analyze data (A: m objects by n features).
A CUR-type decomposition needs O(min{mn^2, m^2 n}) time.
Instead of reifying the Principal Components:
• Use PCA (a.k.a. SVD) to find how many Principal Components are needed to “explain”
the data.
• Run CUR and pick columns/rows instead of eigen-columns and eigen-rows!
• Assign meaning to actual columns/rows of the matrix! Much more intuitive! Sparse!
CUR decompositions: a summary
Goreinov, Tyrtyshnikov, and Zamarashkin (LAA '97, Cont. Math. '01)
  C: columns that span max volume; U: W+; R: rows that span max volume.
  Existential result; error bounds depend on ||W+||_2; spectral norm bounds!
G.W. Stewart (Num. Math. '99, TR '04)
  C: variant of the QR algorithm; R: variant of the QR algorithm; U: minimizes ||A-CUR||_F.
  No a priori bounds; solid experimental performance.
Williams and Seeger (NIPS '01)
  C: uniformly at random; U: W+; R: uniformly at random.
  Experimental evaluation; A is assumed PSD; connections to the Nystrom method.
Drineas, Kannan, and Mahoney (SODA '03, '04)
  C: w.r.t. column lengths; U: in linear/constant time; R: w.r.t. row lengths.
  Randomized algorithm; provable, a priori bounds; explicit dependency on A - Ak.
Drineas, Mahoney, and Muthu ('05, '06)
  C: depends on singular vectors of A; U: (almost) W+; R: depends on singular vectors of C.
  (1+ε) approximation to A - Ak; computable in SVDk(A) time.
Data applications of CUR
CMD factorization
(Sun, Xie, Zhang, and Faloutsos ’07, best paper award in SIAM Conference on Data Mining ‘07)
A CUR-type decomposition that avoids duplicate rows/columns that might appear in some
earlier versions of CUR-type decomposition.
Many interesting applications to large network datasets, DBLP, etc.; extensions to tensors.
Fast computation of Fourier Integral Operators
(Demanet, Candes, and Ying ’06)
Application in seismology imaging data (PBytes of data can be generated…)
The problem boils down to solving integral equations, i.e., matrix equations after discretization.
CUR-type structures appear; uniform sampling seems to work well in practice.
Overview
x • Datasets in the form of matrices (and tensors)
• Matrix Decompositions
x Singular Value Decomposition (SVD)
x Column-based Decompositions (CX, interpolative decomposition)
x CUR-type decompositions
Non-negative matrix factorization
Semi-Discrete Decomposition (SDD)
Maximum-Margin Matrix Factorization (MMMF)
Tensor decompositions
• Regression
Coreset constructions
Fast algorithms for least-squares regression
Decompositions that respect the data
Non-negative matrix factorization
(Lee and Seung '00)
Assume that the Aij are non-negative for all i,j.
The Non-negative Matrix Factorization: A ≈ BX, where B and X are constrained to have only
non-negative entries as well.
This should respect the structure of the data better than Ak = Uk Sk Vk^T, which introduces a
lot of (difficult to interpret) negative entries.
The Non-negative Matrix Factorization
It has been extensively applied to:
1. Image mining (Lee and Seung '00)
2. The Enron email collection (Berry and Brown '05)
3. Other text mining tasks (Berry and Plemmons '04)
Algorithms for NMF (a small sketch of the multiplicative updates follows below):
1. Multiplicative update rules (Lee and Seung '00, Hoyer '02)
2. Gradient descent (Hoyer '04, Berry and Plemmons '04)
3. Alternating least squares (dating back to Paatero '94)
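A minimal sketch of the multiplicative update rules (random non-negative data and random initialization; sizes and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 80))              # non-negative data matrix
k = 10
B = rng.random((100, k))               # the two non-negative factors, A ~ B X
X = rng.random((k, 80))

eps = 1e-10                            # guards against division by zero
for _ in range(200):
    X *= (B.T @ A) / (B.T @ B @ X + eps)
    B *= (A @ X.T) / (B @ X @ X.T + eps)

print(np.linalg.norm(A - B @ X, 'fro') / np.linalg.norm(A, 'fro'))
```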
Algorithmic challenges for NMF
Algorithmic challenges for the NMF:
1. NMF (as stated above) is convex given B or X, but not if both are unknown.
2. No unique solution: many matrices B and X minimize the error.
3. Other optimization objectives could be chosen (e.g., spectral norm, etc.).
4. NMF becomes harder if sparsity constraints are included (e.g., X has a small number of non-zeros).
5. For the multiplicative update rules there exists some theory proving that they converge to a
fixed point; this might be a local optimum or a saddle point.
6. Little theory is known for the other algorithms.
Overview
x • Datasets in the form of matrices (and tensors)
• Matrix Decompositions
x Singular Value Decomposition (SVD)
x Column-based Decompositions (CX, interpolative decomposition)
x CUR-type decompositions
x Non-negative matrix factorization
Semi-Discrete Decomposition (SDD)
Maximum-Margin Matrix Factorization (MMMF)
Tensor decompositions
• Regression
Coreset constructions
Fast algorithms for least-squares regression
SemiDiscrete Decomposition (SDD)
A_SDD = Xk Dk Yk^T, where
Dk: diagonal matrix,
Xk, Yk: all entries are in {-1, 0, +1}.
SDD identifies regions of the matrix that have homogeneous density.
SemiDiscrete Decomposition (SDD)
SDD looks for blocks of similar height towers
and similar depth holes: “bump hunting”.
Applications include image compression and
text mining.
O’Leary and Peleg ’83, Kolda and O’Leary ’98, ’00, O’Leary and Roth ’06
The figures are from D. Skillicorn's book on Data Mining with Matrix Decompositions.
Overview
x • Datasets in the form of matrices (and tensors)
• Matrix Decompositions
x Singular Value Decomposition (SVD)
x Column-based Decompositions (CX, interpolative decomposition)
x CUR-type decompositions
x Non-negative matrix factorization
x Semi-Discrete Decomposition (SDD)
Maximum-Margin Matrix Factorization (MMMF)
Tensor decompositions
• Regression
Coreset constructions
Fast algorithms for least-squares regression
Collaborative Filtering and MMMF
User ratings for movies
Goal: predict unrated movies (?)
Maximum Margin Matrix Factorization (MMMF)
A novel, semi-definite-programming-based matrix decomposition that seems to perform very well
on real data, including the Netflix challenge.
Srebro, Rennie, and Jaakkola ’04, Rennie and Srebro ’05
Some pictures are from Srebro’s presentation in NIPS ’04.
A linear factor model
Model the rating matrix Y as the product of two factors: a matrix whose rows collect, for all
users, each user's biases for the different movie attributes, times a (transposed) matrix of
movie attributes.
(Possible) solution to collaborative filtering: fit a rank (exactly) k matrix X to Y.
Fully observed Y ⇒ X is the best rank-k approximation to Y.
Azar, Fiat, Karlin, McSherry, and Saia ’01, Drineas, Kerenidis, and Raghavan ’02
Imputing the missing entries via SVD
(Achlioptas and McSherry ’01, ’06)
Reconstruction Algorithm
• Compute the SVD of the matrix filling in the missing
entries with zeros.
• Some rescaling prior to computing the SVD is necessary,
e.g., multiply by 1/(fraction of observed entries).
• Keep the resulting top k principal components.
Under assumptions on the “quality” of the observed entries,
reconstruction accuracy bounds may be proven.
The error bounds scale with the Frobenius norm of the matrix.
A convex formulation
MMMF
• Focus on ±1 rankings (for simplicity).
• Fit a prediction matrix X = UV^T to the observations.
Objectives (CONVEX!)
• Minimize the total number of mismatches between the
observed data and the predicted data.
• Keep the trace norm of X small.
MMMF and SDP
MMMF may be formulated as a semi-definite program, and thus may be solved efficiently.
Bounding the factor contribution
Instead of a hard rank constraint (non-convex), a softer constraint is introduced:
the total number of contributing factors (number of columns/rows in U/V^T) is unbounded,
but their total contribution is bounded.
Overview
x • Datasets in the form of matrices (and tensors)
• Matrix Decompositions
x Singular Value Decomposition (SVD)
x Column-based Decompositions (CX, interpolative decomposition)
x CUR-type decompositions
x Non-negative matrix factorization
x Semi-Discrete Decomposition (SDD)
x Maximum-Margin Matrix Factorization (MMMF)
Tensor decompositions
• Regression
Coreset constructions
Fast algorithms for least-squares regression
Tensors
Tensors appear both in Math and CS.
• Connections to complexity theory (e.g., matrix multiplication complexity)
• Dataset applications (e.g., Independent Component Analysis, higher-order statistics, etc.)
Also, many practical applications, e.g., medical imaging, hyperspectral imaging, video,
psychology, chemometrics, etc.
However, there does not exist a definition of tensor rank (and associated
tensor SVD) with the – nice – properties found in the matrix case.
Tensor rank
A definition of tensor rank
Given a tensor, find the minimum number of rank-one tensors (outer products of vectors) into
which it can be decomposed.
+ agrees with matrix rank for d = 2
+ related to computing bilinear forms and algebraic complexity theory
BUT
- only weak bounds are known
- tensor rank depends on the underlying ring of scalars
- computing it is NP-hard
- successive rank-one approximations are no good
Tensor decompositions
Many tensor decompositions "matricize" the tensor:
1. PARAFAC, Tucker, Higher-Order SVD, DEDICOM, etc.
2. Most are computed via iterative algorithms (e.g., alternating least squares).
Given a multi-mode tensor, "unfold" it to create the "unfolded" matrix (e.g., an n1-by-n2-by-n3
tensor becomes an n1-by-(n2·n3) matrix), and then apply matrix methods; see the small sketch below.
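A tiny sketch of such an unfolding (illustrative sizes; this uses one particular mode-1 unfolding convention, other orderings are equally common):

```python
import numpy as np

T = np.random.rand(4, 5, 6)            # an n1-by-n2-by-n3 tensor
T1 = T.reshape(4, 5 * 6)               # mode-1 unfolding: an n1-by-(n2*n3) matrix

# Any matrix decomposition can now be applied to the unfolded tensor.
U, s, Vt = np.linalg.svd(T1, full_matrices=False)
print(T1.shape, s[:3])
```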
Useful links on tensor decompositions
• Workshop on Algorithms for Modern Massive Data Sets (MMDS) '06
http://www.stanford.edu/group/mmds/
Check the tutorial by Lek-Heng Lim on tensor decompositions.
• Tutorial by Faloutsos, Kolda, and Sun at the SIAM Data Mining Conference '07
• Tammy Kolda's web page
http://csmr.ca.sandia.gov/~tgkolda/
Overview
x • Datasets in the form of matrices (and tensors)
x • Matrix Decompositions
x Singular Value Decomposition (SVD)
x Column-based Decompositions (CX, interpolative decomposition)
x CUR-type decompositions
x Non-negative matrix factorization
x Semi-Discrete Decomposition (SDD)
x Maximum-Margin Matrix Factorization (MMMF)
x Tensor decompositions
• Regression
Coreset constructions
Fast algorithms for least-squares regression
Problem definition and motivation
In many applications (e.g., statistical data analysis and scientific computation), one has n
observations of the form (t_1, y_1), (t_2, y_2), …, (t_n, y_n).
Model y(t) (unknown) as a linear combination of d basis functions:
y(t) ≈ x_1 φ_1(t) + x_2 φ_2(t) + … + x_d φ_d(t).
A is an n x d "design matrix" (n >> d) with A_ij = φ_j(t_i).
In matrix-vector notation, y ≈ Ax.
Least-norm approximation problems
Recall a linear measurement model: b = Ax + noise.
In order to estimate x, solve min_x ||Ax - b||.
Application: data analysis in science
• First application: Astronomy
Predicting the orbit of the asteroid Ceres (in 1801!).
Gauss (1809) -- see also Legendre (1805) and Adrain (1808).
First application of "least squares optimization", and it runs in O(nd^2) time!
• Data analysis: Fit parameters of a biological, chemical, economical, physical
(astronomical), social, internet, etc. model to experimental data.
Norms of common interest
Let y = b and define the residual r = Ax - b.
Least-squares approximation: minimize ||r||_2 = (sum_i r_i^2)^{1/2}.
Chebyshev or mini-max approximation: minimize ||r||_∞ = max_i |r_i|.
Sum of absolute residuals approximation: minimize ||r||_1 = sum_i |r_i|.
Lp norms and their unit balls
Recall the Lp norm for p ≥ 1: ||x||_p = (sum_i |x_i|^p)^{1/p}.
Some inequality relationships include: ||x||_∞ ≤ ||x||_2 ≤ ||x||_1 ≤ n^{1/2} ||x||_2 ≤ n ||x||_∞.
Lp regression problems
We are interested in over-constrained Lp regression problems, n >> d.
Typically, there is no x such that Ax = b.
Want to find the “best” x such that Ax ≈ b.
Lp regression problems are convex programs (or better).
There exist poly-time algorithms.
We want to solve them faster.
Exact solution to L2 regression
Cholesky Decomposition:
If A is full rank and well-conditioned, decompose A^T A = R^T R, where R is upper triangular,
and solve the normal equations: R^T R x = A^T b.
QR Decomposition:
Slower but numerically stable, especially if A is rank-deficient.
Write A = QR, and solve R x = Q^T b.
Singular Value Decomposition:
Most expensive, but best if A is very ill-conditioned.
Write A = U S V^T, in which case x_OPT = A+ b = V S^{-1} U^T b (A+ is the pseudoinverse of A;
A x_OPT is the projection of b on the subspace spanned by the columns of A).
Complexity is O(nd^2) for all three, but the constant factors differ.
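A short sketch of the three exact solvers (random, well-conditioned A and b; for rank-deficient or ill-conditioned problems the caveats above apply):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, qr, svd

n, d = 1000, 10
A, b = np.random.rand(n, d), np.random.rand(n)

# (1) Normal equations via Cholesky: A^T A = R^T R, solve R^T R x = A^T b.
x_chol = cho_solve(cho_factor(A.T @ A), A.T @ b)

# (2) QR: A = QR, solve R x = Q^T b.
Q, R = qr(A, mode='economic')
x_qr = np.linalg.solve(R, Q.T @ b)

# (3) SVD: x_OPT = A+ b = V S^{-1} U^T b.
U, s, Vt = svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)

assert np.allclose(x_chol, x_qr) and np.allclose(x_qr, x_svd)
```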
Questions …
Approximation algorithms:
Can we approximately solve Lp regression faster than “exact” methods?
Core-sets (or induced sub-problems):
Can we find a small set of constraints such that solving the Lp regression
on those constraints gives an approximation to the original problem?
Randomized algorithms for Lp regression
Alg. 1: p=2, sampling (core-set), (1+ε)-approx, O(nd^2) time.
(Drineas, Mahoney, and Muthu '06, '07)
Alg. 2: p=2, projection (no core-set), (1+ε)-approx, O(nd log d) time.
(Sarlos '06; Drineas, Mahoney, Muthu, and Sarlos '07)
Alg. 3: p ∈ [1,∞), sampling (core-set), (1+ε)-approx, O(nd^5) + o("exact") time.
(DasGupta, Drineas, Harb, Kumar, & Mahoney '07)
Note: Clarkson '05 gets a (1+ε)-approximation for L1 regression in O*(d^{3.5}/ε^4) time.
He preprocessed [A,b] to make it "well-rounded" or "well-conditioned" and then sampled.
Algorithm 1: Sampling for L2 regression
Algorithm
1. Fix a set of probabilities p_i, i = 1…n, summing up to 1.
2. Pick the i-th row of A and the i-th element of b with probability min{1, r·p_i},
and rescale both by (1/min{1, r·p_i})^{1/2}.
3. Solve the induced problem.
Note: in expectation, at most r rows of A and r elements of b are kept.
Sampling algorithm for L2 regression
[Figure: the sampled rows of A and the corresponding "rows" of b, rescaled to account for undersampling.]
Our results for p=2
If the p_i satisfy a condition, then with probability at least 1 - δ, the residual norm of the
solution of the sampled problem is within a (1+ε) factor of the optimal residual norm; a second
bound (on how far the sampled solution is from x_opt) involves κ(A), the condition number of A.
The number r of sampled rows depends only on d, ε, and δ, not on n.
Notation
U(i): i-th row of U
ρ: rank of A
U: orthogonal matrix containing the left singular vectors of A.
Condition on the probabilities
The condition that the p_i must satisfy is, for some β ∈ (0,1]:
p_i ≥ β ||U(i)||_2^2 / ρ
(the ||U(i)||_2^2 are the squared lengths of the rows of the matrix of left singular vectors of A).
Notes:
• O(nd^2) time suffices (to compute probabilities and to construct a core-set).
• Important question:
Is O(nd^2) necessary? Can we compute the p_i's, or construct a core-set, faster?
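A hedged sketch of this row-sampling algorithm with probabilities based on the row norms of U (synthetic data; r is illustrative and no attempt is made to match the constants of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 10000, 20, 600
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

U, _, _ = np.linalg.svd(A, full_matrices=False)
p = np.sum(U ** 2, axis=1)                     # squared row norms of U
p /= p.sum()                                   # probabilities summing to 1

keep = rng.random(n) < np.minimum(1.0, r * p)  # keep row i with probability min{1, r*p_i}
scale = 1.0 / np.sqrt(np.minimum(1.0, r * p[keep]))
A_s, b_s = A[keep] * scale[:, None], b[keep] * scale

x_opt = np.linalg.lstsq(A, b, rcond=None)[0]
x_tilde = np.linalg.lstsq(A_s, b_s, rcond=None)[0]
print(np.linalg.norm(A @ x_tilde - b) / np.linalg.norm(A @ x_opt - b))
```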
The Johnson-Lindenstrauss lemma
Results for J-L:
• Johnson & Lindenstrauss ’84: project to a random subspace
• Frankl & Maehara ’88: random orthogonal matrix
• DasGupta & Gupta ’99: matrix with entries from N(0,1), normalized
• Indyk & Motwani ’98: matrix with entries from N(0,1)
• Achlioptas ’03: matrix with entries in {-1,0,+1}
• Alon ’03: optimal dependency on n, and almost optimal dependency on 
Fast J-L transform (1 of 2)
(Ailon & Chazelle ’06)
Fast J-L transform (2 of 2)
(Ailon & Chazelle ’06)
Multiplication of the vectors by PHD is "fast", since:
• (Du) is O(d) – since D is diagonal;
• (HDu) is O(d log d) – use Fast Fourier Transform-style algorithms;
• (PHDu) is O(poly(log n)) – P has on average O(poly(log n)) non-zeros per row.
O(nd logd) L2 regression
Fact 1: since Hn (the n-by-n Hadamard matrix) and Dn (an n-by-n diagonal matrix with ±1 entries
on the diagonal, chosen uniformly at random) are orthogonal matrices,
||HnDnAx - HnDnb||_2 = ||Ax - b||_2 for every x.
Thus, we can work with HnDnAx - HnDnb. Let's use our sampling approach…
Fact 2: Using a Chernoff-type argument, we can prove that the lengths of all the rows of the
left singular vectors of HnDnA are, with probability at least .9, nearly uniform.
DONE! We can perform uniform sampling in order to keep r = O(d log d / ε^2) rows of HnDnA;
our L2 regression theorem guarantees the accuracy of the approximation.
The running time is O(nd log d), since we can use the fast Hadamard-Walsh transform to
multiply Hn and DnA.
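A hedged sketch of this preconditioning idea (n is taken to be a power of two and scipy's dense Hadamard matrix is used for clarity; a fast Walsh-Hadamard transform would be needed to actually get the O(nd log d) running time claimed above):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(1)
n, d, r = 1024, 8, 200
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

D = rng.choice([-1.0, 1.0], size=n)            # random +-1 diagonal
H = hadamard(n) / np.sqrt(n)                   # normalized (orthogonal) Hadamard matrix
HDA, HDb = H @ (D[:, None] * A), H @ (D * b)   # rotation: row norms of the left singular vectors flatten out

idx = rng.choice(n, size=r, replace=False)     # uniform sampling now suffices
x_tilde = np.linalg.lstsq(HDA[idx], HDb[idx], rcond=None)[0]
x_opt = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(A @ x_tilde - b) / np.linalg.norm(A @ x_opt - b))
```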
Open problem: sparse approximations
Sparse approximations and l2 regression
(Natarajan ’95, Tropp ’04, ’06)
In the sparse approximation problem, we are given a d-by-n matrix A forming a redundant
dictionary for R^d and a target vector b ∈ R^d, and we seek to solve
min_x ||x||_0 subject to ||Ax - b||_2 ≤ ε.
In words, we seek a sparse, bounded-error representation of b in terms of the vectors in the
dictionary.
This is (sort of) under-constrained least squares regression.
Can we use the aforementioned ideas to get better and/or faster approximation
algorithms for the sparse approximation problem?
Application: feature selection for RLSC
Regularized Least Squares Regression (RLSC)
Given a term-document matrix A and a class label for each document, find x_opt that minimizes
the regularized objective ||Ax - c||_2^2 + λ||x||_2^2.
Here c is the vector of labels. For simplicity assume two classes, thus c_i = ±1.
Given a new document-vector q, its classification is determined by the sign of q^T x_opt.
Feature selection for RLSC
Is it possible to select a small number of actual features (terms) and apply RLSC only
on the selected terms without a huge loss in accuracy?
Well studied problem; supervised (they employ the class label vector c) algorithms exist.
We applied our L2 regression sampling scheme to select terms; unsupervised!
A smaller RLSC problem
TechTC data from ODP
(Gabrilovich and Markovitch ’04)
TechTC data
100 term-document matrices; average size ≈ 20,000 terms and ≈ 150 documents.
In prior work, feature selection was performed using a supervised metric called
information gain (IG), an entropic measure of correlation with class labels.
Conclusion of the experiments
Our unsupervised technique had (on average) comparable performance to IG.
Conclusions and future directions
Linear Algebraic techniques (e.g., matrix decompositions and regression) are
fundamental in data mining and information retrieval.
Randomized algorithms for linear algebra computations contribute novel results and
ideas, both from a theoretical as well as an applied perspective.
Important directions
• Faster algorithms
• More accurate algorithms
• Matching lower bounds
• Implementations and widely disseminated software
• Technology transfer to other scientific disciplines