Data-driven dictionary definition
for diverse document domains
Michael W. Mahoney
Yahoo! Research
http://www.cs.yale.edu/homes/mmahoney
(Joint work with P. Drineas, S. Muthukrishnan and others as listed.)
Modeling data and documents
People studying “documents,” or data more generally:
• put the data onto a graph or into a vector space
• even if the data don’t naturally or obviously live there
• and perform graph operations or vector space operations
• to extract information from the data.
Such data “documents” often have structure unrelated to the graphical or
linear algebraic structure implicit in the modeling.
• This non-modeled structure is difficult to formalize.
Practitioners often have extensive field-specific intuition about the data.
• This intuition is often used to choose “where the data live.”
• The choice of where the data live may capture non-modeled structure.
Documents modeled as matrices
Matrices often arise since n objects (“documents,” genomes, images, web
pages), each with m features, may be represented by an m x n matrix A.
Such “documents” often have structure:
• for linear structure, the SVD or PCA is often used,
• for non-linear structure, kernel-based (e.g., diffusion-based) methods are often used.
Note: We know what the rows/columns “mean” from the application area.
Goal: Develop principled provably-accurate algorithmic methods such that:
• they are agnostic with respect to any particular field,
• one can fruitfully couple them to the field-specific intuition,
• they perform well on complex non-toy data “documents”.
SVD of a matrix
Theorem: Any m x n matrix A can be decomposed as A = U Σ VT, where:
U (V): orthogonal matrix containing the left (right) singular vectors of A.
Σ: diagonal matrix containing the singular values of A, ordered non-increasingly.
ρ: rank of A, the number of non-zero singular values.
Data Application: Principal Components Analysis (PCA) is just the SVD (applied to mean-centered data).
Complexity: Exact computation of the SVD takes O(min{mn², m²n}) time.
SVD and low-rank approximations
Theorem: Truncate the SVD by keeping the top k ≤ ρ terms: Ak = Uk Σk VkT.
This gives another matrix Ak of the same dimensions that is the “best”
approximation to A among all rank-k matrices.
Interesting property: For future reference, note:
• The rows of Uk (= UA,k) are NOT orthogonal and are NOT unit length.
• The lengths/Euclidean norms of the rows of Uk capture a notion of information
dispersal.
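To make the notation concrete, here is a minimal numpy sketch of the truncated SVD and of the row lengths of Uk (the matrix A, its size, and k are placeholders):

```python
import numpy as np

# Toy "document" matrix: m features x n documents (placeholder data).
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 60))
k = 5

# Thin SVD: A = U @ diag(s) @ Vt, singular values in non-increasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the top k terms: A_k is the "best" rank-k approximation to A.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print("rank-k error:", np.linalg.norm(A - A_k, "fro"))

# The rows of U_k are neither orthogonal nor unit-length; their Euclidean
# lengths measure how dispersed the top-k subspace is over the rows of A.
row_lengths = np.linalg.norm(U[:, :k], axis=1)
print("row-length range:", row_lengths.min(), row_lengths.max())
```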
Rows of left singular vectors
What do the lengths of the rows of the n x k matrix U = UA “mean”?
Consider possible n x k matrices U of k left singular vectors:
• In|k = k columns from the identity: row lengths are 0 or 1; In|k x -> x.
• Hn|k = k columns from the n x n Hadamard (real Fourier) matrix: row lengths are all equal; Hn|k x -> maximally dispersed.
• Uk = k columns from any orthogonal matrix: row lengths are between 0 and 1.
The lengths of the rows of U = UA correspond to a notion of
information dispersal
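A small sketch, with placeholder sizes n = 16 and k = 4, contrasting the row lengths of the three kinds of column sets listed above (identity columns, Hadamard columns, and a generic orthonormal Uk):

```python
import numpy as np
from scipy.linalg import hadamard

n, k = 16, 4
rng = np.random.default_rng(1)

# k columns of the identity: row lengths are exactly 0 or 1 (fully localized).
I_nk = np.eye(n)[:, :k]

# k columns of the normalized n x n Hadamard matrix: all row lengths equal
# sqrt(k/n) (maximally dispersed).
H_nk = (hadamard(n) / np.sqrt(n))[:, :k]

# k columns of a generic orthogonal matrix: row lengths fall between 0 and 1.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
U_k = Q[:, :k]

for name, M in [("identity", I_nk), ("Hadamard", H_nk), ("generic", U_k)]:
    print(name, np.linalg.norm(M, axis=1).round(3))
```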
Problems with SVD/Eigen-Analysis
Problems arise since structure in the data is not respected by
mathematical operations on the data:
• Reification - the maximum-variance directions are just that (mathematical constructs); they need not correspond to anything physically meaningful.
• Interpretability - what does a linear combination of 6000 genes mean?
• Sparsity - is destroyed by orthogonalization.
• Non-negativity - is a convex notion, not a linear algebraic one.
The SVD gives a low-rank matrix approximation with a very particular
structure (think: rotation-with-truncation;rescaling;rotation-back-up).
Question: Do there exist “better” low-rank matrix approximations?
• “better” structural properties for certain applications.
• “better” at respecting relevant structure.
• “better” for interpretability and informing intuition.
Dictionaries for document analysis
Discrete Cosine Transform (DCT):
• fj = Σn=0,…,N-1 xn cos[π j (n + 1/2) / N],
• the basis is fixed,
• O(N²) or O(N log N) computation to determine the coefficients.
Singular Value Decomposition (SVD):
• A = Σi=1,…,ρ σi U(i) V(i)T = Σi=1,…,ρ σi A[i],
• O(N³) computation to determine the basis and the coefficients.
Many other more complex/expensive procedures depending on the application.
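A brief sketch contrasting the two kinds of dictionaries on placeholder data: a fixed basis (DCT coefficients via scipy, whose normalization differs from the formula above by a constant factor) versus a data-dependent basis (the SVD):

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(2)

# Fixed dictionary: DCT coefficients of a length-N signal.  The basis is known
# in advance, so only the coefficients need to be computed (O(N log N)).
x = rng.standard_normal(64)
f = dct(x, type=2)

# Data-dependent dictionary: the SVD basis depends on the matrix A itself,
# and computing it costs O(N^3) for an N x N matrix.
A = rng.standard_normal((64, 32))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
coeffs = U.T @ A   # coefficients of A's columns in the singular-vector basis
print(f[:3].round(3), s[:3].round(3))
```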
Question: Can actual data points and/or feature vectors be the dictionary?
• “Core-sets” on graphs.
• “CUR-decompositions” on matrices.
Dictionaries & the SVD
• A = U Σ VT = Σi=1,…,ρ σi U(i) V(i)T, where the U(i), V(i) are eigen-cols and eigen-rows.
• Approximate A(j) ≈ Σi=1,…,k zij U(i), choosing the zij to minimize || A(j) - Σi=1,…,k zij U(i) ||2.
• Z = UkT A --> A ≈ Ak = (Uk UkT) A: project onto the space of the top k eigen-cols.
• Z = Σk VkT --> A ≈ Ak = Uk (Σk VkT): approximate every column of A in terms of a small number of eigen-rows and a low-dimensional encoding matrix Σk.
Dictionaries & columns and rows
• A = CUR = Σij uij C(i) R(j), where U = W+ and W = the intersection of C and R, and where the C(i), R(j) are actual-cols and actual-rows.
• Approximate A(j) ≈ Σi=1,…,c yij C(i), choosing the yij to minimize || A(j) - Σi=1,…,c yij C(i) ||2.
• Y = C+A --> A ≈ PC A = (CC+)A: project onto the space of those c actual-cols.
• Y ≈ W+R --> A ≈ PC A ≈ C(W+R): approximate every column of A in terms of a small number of actual-rows and a low-dimensional encoding matrix U = W+.
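A minimal sketch of these two approximations, assuming for illustration that C and R are simply the first c columns and r rows of a placeholder low-rank-plus-noise matrix A (the actual choice of columns/rows is the subject of the algorithms below):

```python
import numpy as np

rng = np.random.default_rng(3)
# Placeholder low-rank-plus-noise "document" matrix.
A = rng.standard_normal((80, 5)) @ rng.standard_normal((5, 50)) \
    + 0.01 * rng.standard_normal((80, 50))
c, r = 10, 10

C = A[:, :c]             # c actual columns of A
R = A[:r, :]             # r actual rows of A
W = A[:r, :c]            # intersection of C and R

# Y = C^+ A: least-squares coefficients of every column of A in terms of C,
# so A ~ (C C^+) A is the projection onto the span of the chosen columns.
A_CX = C @ (np.linalg.pinv(C) @ A)

# CUR: replace the encoding C^+ A by W^+ R, so A ~ C (W^+ R).
A_CUR = C @ (np.linalg.pinv(W) @ R)

for name, Ahat in [("CX ", A_CX), ("CUR", A_CUR)]:
    print(name, np.linalg.norm(A - Ahat, "fro") / np.linalg.norm(A, "fro"))
```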
CX and CUR matrix decompositions
Def: A CX matrix decomposition is a low-rank approximation explicitly
expressed in terms of a small number of columns of the original matrix A.
Def: A CUR matrix decomposition is a low-rank approximation explicitly
expressed in terms of a small number of rows and columns of the original
matrix A.
[Diagram: A (m x n) ≈ C (O(1) columns of A) x U (small, carefully chosen) x R (O(1) rows of A).]
Problem formulation (1 of 3)
Consider (for now) just columns:
• Could ask to find the “best” k of n columns of A (by whatever measure-of-merit).
• Combinatorial problem - the trivial algorithm takes roughly n^k time.
• Probably NP-hard if k is not fixed.
Instead:
• Fix a rank parameter k.
• Let’s over-sample columns by a little (e.g., k+3, 10k, k², etc.).
• Get close (additive error or relative error) to the “best” rank-k approximation.
Note: Error and over-sampling are computational resources to exploit algorithmically.
Problem formulation (2 of 3)
Ques: Do there exist O(k), or O(k2), or …, columns s.t.:
||A-CC+A||2,F ≤ ||A-Ak||2,F + ε||A||F
Ans: Yes - and can find them in O(m+n) space and time after two passes over the data! (DFKVV99,DKM04)
Ques: Do there exist O(k), or O(k2), or …, columns s.t.:
||A-CC+A||2,F < (1+)-1||A-Ak||2,F + t||A||F
Ans: Yes - and can find them in O(m+n) space and time after t passes over the data! (RVW05,DM05)
Ques: Do there exist, and can we find, O(k), or O(k2), or …, columns s.t.:
||A-CC+A||F < (1+)||A-Ak||F
Ans: Yes - existential proof - no non-exhaustive algorithm given! (RVW05,DRVW06)
Ans: ...
Problem formulation (3 of 3)
Back to columns and rows:
Ques: Do there exist O(k), or O(k2), or …, columns and rows s.t.:
||A-CUR||2,F ≤ ||A-Ak||2,F + ε||A||F
Ans: Yes - lots of them, and can find them in O(m+n) space and time after two passes
over the data! (DK03,DKM04)
Note: “lots of them” since these are randomized Monte Carlo algorithms!
Ques: Do there exist O(k), or O(k2), or …, columns and rows s.t.:
||A-CUR||F < (1+)||A-Ak||F
Ans: …
Algorithm to select U, R, given C
Idea: approximate all columns of A as linear combinations of the “basis” columns in C.
Algorithm:
• Compute a good set of probabilities pi summing to 1; % DETAILS COMING UP
• Pick r rows i1,i2, … , ir of A w.r.t. the pi in i.i.d. trials.
• Let R be the r x n matrix containing these rows;
• Let Dtt = 1/(r·pit)^(1/2) for t = 1, …, r;
• Let W be the intersection of C and R;
Thm: Given C, in O(c²m + cmn) = O(mn) time, we can construct D and R s.t.
||A - CUR||F ≤ (1+ε)·||A - CC+A||F
holds with probability at least 1-δ. We need to pick r = O(c² log(1/δ) / ε²) rows.
Row-sampling probabilities pi
• Let U = UC be the m x c orthogonal matrix containing the left singular vectors of C.
• Let U(i) denote the i-th row of U.
• Sample row i with probability pi = ||U(i)||² / c (the squared row lengths of UC, normalized to sum to 1).
NOTE: U(i) is NOT unit-length and is NOT orthogonal to U(j) in general.
• We can compute these probabilities in O(c²m + mnc) = O(mn) time.
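A sketch of this row-sampling step on placeholder data; the probabilities are the normalized squared row lengths of UC as described above, and the exact form of the rescaled U below is my reading of the algorithm, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 120)) \
    + 0.01 * rng.standard_normal((200, 120))
C = A[:, rng.choice(A.shape[1], size=10, replace=False)]   # some given columns
c = C.shape[1]

# Row-sampling probabilities: squared row lengths of U_C, normalized by c
# (they sum to 1 since U_C has c orthonormal columns).
U_C = np.linalg.svd(C, full_matrices=False)[0]
p = np.sum(U_C**2, axis=1) / c

# Pick r rows i.i.d. with replacement w.r.t. p; D rescales by 1/sqrt(r * p_i).
r = 40
idx = rng.choice(A.shape[0], size=r, replace=True, p=p)
D = np.diag(1.0 / np.sqrt(r * p[idx]))
R = A[idx, :]
W = C[idx, :]

# U is (a rescaled) W^+, so that A ~ C U R.
U = np.linalg.pinv(D @ W) @ D
print(np.linalg.norm(A - C @ U @ R, "fro") / np.linalg.norm(A, "fro"))
```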
Algorithm to select C
(D., M., & Muthukrishnan ’06)
Idea: express all columns of A as linear combinations of the “basis” columns in C.
Algorithm:
• Compute a good set of probabilities pi summing to 1; % DETAILS COMING UP
• Pick c columns of A w.r.t. the pi in i.i.d. trials.
• Let C be the m x c matrix containing these columns;
Theorem: For any k, let Ak be the “best” rank-k approximation to A.
Then, in O(SVD(A)) time we can construct a matrix C consisting of c =
O(k² log(1/δ) / ε²) columns of A s.t.
||A - CC+A||F ≤ (1+ε)·||A - Ak||F
holds with probability at least 1-δ.
Column-sampling probabilities pi
k: rank parameter input to the algorithm
ρ: rank of A
Vk: top k right singular vectors of A
Σρ-k: bottom ρ-k singular values of A
Vρ-k: bottom ρ-k right singular vectors of A
• Sample column j of A with probability pj = (1/2)·||(Vk)(j)||² / k + (1/2)·||(Σρ-k Vρ-kT)(j)||² / ||A - Ak||F².
NOTE: In general, (Vk)(i) is NOT unit-length and is NOT orthogonal to (Vk)(j).
• We can compute these probabilities in O(SVD(A)) time.
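A sketch of computing these column-sampling probabilities from the SVD; the equal 1/2-1/2 mixture of the leverage term and the residual term is an assumption based on the definitions above, and the residual columns are used in place of Σρ-k Vρ-kT (they have the same column norms):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 70)) \
    + 0.05 * rng.standard_normal((100, 70))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k, :].T                                     # top k right singular vectors
resid = A - U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # A - A_k

# Mix "subspace" information (leverage w.r.t. V_k) with "size-of-A" information
# (how much of the residual lives on each column).
lev = np.sum(Vk**2, axis=1) / k                       # ||(V_k)_(j)||^2 / k
res = np.sum(resid**2, axis=0) / np.sum(resid**2)     # ||(A - A_k)^(j)||^2 / ||A - A_k||_F^2
p = 0.5 * lev + 0.5 * res                             # sums to 1
print(p.sum(), p[:5].round(4))
```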
Theorem: (relative error) CUR
Theorem: Fix any k, ε, δ. Then, there exists a Monte Carlo algorithm that
uses O(SVD(A)) time to find C and R and construct U s.t.
||A - CUR||F ≤ (1+ε)·||A - Ak||F
holds with probability at least 1-δ, by picking c = O( k² log(1/δ) / ε² )
columns and r = O( k⁴ log²(1/δ) / ε⁶ ) rows.
Proof: Really nice. We disentangle “subspace” information and “size-of-A”
information to get relative error bound. Skip for now.
(Current theory work: we can improve the sampling complexity to c, r = O(k·poly(1/ε, 1/δ)).)
(Current empirical work: we can usually choose c, r ≤ k+4.)
(Don’t worry about ε: choose ε = 1 if you want!)
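Putting the pieces together, a compact end-to-end sketch of the two-stage sampling the theorem describes, on placeholder data and with simplified constants (c and r are chosen by hand rather than by the theorem's formulas):

```python
import numpy as np

def cur_sketch(A, k, c, r, rng):
    """Two-stage sampling: c columns via subspace sampling, then r rows via leverage on C."""
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    resid = A - U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    p_col = 0.5 * np.sum(Vt[:k, :]**2, axis=0) / k \
          + 0.5 * np.sum(resid**2, axis=0) / np.sum(resid**2)
    C = A[:, rng.choice(n, size=c, replace=True, p=p_col)]

    U_C = np.linalg.svd(C, full_matrices=False)[0]
    p_row = np.sum(U_C**2, axis=1) / U_C.shape[1]
    rows = rng.choice(m, size=r, replace=True, p=p_row)
    D = np.diag(1.0 / np.sqrt(r * p_row[rows]))
    W, R = C[rows, :], A[rows, :]
    return C, np.linalg.pinv(D @ W) @ D, R      # the middle matrix is (almost) W^+, with rescaling

rng = np.random.default_rng(6)
A = rng.standard_normal((150, 8)) @ rng.standard_normal((8, 100)) \
    + 0.02 * rng.standard_normal((150, 100))
C, Umid, R = cur_sketch(A, k=8, c=16, r=40, rng=rng)
print(np.linalg.norm(A - C @ Umid @ R, "fro") / np.linalg.norm(A, "fro"))
```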
Previous CUR-type decompositions
Goreinov, Tyrtyshnikov, & Zamarashkin (LAA ’97, …):
• C: columns that span max volume; U: W+; R: rows that span max volume.
• Existential result; error bounds depend on ||W+||2; spectral norm bounds!
Berry, Stewart, & Pulatova (Num. Math. ’99, TR ’04, …):
• C: variant of the QR algorithm; R: variant of the QR algorithm; U: minimizes ||A-CUR||F.
• No a priori bounds; A must be known to construct U; solid experimental performance.
Williams & Seeger (NIPS ’01, …):
• C: uniformly at random; U: W+; R: uniformly at random.
• Experimental evaluation; A is assumed PSD; connections to the Nystrom method.
D., M., & Kannan (SODA ’03, TR ’04, SICOMP ’06):
• C: w.r.t. column lengths; U: in linear/constant time; R: w.r.t. row lengths.
• “Sketching” massive matrices; provable, a priori bounds; explicit dependency on A – Ak.
D., M., & Muthukrishnan (TR ’06):
• C: depends on singular vectors of A; U: (almost) W+; R: depends on singular vectors of C.
• (1+ε) approximation to A – Ak; computable in low polynomial time (suffices to compute SVD(A)).
(For details see Drineas & Mahoney, “A Randomized Algorithm for a Tensor-Based Generalization of the SVD”, ‘05.)
Nonnegative Matrix Factorization (NMF)
Problem definition:
Given an m x n matrix A with non-negative entries and a number c << n: find an m x c
matrix W and a c x n matrix H such that all entries of W and H are non-negative
and such that A ≈ WH.
Typical (non-convex) optimization objective: minW,H || A – WH ||F²
References: Paatero & Tapper, Chemometrics ’94
Lee & Seung, Nature ’00
A lot of recent work by M. Berry, B. Plemmons, P. Hoyer, etc..
Motivation: respect the non-negative structure in the input matrix.
Observation: Why not use actual columns or rows in the decomposition?
Refs:
Work with Lek-Heng Lim and Petros Drineas ‘05.
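For concreteness, a minimal sketch of the standard multiplicative-update iteration for this Frobenius objective (the Lee & Seung updates); the matrix, its size, c, and the iteration count are placeholders:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.random((60, 40))           # non-negative input matrix
c = 5                              # target inner dimension, c << n

# Random non-negative initialization.
W = rng.random((A.shape[0], c))
H = rng.random((c, A.shape[1]))

eps = 1e-9
for _ in range(200):
    # Multiplicative updates keep W, H non-negative and decrease ||A - WH||_F^2.
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)

print(np.linalg.norm(A - W @ H, "fro") / np.linalg.norm(A, "fro"))
```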
Applications of CX/CUR to diverse data documents
Current application areas for CUR-based analysis:
• Term-document matrices: (with Yahoo people).
• User-group matrices: (with Yahoo people).
• Recommendation Systems: (with Yahoo people).
• DNA microarray: (with O. Alter).
• Functional MRI data: (with F. Meyer).
• DNA SNP data: (with P. Paschou and K. Kidd).
• Hyperspectral Image data: (with M. Maggioni and R. Coifman).
CUR data application: DNA tagging-SNPs
(data from K. Kidd’s lab at Yale University, joint work with Dr. Paschou at Yale University)
Single Nucleotide Polymorphisms: the most common type of genetic variation in the
genome across different individuals.
They are known locations in the human genome where two alternate nucleotide bases
(alleles) are observed (out of A, C, G, T).
(Table: rows = individuals, columns = SNPs.)
… AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA …
There are ∼10 million SNPs in the human genome, so this table could have ~10 million columns.
Why are SNPs important
SNPs occur quite frequently within the genome, allowing the tracking of disease
genes and population histories.
• Mapping the whole genome sequence of a single individual is very expensive.
• Mapping all the SNPs is also quite expensive, but the costs are dropping fast.
HapMap project (~$10⁸ in funding from NIH and other sources):
Map all ~10⁷ SNPs for ~400 individuals from 4 different populations, in order to
create a “genetic map” to be used by researchers.
Funding from pharmaceutical companies, NSF, the Department of Justice*, etc.
*Is it possible to identify the ethnicity of a suspect from his DNA?
Research directions
Research questions (working within a population):
(i) Are different SNPs correlated, within or across populations?
(ii) Find a “good” set of tagging-SNPs capturing the diversity of a chromosomal region of the human genome.
(iii) Find a set of individuals that capture the diversity of a chromosomal region.
(iv) Is extrapolation feasible?
Why?
• Understand structural properties of the human genome.
• Save time/money by assaying only the tSNPs and predicting the rest.
• Save time/money by running (drug) tests only on the cell lines of the selected individuals.
Existing literature
Pairwise metrics of SNP correlation, called LD (linkage disequilibrium) distance, based on nucleotide frequencies and
co-occurrences.
Almost no metrics exist for measuring correlation between more than 2 SNPs and LD is very difficult to generalize.
Exhaustive and semi-exhaustive algorithms in order to pick “good” ht-SNPs that have small LD distance with all other
SNPs.
Using Linear Algebra: an SVD based algorithm was proposed by Lin & Altman, Am. J. Hum. Gen. 2004.
The SNP data we examined
• Samples from 38 different populations.
• Average size 50 subjects/population.
• For each subject 63 SNPs were assayed.
• These SNPs were drawn from a chromosomal region which is roughly 900,000 bp long.
• This region is close to the end of the long arm of chromosome 17.
• At each SNP location two alternate nucleotide bases (alleles) are observed (so we
use genotype and not haplotype information).
(Same genotype table as above: rows = individuals, columns = SNPs.)
[Map of the 38 sampled populations, grouped by region (Africa, Europe, NW Siberia, NE Siberia, SW Asia, E Asia, N America, S America, Oceania); legend: N > 50 or N: 25-50 subjects. Populations: Finns, Komi Zyrian, Yakut, Khanty, Irish, European (Mixed), Danes, Chuvash, Russians, African Americans, Jews (Ashkenazi), Adygei, Druze, Samaritans, Pima (Arizona), Cheyenne, Chinese (Hakka), Japanese, Han, Chinese (Taiwan), Cambodians, Jews (Yemenite), Ibo, Maya, Atayal, Hausa, Yoruba, Pima (Mexico), Ami, Biaka, Jews (Ethiopian), Mbuti, Ticuna, Micronesians, Chagga, Nasioi, Surui, Karitiana.]
Encoding the SNP data into a matrix
• Exactly two nucleotides (out of A,G,C,T) appear in each column of the data matrix.
• Thus, the two alleles might be both equal to the first one (encode by +1), both equal
to the second one (encode by -1), or different (encode by 0).
(Table: rows = individuals, columns = SNPs, with the +1/0/-1 encoding.)
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0 1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1 0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 1 -1 -1 1
-1 -1 -1 1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0 -1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 -1 -1 1
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1 1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 0 -1 -1 1
Note: The order of the alleles is irrelevant (i.e., TG is the same as GT).
Note: Encoding, e.g., GG to +1 and TT to -1 is not any different (for our purposes)
from encoding GG to -1 and TT to +1.
(Flipping the signs of the columns of a matrix does not affect our techniques.)
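A small sketch of this encoding, with a hypothetical helper name and toy genotype strings; real data would also need handling of missing calls:

```python
import numpy as np

def encode_genotypes(genotypes):
    """Encode rows of genotype strings (e.g. ['AG', 'CC', ...]) into +1/0/-1 per SNP."""
    geno = np.array(genotypes)                 # individuals x SNPs, strings like 'AG'
    encoded = np.zeros(geno.shape, dtype=int)
    for j in range(geno.shape[1]):
        alleles = sorted({base for g in geno[:, j] for base in g})  # the observed bases
        first = alleles[0]
        for i, g in enumerate(geno[:, j]):
            if g[0] == g[1]:
                encoded[i, j] = 1 if g[0] == first else -1   # homozygous: +1 or -1
            else:
                encoded[i, j] = 0                            # heterozygous: 0
    return encoded

# Which allele gets +1 is arbitrary; flipping the sign of a column changes nothing.
print(encode_genotypes([['AG', 'CC'], ['AA', 'CT'], ['GG', 'TT']]))
```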
Evaluating (linear) structure
• For each population, we ran the SVD to determine the “optimal” number k of principal
components that are necessary in order to cover (for example) 90% of its spectrum.
• That is, if we select the top k left singular vectors Uk, we can express every column (i.e.,
every SNP) of A as a linear combination of the top k left singular vectors (i.e., eigen-SNPs)
and lose at most 10% of the “information”.
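A small sketch of this choice of k, assuming that "cover 90% of the spectrum" means 90% of the sum of squared singular values (i.e., of ||A||F²); the data matrix is a placeholder:

```python
import numpy as np

rng = np.random.default_rng(8)
# Placeholder standing in for one population's encoded individuals x SNPs matrix.
A = rng.standard_normal((50, 63))

s = np.linalg.svd(A, compute_uv=False)
frac = np.cumsum(s**2) / np.sum(s**2)          # cumulative fraction of the spectrum
k = int(np.searchsorted(frac, 0.90) + 1)       # smallest k covering 90%
print(k, frac[k - 1])
```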
[Diagram: expressing columns of A as linear combinations of the top k left singular vectors vs. expressing columns of A as linear combinations of a few actual columns of A.]
• BUT: we do NOT want eigen-SNPs, or eigen-people for that matter.
• Is it possible to pick a small number (e.g., roughly k) of columns of A and express every
column (i.e., SNP) of A as a linear combination of the picked columns (matrix C), losing at
most 10% of the information in the matrix?
Fast algorithms to select good SNPs
We ran various algorithms to select “good” columns (i.e., ht-SNPs).
A greedy Multi-Pass heuristic scheme gave the best results.
• Select one column in each round, subtracting from A the projection of A on this
column and repeating.
• Provable quality-of-approximation bounds exist for similar algorithms.
Nice feature: SVD provides a non-trivial (maybe not achievable) lower bound.
• In many cases, the lower bound is attained by the greedy heuristic!
• In our data, at most k+4 columns suffice to extract 90% of the structure.
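A sketch of a greedy multi-pass selection of this kind, assuming each round picks the column that captures the most of the current residual and then subtracts its projection; the exact selection rule used in the experiments may differ:

```python
import numpy as np

def greedy_columns(A, num_cols):
    """Greedy multi-pass selection: pick a column, subtract A's projection onto it, repeat."""
    E = A.astype(float).copy()                        # current residual
    chosen = []
    for _ in range(num_cols):
        # Score each column by the residual mass captured when projecting onto it.
        col_norms = np.maximum(np.linalg.norm(E, axis=0), 1e-12)
        scores = np.linalg.norm(E.T @ E, axis=0) / col_norms
        j = int(np.argmax(scores))
        chosen.append(j)
        q = E[:, j] / col_norms[j]
        E -= np.outer(q, q @ E)                       # deflate: remove the chosen direction
    return chosen

rng = np.random.default_rng(9)
A = rng.standard_normal((50, 10)) @ rng.standard_normal((10, 63)) \
    + 0.01 * rng.standard_normal((50, 63))
cols = greedy_columns(A, num_cols=12)
C = A[:, cols]
err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, "fro") / np.linalg.norm(A, "fro")
print(sorted(cols), round(err, 4))
```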
[Figure: per-population results, grouped by region: America, Oceania, Asia, Europe, Africa.]
Extrapolation using both SNPs and subjects
Given a small number of SNPs for all subjects, and all SNPs for some judiciously
chosen subjects, extrapolate the values of the missing SNPs.
• “Training” data: for a few judiciously chosen subjects, we are given all SNPs - BUT we choose which subjects to keep.
• SNP sample: for all subjects, we are given a small number of SNPs - BUT we choose these SNPs by looking at the whole matrix A.
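A sketch of this extrapolation using the CUR-style reconstruction A ≈ C W+ R: C holds the assayed SNP columns for all subjects, R holds all SNPs for the chosen subjects, and W is their intersection; the data, sizes, and rounding rule are placeholders:

```python
import numpy as np

rng = np.random.default_rng(10)
m, n = 200, 63                                  # subjects x SNPs (encoded +1/0/-1)
A_true = np.sign(rng.standard_normal((m, 8)) @ rng.standard_normal((8, n)))

chosen_snps = rng.choice(n, size=12, replace=False)      # assayed for everyone
chosen_subjects = rng.choice(m, size=15, replace=False)  # fully assayed "training" subjects

C = A_true[:, chosen_snps]                 # known: a few SNPs for all subjects
R = A_true[chosen_subjects, :]             # known: all SNPs for a few subjects
W = A_true[np.ix_(chosen_subjects, chosen_snps)]

# Extrapolate every entry, then round back to the nearest genotype code.
A_hat = np.clip(np.rint(C @ np.linalg.pinv(W) @ R), -1, 1)

mask = np.ones((m, n), dtype=bool)         # evaluate only on the truly unassayed entries
mask[chosen_subjects, :] = False
mask[:, chosen_snps] = False
print("agreement on unassayed entries:", np.mean(A_hat[mask] == A_true[mask]))
```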
CUR data application: image analysis
(with M. Maggioni and R. Coifman at Yale)
Goal: Extract structure from temporally-resolved or spectrally-resolved images of
medical interest, using a small number of samples (images and/or pixels).
[Diagram: an m x n x p tensor A (modes 1, 2, 3), or equivalently an mn x p matrix A.]
Note: A temporally or spectrally
resolved image may be viewed as a
tensor (naively, a dataset subscripted by
multiple indices) or as a matrix (whose
columns have internal structure that is
not modeled).
CUR applied to resolved images
• Pick a constant number of rows or “slabs” of A (e.g., 2 of the p time steps); let R consist of the sampled slabs.
• Express the remaining images as linear combinations of the sampled “slabs”.
• Pick a constant number of columns or “fibers” of A (the red dotted lines in the figure).
• Express the remaining slabs as linear combinations of the sampled slabs, fitting the coefficients by least squares over the sampled fibers only.
Note: The chosen images are a dictionary from the data to express every image.
Note: The chosen pixels are a dictionary from the data to express every pixel.
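A sketch of that reconstruction step: fit a slab to the sampled basis slabs using only the sampled pixels/fibers, then predict every pixel; the sizes and indices are placeholders that echo the "slabs 41 and 50, 1000 of 250K pixels" example described below:

```python
import numpy as np

rng = np.random.default_rng(11)
num_pixels, num_slabs = 250_000 // 100, 80       # scaled-down placeholder sizes
data = rng.standard_normal((num_pixels, 2)) @ rng.standard_normal((2, num_slabs)) \
       + 0.01 * rng.standard_normal((num_pixels, num_slabs))

basis_idx = [41, 50]                              # the sampled slabs (columns)
fiber_idx = rng.choice(num_pixels, size=1000, replace=False)   # the sampled pixels (rows)

B = data[:, basis_idx]               # basis slabs, known in full
target = data[:, 65]                 # a slab we pretend to know only on the sampled fibers

# Least-squares fit of the target slab to the basis, using only the sampled pixels...
coef, *_ = np.linalg.lstsq(B[fiber_idx, :], target[fiber_idx], rcond=None)
# ...then reconstruct the slab at every pixel.
reconstructed = B @ coef
print(np.linalg.norm(reconstructed - target) / np.linalg.norm(target))
```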
Absorption/transmittance and non-uniform sampling probabilities
Eigen-analysis of slabs and fibers
Look at the exact 65-th (or any other) slab.
The 65-th slab approximately reconstructed
This slab was reconstructed by an approximate least-squares fit to the basis from slabs 41 and 50, using 1000 (of 250K) pixels/fibers.
Tissue Classification - Exact Data
Tissue Classification - Ns=12 & Nf=1000
Conclusions
CUR matrix decompositions provide a data-driven dictionary-definition
mechanism for diverse “document” domains.
• They provide a low-rank approximation in terms of the actual columns and rows of the matrix.
• They take advantage of field-specific intuition for improved analysis of moderately large data.
• They rely on approximate least-squares fitting to the dictionary of chosen columns/rows.
CUR has applications to lots of diverse data “documents”:
• to DNA SNP data and DNA microarray data,
• to spectrally- and temporally-resolved image analysis,
• to recommendation systems and internet data.
Big Algorithm Question: How to better couple data/document analysis
methods with field-dependent data generation, preprocessing, and modeling.