(Big) data analysis: Lecture 5, going into high dimensional datasets

Amos Tanay
Slides made available as lecture notes – not intended for presentation
Objects and features
Practical problems involve a large number of objects and a large number of
features defining the objects.
Given a matrix X = {x_ij}, i=1..N, j=1..M, we can study the object correlation
matrix or the feature correlation matrix:
C_o = {cor(x_i*, x_j*)}
C_f = {cor(x_*i, x_*j)}
Correlations can be assessed using different approaches, for example, using
the linear covariance matrix:
XX^T for the object covariance matrix
X^TX for the feature covariance matrix
If we normalize X such that mean(x_i*)=0, std(x_i*)=1, then
covariance = correlation.
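A minimal numpy sketch of both correlation matrices (the variable names and the random test matrix are illustrative, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # N=100 objects (rows), M=20 features (columns)

# Object-object correlations: z-score each row, then C_o = Z Z^T / M
Zr = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
C_o = Zr @ Zr.T / X.shape[1]

# Feature-feature correlations: z-score each column, then C_f = Z^T Z / N
Zc = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)
C_f = Zc.T @ Zc / X.shape[0]

# Sanity check against numpy's corrcoef
assert np.allclose(C_o, np.corrcoef(X))      # rows as variables
assert np.allclose(C_f, np.corrcoef(X.T))    # columns as variables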
Correlation matrices as an example of multiple testing
• We perform n*(n-1)/2 tests for correlation of objects
• We perform m*(m-1)/2 tests for correlation of features
• On random data, the correlation p-value should be distributed uniformly
in [0,1]
• Which means that there are about 0.05*n*(n-1)/2 pairs with p<0.05… or n*(n-1)/2000 with p<0.001
• This is not a good thing.
Controlling False Discovery Rate (FDR)
One can control the multiple testing effect by eliminating tests (filtering bad
objects, bad features, or focusing on a-priori interesting tests).
More generally, we can compute the false discovery rate of a p-value
threshold T as:
q = Pr(p-value on random data < T) / Pr(p-value < T)
For example, q = 0.1 would suggest that only 10% of the tests we report
(p-value < T) are expected to be random.
It is common to search for argmax_T(q(T) < FDR), and to use this threshold for
reporting results with controlled FDR.
Note that the p-value on random data is uniformly distributed if the test is well
defined (so Pr(p-value on random < T) = T), but it may require resampling in more
complex situations.
The FDR concept is strongly based on the notion of test independence. The
covariance matrix has very strong dependencies, so FDR cannot be used on
it directly.
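A minimal sketch of this thresholding, assuming uniform null p-values (so Pr(p-value on random < T) = T); the function name and the simulated p-values are illustrative, and the rule is essentially the Benjamini-Hochberg step-up:

import numpy as np

def fdr_threshold(pvals, fdr=0.1):
    """Largest T with estimated q(T) = Pr(random p < T) / Pr(observed p < T) <= fdr.

    Assumes null p-values are uniform on [0,1], so Pr(random p < T) = T.
    """
    p = np.sort(np.asarray(pvals))
    n = len(p)
    # q at each observed p-value, used as candidate threshold T = p[k]
    q = p * n / np.arange(1, n + 1)
    passing = np.nonzero(q <= fdr)[0]
    return p[passing.max()] if passing.size else None

# Example: 950 null tests plus 50 "real" effects with small p-values
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(size=950), rng.uniform(0, 1e-3, size=50)])
print("report tests with p <", fdr_threshold(pvals, fdr=0.1))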
Indirect correlation
• If cor(X,Y)>>0 and cor(X,Z)>>0, what can you say about cor(Y,Z)? (nothing
in general – find examples)
• However, in many cases Y and Z will be correlated in a way that is explicable
by X
• Binormal(X,Y|cov1), Binormal(X,Z|cov2) -> Pr(Y,Z)?
• For binomial or multinomial data: conditional independence: Y indep Z | X
(see Bayes net theory)
• How to approach Cor(Y,Z|X) = ?
• Cor(Y-X, Z-X) – bad idea! Cor(Y-X, Z)?
• Cor(Y,Z|X=X0), Cor(Y,Z|X in bin)
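A possible sketch of the binned variant Cor(Y,Z | X in bin), with illustrative simulated data in which the Y-Z correlation is driven entirely by X:

import numpy as np

def cor_given_x_bins(x, y, z, n_bins=10):
    """Correlation of y and z within quantile bins of x (a crude stand-in for Cor(Y,Z|X))."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    cors = []
    for b in range(n_bins):
        mask = bins == b
        if mask.sum() > 2:
            cors.append(np.corrcoef(y[mask], z[mask])[0, 1])
    return np.array(cors)

# Toy example: y and z are both driven by x, so their marginal correlation is
# mostly indirect, and the within-bin correlations are much weaker
rng = np.random.default_rng(2)
x = rng.normal(size=5000)
y = x + rng.normal(scale=0.5, size=5000)
z = x + rng.normal(scale=0.5, size=5000)
print("cor(y,z) =", np.corrcoef(y, z)[0, 1])
print("mean cor(y,z | x in bin) =", cor_given_x_bins(x, y, z).mean())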
Hidden factors
• In many cases major parts of the correlation
structure can depend on a hidden variable
• Examples: experimental batch, day of the week,
operator, doctor
• If Cor(X, h) > 0 for each observed variable X… then Cor(X,Y)>0 for all
pairs…
• In that case, none of the observed variables can
eliminate the dependencies: Cor(Y,Z|X)>0… (find
an example)
Normalization: linear
• As said, a common normalization approach is to center and scale the rows
or columns of the data matrix (mean = 0, stdev = 1).
• If we want to factor out an effector variable X we can subtract the
regression, Y - f(X), but then be aware of the indirect correlation this introduces.
• Question – what will happen if we normalize Y by:
Ynorm = Y - sample(N(mean = f(X), stdev = 1 - cor(X,Y)^2))?
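A small illustrative sketch of both options, z-scoring and subtracting a linear regression f(X) (np.polyfit is used here just as one convenient way to fit f):

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)

# Center and scale (z-score) a variable
y_z = (y - y.mean()) / y.std()

# Factor out the effector variable x by subtracting a linear regression f(x)
slope, intercept = np.polyfit(x, y, deg=1)
y_resid = y - (slope * x + intercept)

print("cor(x, y)       =", np.corrcoef(x, y)[0, 1])
print("cor(x, y_resid) =", np.corrcoef(x, y_resid)[0, 1])   # ~0 after normalization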
Normalization: binning
• If X is a variable with strong correlation to many other variables (Y, Z, W, …)
we will in many cases wish to stratify these variables given X without
assuming any parameterization.
• This can be done by binning all objects into ranges of X values and observing
the distributions of Y, Z, W within each bin.
• If bins are sufficiently large we can subtract from each variable its mean or
its linear regression with X within the bin:
Y'(i) = Y(i) - mean(Y(i) | X(i) in bin b(i)), or
Y'(i) = Y(i) - f(X(i) | X(i) in bin b(i)) (where f is the linear fit in the bin)
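A minimal sketch of the binning scheme (quantile bins of X, subtracting the within-bin mean); the bin count and the simulated data are arbitrary illustrative choices:

import numpy as np

def bin_normalize(x, y, n_bins=20):
    """Subtract from y its mean within each quantile bin of x (non-parametric stratification)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    y_norm = y.astype(float).copy()
    for b in range(n_bins):
        mask = bins == b
        y_norm[mask] -= y[mask].mean()
    return y_norm

rng = np.random.default_rng(4)
x = rng.normal(size=5000)
y = np.sin(x) + rng.normal(scale=0.3, size=5000)      # non-linear dependence on x
y_norm = bin_normalize(x, y)
print("cor(x, y)      =", np.corrcoef(x, y)[0, 1])
print("cor(x, y_norm) =", np.corrcoef(x, y_norm)[0, 1])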
Quantile normalization
Normalization can be done by shifting (adding a constant) and scaling
(multiplying by a constant). This is adequate for normally distributed data,
less so for other types of data.
A more aggressive scheme is to force the entire distribution of values to be
fixed. This can be done using quantile normalization.
Define percentile(X(i), X) as the normalized rank of X(i) within all X values.
Define quantile(i, X) to be the value of the i-th percentile in X.
Quantile normalization of Y given X is then defined as:
Y'(i) = quantile(percentile(Y(i), Y), X)
After this normalization Y and X have the same distribution, not just the same
mean or variance.
One can use quantile normalization within binning stratification as in the
previous slide (instead of subtracting the mean).
The approach can be very problematic with respect to outliers.
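A minimal sketch of quantile normalization as defined above (the helper name and the simulated data are illustrative):

import numpy as np

def quantile_normalize(y, x):
    """Map each y value onto the quantile of x at y's own percentile."""
    # percentile(y_i, y): normalized rank of y_i within y
    ranks = np.argsort(np.argsort(y))
    pct = ranks / (len(y) - 1)
    # quantile(p, x): the value at percentile p of x
    return np.quantile(np.asarray(x, dtype=float), pct)

rng = np.random.default_rng(5)
x = rng.normal(size=2000)
y = rng.exponential(size=2000)          # very different distribution from x
y_qn = quantile_normalize(y, x)
# After normalization the distributions match (compare e.g. the quartiles);
# the rank order of y is preserved because the mapping is monotone.
print(np.quantile(x, [0.25, 0.5, 0.75]))
print(np.quantile(y_qn, [0.25, 0.5, 0.75]))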
Forcing low dimension: PCA
X^TX is the empirical covariance matrix, scaled by n (if the columns of X are
normalized to mean = 0, var = 1).
What is the normalized linear combination (in simple words: a weighted
average) of features that maximizes the variance?
w – weights (normalized such that Σw_i^2 = 1)
What is argmax_w Σ_i (x_i* · w)^2?
Or: w = argmax_w (w^T X^T X w)
From linear algebra we know w is the eigenvector of the largest eigenvalue of
the covariance matrix:
(X^TX)w1 = λ1 w1
We can project the matrix on the orthogonal complement of the first
eigenvector:
X2 = X - X w1 w1^T
Now we can find the largest eigenvector of X2, defining w2, and so on for w3, …
We are in fact transforming the space to a new basis of eigenvectors. The
projection of each data vector on the PCs (the w's) is called the PC scores.
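A small numpy sketch of this deflation view (illustrative data; np.linalg.eigh is used to get the top eigenvector of the covariance):

import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                     # center the columns (features)

# w1: eigenvector of X^T X with the largest eigenvalue
evals, evecs = np.linalg.eigh(X.T @ X)     # eigh returns eigenvalues in ascending order
w1 = evecs[:, -1]

# Deflate: project X onto the orthogonal complement of w1, then repeat
X2 = X - np.outer(X @ w1, w1)
evals2, evecs2 = np.linalg.eigh(X2.T @ X2)
w2 = evecs2[:, -1]

print("w1 . w2 =", w1 @ w2)                              # ~0: the PCs are orthogonal
print(np.var(X @ w1), ">=", np.var(X @ w2))              # score variance decreases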
PCA is easy to compute, used for dimensionality reduction
To perform PCA, we will usually use Singular Value Decomposition (SVD):
X = USW^T
where U is n by n orthogonal, W is m by m orthogonal, and S is diagonal.
Some nice properties of SVD are:
a) X^TX = WSU^TUSW^T = WS^2W^T
b) SVD is computed efficiently and is usually numerically stable (see NRC 2.6)
c) The PCA projection is computed by XW, but this is just US
We can use the top k PCs to get a data matrix of rank k (so effective dimension
k) that is maximally similar to the original matrix.
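A minimal sketch of PCA via SVD with numpy (illustrative data; it checks that XW and US agree, and shows how a rank-k reconstruction is formed):

import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 30))
X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize the columns

U, s, Wt = np.linalg.svd(X, full_matrices=False)   # X = U S W^T

k = 2
scores = X @ Wt[:k].T           # projection on the top-k PCs ...
scores_alt = U[:, :k] * s[:k]   # ... which is just U S restricted to k components
assert np.allclose(scores, scores_alt)

# Rank-k reconstruction: the closest rank-k matrix to X (in the least-squares sense)
X_k = (U[:, :k] * s[:k]) @ Wt[:k]
print("rank-k reconstruction error:", np.linalg.norm(X - X_k))
print("fraction of variance kept:", (s[:k] ** 2).sum() / (s ** 2).sum())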
PCA can be used on "objects" or "features"
Examples – two possible "orientations" for PCA of a gene expression matrix
(tissues x genes):
1) PCA of single-cell gene expression can be done with tissues in the columns.
Each gene will be projected on a reduced "virtual tissues" space. You can
imagine such PCs being informative on, e.g., tumor/normal behavior.
2) If genes are in the columns, each tissue will be projected on dimensions that
combine genes into some "consolidated expression signature". You can
imagine such signatures being informative on gene expression modules (groups of
genes that work together).
Do not use PCA as your major/first data analysis approach
PCA is very popular – some reasons include:
a) It is easy to compute, available in matlab/R
b) It generates projections on 2D that are nice to look at
c) It does not require parameters of any kind
d) It has nice theoretical properties, especially when thinking about machine
learning applications (e.g. feature selection for classification)
Nevertheless, in the context of data analysis, I am tempted to say that PCA is to be
avoided in 99% of the situations.
- The PCs are linear combinations of properties – this is seldom a good assumption
- The PCs are "hiding" the data and generate plots that cannot be directly
attributed to the data (e.g. problematic features, outliers, various biases)
- The PCs are very difficult to interpret unless there are 2 strong behaviors in the
data – the visual gain is lost when going beyond 2D
It is recommended to approach high-dimensional data with many measurements using
first an exploratory approach (as we outline below), and to perform dimensionality
reduction and high-dimension modeling in a way that is tailored to the problem.
The other off-the-shelf high-dim analysis solution: hierarchical clustering
Given a set of vectors Xi:
We build a tree on the objects i=1..n, defined by the relation Pa(i) (parent of i).
Compute distances d(Xi, Xj). Distances can use Pearson, Spearman, Euclidean distance,
or essentially anything.
Add i=1..n to the list of active objects L. If using weights, set w(i) = 1 for i=1..n.
Find the minimal-distance pair i1, i2.
Add a new object n+1, set Pa(i1) = n+1, Pa(i2) = n+1.
Update d(n+1, j) for each j using one policy:
1. Max(d(j,i1), d(j,i2)) or Min(.,.) (maximum/minimum linkage)
2. WeightedMean = 1/(w_i1 + w_i2) [w_i1 d(j,i1) + w_i2 d(j,i2)]
3. Set X_(n+1) = X_i1 + X_i2, d(n+1,j) = d(X_(n+1), X_j) (possibly weighting X using w)
Remove i1, i2 from the list L. If using weights, set w_(n+1) = w_i1 + w_i2.
Continue iteratively (adding objects n+2, …, decreasing the size of L by 1 each time
until a single root remains).
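A minimal, naive sketch of this agglomerative procedure with the weighted-mean linkage (policy 2); it is O(n^3) and for illustration only – in practice one would use an optimized routine such as scipy.cluster.hierarchy.linkage:

import numpy as np

def hclust(X, metric=lambda a, b: np.linalg.norm(a - b)):
    """Naive agglomeration following the slide: returns Pa, the parent relation."""
    n = len(X)
    d = {(i, j): metric(X[i], X[j]) for i in range(n) for j in range(i + 1, n)}
    L = list(range(n))                      # active objects
    w = {i: 1.0 for i in L}                 # weights = cluster sizes
    Pa, nxt = {}, n
    while len(L) > 1:
        # find the closest active pair
        i1, i2 = min(((i, j) for i in L for j in L if i < j), key=lambda p: d[p])
        Pa[i1], Pa[i2] = nxt, nxt
        w[nxt] = w[i1] + w[i2]
        L = [x for x in L if x not in (i1, i2)]
        for j in L:                         # weighted-mean linkage update
            d1 = d[(min(i1, j), max(i1, j))]
            d2 = d[(min(i2, j), max(i2, j))]
            d[(j, nxt)] = (w[i1] * d1 + w[i2] * d2) / w[nxt]
        L.append(nxt)
        nxt += 1
    return Pa

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 0.3, size=(5, 2)), rng.normal(3, 0.3, size=(5, 2))])
print(hclust(X))     # the two groups merge internally before joining at the root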
Hclust – caveats
Hclust is popular for visualizing the data in heat maps (ordering the objects
or features) – but the tree does _not_ define an order (we can flip the order of
the two subtrees at any node).
Effective for detecting big divisions in the data.
Of course – effective when a hierarchical structure is expected.
But when the data is not clearly partitioned/divided/hierarchical, hclust will
be difficult to understand and work with.
May make sense for visualizing the correlation matrix more than the data
matrix.
We can look at the original data, which is a good thing (compare the PCA issues
above), but smoothing can introduce visual artifacts that mislead novices.
Limited as a basis for further modeling.
A more organized intro to clustering
Clustering is a fuzzy task! There is no canonical goal or quality metric.
We partition a group of objects into clusters:
Property 1: maximize the similarity of objects within clusters
Property 2: maximize the separation between clusters
Property 3: use the right number of clusters
The tradeoffs between 1/2 are not well defined! (think of examples)
In exploratory data analysis our goals may be a bit different:
Property A: capture all of the data with a relatively small number of well-defined
models that are easy to understand/visualize
Property B: the model defining each cluster is well determined
Property C: random data will not contain clusters significantly different from
the background distribution of all the data
Clustering: k-means
Given a set of vectors Xi:
We optimize a clustering solution C1 ∪ C2 ∪ … ∪ Ck = {1..n}, with the Ci disjoint,
uniquely defining cl(i) = the cluster of element i.
K is fixed and we try to minimize the dispersion of our clusters, defined by:
center_j = (1/|Cj|) Σ_{i in Cj} Xi
score = Σ_i ||Xi - center_cl(i)||^2 (other schemes are possible)
This is done by brute-force optimization:
Incremental algorithm (starting from any solution C):
Examine all possible cluster substitutions cl(i) = j -> l.
If a substitution improves the score, apply it and continue with the new solution
iteratively.
Iterative algorithm (starting from any solution C):
Compute the centers. Update C such that cl(i) = the nearest center to Xi.
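A minimal sketch of the iterative (Lloyd-style) variant; the initialization and the toy data are illustrative choices:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate center updates and reassignment, starting from a random solution."""
    rng = np.random.default_rng(seed)
    cl = rng.integers(k, size=len(X))                  # start from a random solution C
    for _ in range(n_iter):
        centers = np.array([X[cl == j].mean(axis=0) if np.any(cl == j)
                            else X[rng.integers(len(X))] for j in range(k)])
        new_cl = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        if np.array_equal(new_cl, cl):
            break
        cl = new_cl
    score = ((X - centers[cl]) ** 2).sum()             # dispersion around the cluster centers
    return cl, centers, score

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in (0, 4, 8)])
cl, centers, score = kmeans(X, k=3)
print(score, np.bincount(cl))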
Initialization
K-means, like most algorithms we will learn in the coming lectures, is super-sensitive to initialization.
A large number of random starting points can be attempted, selecting the
highest-scoring solution.
An effective heuristic for initialization is selecting seeds successively:
given i seeds, we rank all objects by their minimal distance to the existing
seeds, and sample the (i+1)-th seed from the objects whose minimal distance is within the
top 20 percentiles (so preferring new seeds that are far away from existing seeds).
Many other algorithmic variants are possible – replacing centers with medians,
changing the distance or scoring scheme, adding new seeds for clusters that
become empty, trying to merge or split clusters (changing k) during the run.
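A minimal sketch of this seeding heuristic (the function name, the 20% cutoff parameter, and the toy data are illustrative):

import numpy as np

def init_seeds(X, k, top_pct=0.2, seed=0):
    """Successive seeding: sample each new seed from the ~20% of objects that are
    farthest (by minimal distance) from the seeds chosen so far."""
    rng = np.random.default_rng(seed)
    seeds = [rng.integers(len(X))]                       # first seed: a random object
    while len(seeds) < k:
        d_min = np.min(((X[:, None, :] - X[seeds][None, :, :]) ** 2).sum(axis=2), axis=1)
        cutoff = np.quantile(d_min, 1 - top_pct)         # keep the farthest top_pct of objects
        candidates = np.nonzero(d_min >= cutoff)[0]
        seeds.append(rng.choice(candidates))
    return X[seeds]

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in (0, 4, 8)])
print(init_seeds(X, k=3))      # seeds tend to land in different groups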
Using K-means for data exploration: key points
1. We select a number of clusters that is larger than what we think we need.
2. This is OK as long as clusters are not becoming too small. The guideline is that
we approximate a complex distribution using pieces we can understand.
3. An appropriate size of a cluster depends on the statistics we collect – each
feature defines a 1D distribution within each cluster, and this distribution should
be appropriately determined (as we covered in the first lectures).
4. Observing several similar clusters (with slight overfitting in each) can be
handled in post-processing – do not try to optimize the number of clusters!
5. Frequently most of the data is "boring" (similar to the example of a 2D scatter
plot that is focused at 0). The goal function of most clustering algorithms will not
allow appropriate modeling of strongly imbalanced clusters – it is important
to filter such cases in preprocessing (but take note of their existence and frequency!).
6. Combinatorial effects can be problematic for clustering in general, and k-means in
particular. For example, 4 independent groups of features that each partition the objects
into two distinct groups will generate 16 clusters in theory – with no model for
the connection between clusters.
This should be diagnosed and handled using more sophisticated models.
Back to probabilistic modeling – multivariate normal distribution
Generalizing 2D correlation with a set of pairwise covariances.
Σ – the covariance matrix, |Σ| = det(Σ)

f(x) = 1/sqrt((2π)^k |Σ|) · exp( -(1/2) (x - μ)^T Σ^(-1) (x - μ) )

Multi-normality is a very strong restriction on data – yet the model has many
(~d^2) parameters and is prone to severe overfitting.
Restricting the covariance structure can be approached in many ways.
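A minimal sketch of evaluating this log-density with numpy, checked against scipy (the random parameters are illustrative):

import numpy as np
from scipy.stats import multivariate_normal

def mvn_logpdf(x, mu, sigma):
    """Log of the multivariate normal density written above."""
    k = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(11)
A = rng.normal(size=(3, 3))
sigma = A @ A.T + np.eye(3)                    # a valid (positive definite) covariance
mu, x = rng.normal(size=3), rng.normal(size=3)
print(mvn_logpdf(x, mu, sigma), multivariate_normal(mu, sigma).logpdf(x))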
Probabilistic analog of clustering: mixture models
Pr(Xi) = Σ_j ρ_j Pr(Xi | θ_j)
Pr(j | Xi) = ρ_j Pr(Xi | θ_j) / Σ_h ρ_h Pr(Xi | θ_h)
We model uncertainty in the assignment of objects to mixture components.
Each mixture component can be based on some appropriate model:
- Independent discrete (binomial, multinomial)
- Independent normals
- Multivariate normals
- More complex dependency models (e.g. tree, Bayesian network, Markov
Random Field, …)
This is our first model with a hidden variable: T -> X, or P(T,X) = P(T)·P(X|T).
The probability of the data is derived by marginalization over the hidden
variables.
Expectation Maximization (EM)
(Working with, e.g., a mixture of independent or correlated normal
distributions)
Initialize the mixture component parameters and mixture coefficients: θ^1, ρ^1
Compute the posterior probabilities: w_ij = Pr(t = j | Xi)
Set new mixture coefficients: ρ^2_j = (1/m) Σ_i w_ij (the superscript indexes the EM iteration)
Set new mixture components using the weighted data w_ij, Xi
Iterate until convergence (no significant change in θ, ρ between iterations)
Updating a mixture component using weighted data – examples:
Independent normal: E(X[k]), E(X^2[k]) are determined by the weighted statistics
Full multivariate: all statistics (E(X[k]X[u])) are determined by the weighted statistics
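A minimal sketch of EM for a mixture of normals with independent (diagonal) components; the initialization, the small variance floor, and the toy data are illustrative choices, not part of the lecture:

import numpy as np

def em_mixture(X, k, n_iter=50, seed=0):
    """EM for a mixture of diagonal-covariance normals."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]            # initialize theta from random points
    var = np.ones((k, d))
    rho = np.full(k, 1.0 / k)                          # mixture coefficients
    for _ in range(n_iter):
        # E-step: posteriors w_ij = Pr(t=j | X_i)
        logp = -0.5 * (((X[:, None, :] - mu[None]) ** 2 / var[None]).sum(-1)
                       + np.log(2 * np.pi * var).sum(-1)[None]) + np.log(rho)[None]
        logp -= logp.max(axis=1, keepdims=True)
        w = np.exp(logp)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted statistics per component
        Nj = w.sum(axis=0)
        rho = Nj / n
        mu = (w.T @ X) / Nj[:, None]
        var = (w.T @ X ** 2) / Nj[:, None] - mu ** 2 + 1e-6   # small floor for stability
    return rho, mu, var, w

rng = np.random.default_rng(12)
X = np.vstack([rng.normal(0, 1, size=(300, 2)), rng.normal(5, 1, size=(200, 2))])
rho, mu, var, w = em_mixture(X, k=2)
print(rho, "\n", mu)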
Expectation-Maximization
The most important property of the EM algorithm is that it is guaranteed to
improve the likelihood of the data.
This DOES NOT MEAN it will find a good solution. Bad initialization will
converge to a bad outcome.
log P(X | θ) = log Σ_t P(t, X | θ)
(here t ranges over all possible assignments of the hidden mixture variable per sample – so k^m possibilities)
P(t, X | θ) = P(t | X, θ) P(X | θ)
log P(X | θ) = log P(t, X | θ) - log P(t | X, θ)
Multiplying by P(t | X, θ^k) and summing over t:
log P(X | θ) = Σ_t P(t | X, θ^k) log P(t, X | θ) - Σ_t P(t | X, θ^k) log P(t | X, θ)
Define Q(θ | θ^k) = Σ_t P(t | X, θ^k) log P(t, X | θ). Then:
log P(X | θ) - log P(X | θ^k) = Q(θ | θ^k) - Q(θ^k | θ^k) + Σ_t P(t | X, θ^k) log [ P(t | X, θ^k) / P(t | X, θ) ]
The last term is a relative entropy, so it is >= 0.
EM maximization
θ^(k+1) = argmax_θ Q(θ | θ^k)
(Dempster et al.)
Expectation-Maximization: Mixture model
Q(θ | θ^k) = Σ_t P(t | X, θ^k) log P(t, X | θ) = Σ_t P(t | X, θ^k) ( Σ_i log P(t_i, X_i | θ) )
Q can be brought back to the form of a sum over independent terms: k of them
depend only on the parameters of one mixture component, and one depends
on the mixture coefficients. Maximization of each of these components can then
be shown to reduce to the non-mixture case (e.g. using the empirical
covariance matrix for the multi-normal distribution).
KL-divergence
Entropy (Shannon):
H(P) = -Σ_i P(x_i) log P(x_i)
max H(P) = -Σ_i (1/K) log(1/K) = log K (attained by the uniform distribution over K values)
min H(P) = 0

Kullback-Leibler divergence:
H(P || Q) = Σ_i P(x_i) log [ P(x_i) / Q(x_i) ]

Non-negativity, using log(u) <= u - 1:
-H(P || Q) = Σ_i P(x_i) log [ Q(x_i) / P(x_i) ] <= Σ_i P(x_i) ( Q(x_i)/P(x_i) - 1 ) = Σ_i ( Q(x_i) - P(x_i) ) = 0

H(P || Q) ≠ H(Q || P) – not a metric!
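A minimal sketch of entropy and KL divergence for discrete distributions (the example distributions are illustrative):

import numpy as np

def entropy(p):
    """Shannon entropy H(P) = -sum p log p (0 log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -(p[nz] * np.log(p[nz])).sum()

def kl(p, q):
    """Kullback-Leibler divergence H(P||Q) = sum p log(p/q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return (p[nz] * np.log(p[nz] / q[nz])).sum()

p = np.array([0.7, 0.2, 0.1])
q = np.array([1 / 3, 1 / 3, 1 / 3])
print(entropy(p), "<=", np.log(3))      # entropy is maximized by the uniform distribution
print(kl(p, q), kl(q, p))               # non-negative, and not symmetric (not a metric)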