(Big) data analysis: Lecture 5, going into high dimensional datasets
Amos Tanay
Slides made available as lecture notes – not intended for presentation

Objects and features
Practical problems involve a large number of objects and a large number of features defining the objects.
Given a matrix X = {x_ij}, i=1..N, j=1..M, we can study the object correlation matrix or the feature correlation matrix:
Co = {cor(x_i*, x_j*)}, Cf = {cor(x_*i, x_*j)}
Correlations can be assessed using different approaches, for example using the linear covariance matrices: X^T X for the feature covariance matrix, X X^T for the object covariance matrix.
If we normalize the rows of X such that mean(x_i*) = 0 and std(x_i*) = 1, the object covariance equals the object correlation (and similarly for features when the columns are normalized).

Correlation matrices as an example for multiple testing
• We perform n*(n-1)/2 tests for correlation of objects
• We perform m*(m-1)/2 tests for correlation of features
• On random data, the correlation p-values should be distributed uniformly in [0,1]
• Which means that there are about 0.05 * n*(n-1)/2 pairs with p < 0.05… or n*(n-1)/2000 pairs with p < 0.001
• This is not a good thing.

Controlling False Discovery Rate (FDR)
One can control the multiple testing effect by eliminating tests (filtering bad objects, bad features, or focusing on a-priori interesting tests).
More generally, we can compute the false discovery rate of a p-value threshold T as:
q = Pr(p-value on random data < T) / Pr(p-value < T)
For example, q = 0.1 would suggest that only 10% of the tests we report (p-value < T) are expected to be random.
It is common to search for argmax_T (q(T) < FDR) and to use this threshold for reporting results with controlled FDR.
Note that the p-value on random data is uniformly distributed if the test is well calibrated (so Pr(p-value on random data < T) = T), but estimating it may require resampling in more complex situations.
The FDR concept is strongly based on the notion of test independence. The covariance matrix has very strong dependencies, so FDR cannot be applied to it directly.

Indirect correlation
• If cor(X,Y) >> 0 and cor(X,Z) >> 0, what can you say about cor(Y,Z)? (nothing in general – find examples)
• However, in many cases Y and Z will be correlated in a way that is explicable by X
• Binormal(X,Y|cov1), Binormal(X,Z|cov2) -> Pr(Y,Z)?
• For binomial or multinomial data: conditional independence: Y indep Z | X (see Bayesian network theory)
• How to approach Cor(Y,Z|X) = ?
• Cor(Y-X, Z-X) – bad idea! Cor(Y-X, Z)?
• Cor(Y,Z | X=x0), Cor(Y,Z | X in bin)

Hidden factors
• In many cases major parts of the correlation structure can depend on a hidden variable
• Examples: experimental batch, day of week, operator, doctor
• If Cor(Xi, h) > 0 for each i… then Cor(Xi, Xj) > 0 for all pairs…
• In that case, none of the observed variables can eliminate the dependencies: Cor(Y,Z|X) > 0… (find an example)

Normalization: linear
• As said, a common normalization approach is to center and scale the rows or columns of the data matrix (mean = 0, stdev = 1).
• If we want to factor out an effector variable X we can subtract the regression, Y - f(X), but then we should be aware of the indirect correlations this introduces.
• Question – what will happen if we normalize Y by: Ynorm = Y - sample(N(mean = f(X), stdev = (1 - cor(X,Y)^2)))?

Normalization: binning
• If X is a variable with strong correlation to many other variables (Y, Z, W, …) we will in many cases wish to stratify these variables given X without assuming any parameterization.
• This can be done by binning all objects into ranges of X values and observing the distributions of Y, Z, W within each bin.
• If bins are sufficiently large we can subtract from each variable its mean, or its linear regression on X, within the bin:
Y'(i) = Y(i) – mean(Y(i) | X(i) in bin b(i)), or Y'(i) = Y(i) – f(X(i) | X(i) in bin b(i)) (where f is the linear fit in the bin)
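To make the binning scheme above concrete, here is a minimal Python/numpy sketch (not from the original slides; the equal-frequency bins and the default of 10 bins are my own illustrative choices):

```python
import numpy as np

def bin_normalize(y, x, n_bins=10):
    """Subtract from y its within-bin mean, with bins defined by quantiles of x.

    A sketch of the stratified normalization described above; the equal-frequency
    bins and n_bins=10 are illustrative assumptions, not part of the lecture.
    """
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bin_id = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    y_norm = y.astype(float)
    for b in range(n_bins):
        mask = bin_id == b
        if mask.sum() > 1:
            y_norm[mask] -= y[mask].mean()   # subtract the bin mean of Y
    return y_norm

# toy usage: Y depends on X plus noise; after normalization the X effect is mostly removed
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 2 * x + rng.normal(size=2000)
print(np.corrcoef(x, bin_normalize(y, x))[0, 1])  # much smaller than np.corrcoef(x, y)[0, 1]
```

Subtracting the within-bin linear fit instead of the mean would follow the same pattern, replacing the mean subtraction with a per-bin regression residual.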
Quantile normalization
Normalization can be done by shifting (adding a constant) and scaling (multiplying by a constant). This is adequate for normally distributed data, less so for other types of data.
A more aggressive scheme is to force the entire distribution of values to be fixed. This can be done using quantile normalization.
Define percentile(X(i), X) as the normalized rank of X(i) within all X values. Define quantile(p, X) as the value at percentile p of X.
Quantile normalization of Y given X is defined as:
Y'(i) = quantile(percentile(Y(i), Y), X)
After this normalization Y and X have the same distribution, not just the same mean or variance.
One can use quantile normalization within binning stratification as in the previous slide (instead of subtracting the mean).
The approach can be very problematic with respect to outliers.

Forcing low dimension: PCA
X^T X – the empirical covariance matrix, scaled by n (if the columns of X are normalized to standard normal: mean = 0, var = 1).
What is the normalized linear combination (in simple words: a weighted combination) of features that maximizes the variance?
w – weights (such that Σ w_i^2 = 1); what is argmax_w Σ_i (x_i* w)^2, or: w = argmax_w (w^T X^T X w)?
From linear algebra we know w is the eigenvector of the largest eigenvalue of the covariance matrix: (X^T X) w1 = λ1 w1
We can project the matrix on the orthogonal complement of the first eigenvector: X2 = X – X w1 w1^T
Now we can find the largest eigenvector of X2, defining w2, and so on for w3, …
We are in fact transforming the space to a new basis of eigenvectors. The projection of each data vector on the PCs (the w's) is called the PC scores.

PCA is easy to compute, used for dimensionality reduction
To perform PCA, we will usually use Singular Value Decomposition (SVD): X = U S W^T, where U is n by n orthogonal, W is m by m orthogonal, and S is diagonal.
Some nice properties of SVD are:
a) X^T X = W S U^T U S W^T = W S^2 W^T
b) SVD is computed efficiently and is usually numerically stable (see NRC 2.6)
c) The PCA projection is computed by XW, but this is just US
We can use the top k PCs to get a data matrix of rank k (so an effective dimension of k) that is maximally similar to the original matrix.

PCA can be used on "objects" or "features"
Examples – two possible "orientations" for PCA of a gene expression matrix (tissues X genes):
1) PCA can be done with tissues in the columns. Each gene will be projected on a reduced "virtual tissues" space. You can imagine such PCs to instruct on tumor/normal behavior etc.
2) If genes are in the columns, each tissue will be projected on dimensions that combine genes into some "consolidated expression signature". You can imagine such signatures to instruct on gene expression modules (groups of genes that work together).

Do not use PCA as your major/first data analysis approach
PCA is very popular – some reasons include:
a) It is easy to compute, available in matlab/R
b) It generates 2D projections that are nice to look at
c) It does not require parameters of any kind
d) It has nice theoretical properties, especially when thinking about machine learning applications (e.g. feature selection for classification)
Nevertheless, in the context of data analysis, I am tempted to say that PCA is to be avoided in 99% of the situations:
- The PCs are linear combinations of properties – this is seldom a good assumption
- The PCs are "hiding" the data and generate plots that cannot be directly attributed to the data (e.g. problematic features, outliers, various biases)
- The PCs are very difficult to interpret unless there are 2 strong behaviors in the data – the visual gain is lost when going beyond 2D
It is recommended to approach high dimensional data with many measurements using first an exploratory approach (as we outline below), and to perform dimensionality reduction and high-dimension modeling in a way that is tailored to the problem.
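A minimal numpy sketch of PCA via SVD as outlined above (illustrative only; the column standardization and the toy data are my own assumptions, not the lecture's code):

```python
import numpy as np

def pca_scores(X, k):
    """Project the rows of X (objects x features) on the top k principal components.

    A sketch of the SVD route described above; columns are standardized first.
    """
    # standardize columns (features): mean 0, std 1
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    # thin SVD: Xn = U * diag(s) * Wt; the rows of Wt are the PC directions (the w's)
    U, s, Wt = np.linalg.svd(Xn, full_matrices=False)
    # PC scores: Xn @ Wt.T[:, :k], which equals U[:, :k] * s[:k]
    return U[:, :k] * s[:k]

# toy usage: 200 objects, 50 features, with two dominant directions plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 50)) + 0.1 * rng.normal(size=(200, 50))
scores = pca_scores(X, k=2)
print(scores.shape)  # (200, 2)
```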
The other off-the-shelf high-dim analysis solution: hierarchical clustering
Given a set of vectors Xi, we build a tree on the objects i = 1..n, defined by the relation Pa(i) (parent of i):
Compute distances d(Xi, Xj). Distances can use Pearson, Spearman, Euclidean distance, or essentially anything.
Add i = 1..n to the list of objects L. If using weights, set w(i) = 1 for i = 1..n.
Find the minimal-distance pair i1, i2.
Add a new object n+1, set Pa(i1) = n+1, Pa(i2) = n+1. Update d(n+1, j) for each j using one policy:
1. Max(d(j,i1), d(j,i2)) or Min(d(j,i1), d(j,i2)) (maximum/minimum linkage)
2. Weighted mean: [w_i1 d(j,i1) + w_i2 d(j,i2)] / (w_i1 + w_i2)
3. Set X_{n+1} = X_i1 + X_i2, d(n+1, j) = d(X_{n+1}, Xj) (possibly weighting X using w)
Remove i1, i2 from the list L. If using weights, set w_{n+1} = w_i1 + w_i2.
Continue iteratively (adding objects n+2, …, decreasing the size of L by 1 each time, until a single object remains).

Hclust – caveats
Hclust is popular for visualizing the data in heat maps (ordering the objects or features) – but the tree is _not_ defining an order (the two children of any internal node can be flipped)
Effective for detecting big divisions in the data
Of course – effective when hierarchical structure is expected. But when the data is not clearly partitioned/divided/hierarchical, hclust will be difficult to understand and work with
May make sense for visualizing the correlation matrix more than the data matrix
It lets us look at the original data, which is a good thing (compare the PCA issues above), but smoothing can introduce visual artifacts that mislead novices
Limited as a basis for further modeling
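A minimal sketch of the agglomerative procedure above using scipy's off-the-shelf implementation (the library choice, parameters and toy data are my own; 'average' linkage corresponds to the size-weighted mean policy 2):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# toy data: 60 objects in 3 loose groups, 10 features each
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, size=(20, 10)) for c in (0.0, 3.0, 6.0)])

# pairwise distances (Euclidean here; correlation-based distances are also common)
D = pdist(X, metric="euclidean")

# build the agglomerative tree
Z = linkage(D, method="average")

# cut the tree into a fixed number of clusters for inspection
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels))  # sizes of the 3 resulting clusters
```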
A more organized intro to clustering
Clustering is a fuzzy task! No canonical goals or quality metric.
We partition a group of objects into clusters:
Property 1: maximize the similarity of objects within clusters
Property 2: maximize the separation between clusters
Property 3: use the right number of clusters
The tradeoffs between 1 and 2 are not well defined! (think of examples)
In exploratory data analysis our goals may be a bit different:
Property A: capture all of the data by a relatively small number of well-defined models that are easy to understand/visualize
Property B: the model defining each cluster is well determined
Property C: random data will not contain clusters significantly different from the background distribution of all data

Clustering: k-means
Given a set of vectors Xi, we optimize a clustering solution C1 + C2 + … + Ck = 1..n, with the Ci disjoint, uniquely defining cl(i) = the cluster of element i.
K is fixed and we try to minimize the dispersion of our clusters, defined by:
center_j = (1/|Cj|) Σ_{i in Cj} Xi
score = Σ_i ||Xi – center_cl(i)||^2
(Other schemes are possible.)
This is done by brute-force optimization:
Incremental algorithm (starting from any solution C): examine all possible cluster substitutions cl(i) = j -> l. If one substitution improves the score, apply it and continue with the new solution iteratively.
Iterative algorithm (starting from any solution C): compute centers; update C such that cl(i) = the nearest center to Xi.

Initialization
K-means, like most algorithms we will learn in the coming lectures, is supersensitive to initialization.
A large number of random starting points can be attempted – selecting the highest scoring solution.
An effective heuristic for initialization is selecting seeds successively: given i seeds, we rank all objects by their minimal distance to the existing seeds, and sample the i+1 seed from the objects in the top 20 percentiles of this distance (so preferring new seeds that are far away from existing seeds).
Many other algorithmic variants are possible – replacing centers with medians, changing the distance or scoring scheme, adding new seeds for clusters that become empty, trying to merge or split clusters (change k) during the run.

Using k-means for data exploration: key points
1. We select a number of clusters that is larger than what we think we need.
2. This is ok as long as clusters are not becoming too small. The guideline is that we approximate a complex distribution using pieces we can understand.
3. An appropriate size of a cluster depends on the statistics we collect – each feature defines a 1D distribution within each cluster – this distribution should be appropriately determined (as we covered in the first lectures).
4. Observing several similar clusters (with slight overfitting in each) can be handled in post-process – do not try to optimize the number of clusters!
5. Frequently most of the data is "boring" (similar to the example of a 2D scatterplot that is focused at 0). The goal function of most clustering algorithms will not allow appropriate modelling of strongly imbalanced clusters – it will be important to filter such cases in preprocess (but take note of their existence and frequency!).
6. Combinatorial effects can be problematic for clustering in general, k-means in particular. For example, 4 independent groups of features, each partitioning the objects into two distinct groups, will generate 16 clusters in theory – with no model for the connections between clusters. This should be diagnosed and handled using more sophisticated models.
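A minimal sketch of iterative k-means with the far-point seeding heuristic described above (my own illustrative implementation; the 20-percentile fraction, iteration count and toy data are assumptions):

```python
import numpy as np

def init_far_seeds(X, k, top_frac=0.2, rng=None):
    """Pick k seeds successively, sampling each new seed from the objects that are
    farthest (top 20 percentiles by default) from the already chosen seeds."""
    rng = rng or np.random.default_rng()
    seeds = [rng.integers(len(X))]
    for _ in range(k - 1):
        # minimal distance of every object to the existing seeds
        d = np.min(np.linalg.norm(X[:, None, :] - X[seeds][None, :, :], axis=2), axis=1)
        far = np.argsort(d)[-max(1, int(top_frac * len(X))):]  # farthest objects
        seeds.append(rng.choice(far))
    return X[seeds].copy()

def kmeans(X, k, n_iter=50, rng=None):
    """Plain iterative k-means: assign to the nearest center, recompute centers."""
    centers = init_far_seeds(X, k, rng=rng)
    for _ in range(n_iter):
        cl = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        for j in range(k):
            if np.any(cl == j):
                centers[j] = X[cl == j].mean(axis=0)
    cl = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
    score = np.sum((X - centers[cl]) ** 2)   # the dispersion score defined above
    return cl, centers, score

# toy usage: deliberately use more clusters than the 2 true groups (key point 1)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, size=(100, 5)) for c in (0.0, 4.0)])
cl, centers, score = kmeans(X, k=4, rng=rng)
print(np.bincount(cl), round(score, 1))
```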
Back to probabilistic modeling – multivariate normal distribution
Generalizing 2D correlation with a set of pairwise covariances:
S – the covariance matrix, |S| = det(S)
f(x) = (2π)^(-k/2) |S|^(-1/2) exp(-(1/2) (x - μ)^T S^(-1) (x - μ))
Multi-normality is a very strong restriction on data – and yet the model has many (~d^2) parameters and is prone to severe overfitting.
Restricting the covariance structure can be approached in many ways.

Probabilistic analog of clustering: mixture models
Pr(Xi) = Σ_j ρ_j Pr(Xi | θ_j)
Pr(j | Xi) = ρ_j Pr(Xi | θ_j) / Σ_h ρ_h Pr(Xi | θ_h)
We model uncertainty in the assignment of objects to mixture components.
Each mixture component can be based on some appropriate model:
- Independent discrete (binomial, multinomial)
- Independent normals
- Multivariate normals
- More complex dependency models (e.g. tree, Bayesian network, Markov Random Field…)
This is our first model with a hidden variable: T -> X, or P(T,X) = P(T) * P(X|T).
The probability of the data is derived by marginalization over the hidden variables.

Expectation Maximization (EM)
(Working with e.g. a mixture of independent or correlated normal distributions)
Initialize the mixture component parameters and mixture coefficients: θ(1), ρ(1)
Compute the posterior probabilities: w_ij = Pr(t_i = j | Xi)
Set new mixture coefficients: ρ(2)_j = (1/m) Σ_i w_ij
Set new mixture components using the weighted data (w_ij, Xi)
Iterate until convergence (no significant change in θ, ρ between iterations)
Updating mixture components using weighted data – examples:
Independent normal: E(X[k]), E(X^2[k]) are determined by weighted statistics
Full multivariate: all statistics (E(X[k]X[u])) are determined by weighted statistics

Expectation-Maximization
The most important property of the EM algorithm is that it is guaranteed to improve the likelihood of the data. This DOES NOT MEAN it will find a good solution. Bad initialization will converge to a bad outcome.
Here t ranges over all possible assignments of the hidden mixture variable, one per sample – so k^m possibilities:
log P(X | θ) = log Σ_t P(t, X | θ)
P(t, X | θ) = P(t | X, θ) P(X | θ)
log P(X | θ) = log P(t, X | θ) - log P(t | X, θ)
log P(X | θ) = Σ_t P(t | X, θ^k) log P(t, X | θ) - Σ_t P(t | X, θ^k) log P(t | X, θ)
Q(θ | θ^k) = Σ_t P(t | X, θ^k) log P(t, X | θ)
log P(X | θ) - log P(X | θ^k) = Q(θ | θ^k) - Q(θ^k | θ^k) + Σ_t P(t | X, θ^k) log [P(t | X, θ^k) / P(t | X, θ)]
The last term is a relative entropy, hence >= 0.
EM maximization (Dempster): θ^(k+1) = argmax_θ Q(θ | θ^k)

Expectation-Maximization: Mixture model
Q(θ | θ^k) = Σ_t P(t | X, θ^k) log P(t, X | θ) = Σ_i Σ_j P(t_i = j | Xi, θ^k) log P(t_i = j, Xi | θ)
Q can be brought back to the form of a sum over independent terms: k of them depend only on the parameters of one mixture component, and one depends on the mixture coefficients.
Maximization of each of these components can then be shown to be done as in the non-mixture case (e.g. using the empirical covariance matrix for the multi-normal distribution).

KL-divergence
Entropy (Shannon): H(P) = -Σ_i P(x_i) log P(x_i)
max H(P): the uniform distribution P(x_i) = 1/K gives H(P) = log K; min H(P) = 0
Kullback-Leibler divergence: H(P || Q) = Σ_i P(x_i) log [P(x_i) / Q(x_i)]
H(P || Q) >= 0, since -Σ_i P(x_i) log [Q(x_i) / P(x_i)] >= -Σ_i P(x_i) (Q(x_i) / P(x_i) - 1) = 0 (using log(u) <= u - 1)
H(P || Q) != H(Q || P) – Not a metric!!
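A tiny numeric sketch of the entropy and KL formulas above (the example distributions are arbitrary choices, not from the lecture):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) = -sum P log P (natural log)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def kl(p, q):
    """Kullback-Leibler divergence H(P||Q) = sum P log(P/Q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([1/3, 1/3, 1/3])

print(entropy(Q), np.log(3))   # the uniform distribution attains the maximum, log K
print(kl(P, Q), kl(Q, P))      # both >= 0, but not equal: KL is not symmetric
```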