Introduction to Microarray Data Analysis - II
BMI 730
Kun Huang
Department of Biomedical Informatics
Ohio State University

Outline
• Review of Microarray
• Elements of Statistics and Gene Discovery in Expression Data
• Elements of Machine Learning and Clustering of Gene Expression Profiles

How does a two-channel microarray work?
• The printing process introduces errors and larger variance
• Comparative hybridization experiment

How does a microarray work?
• Fabrication expense and the frequency of errors increase with probe length, so 25-mer oligonucleotide probes are employed.
• Problem: cross-hybridization
• Solution: introduce a mismatch probe that differs from the perfect-match probe at a single (central) position. The difference between the two readings gives a more accurate signal.

How do we use microarrays?
• Inference
• Clustering

Normalization
• Which normalization algorithm to use
• Inter-slide normalization
• Not just for Affymetrix arrays

Outline
• Review of Microarray
• Elements of Statistics and Gene Discovery in Expression Data
• Elements of Machine Learning and Clustering of Gene Expression Profiles

Hypothesis Testing
• Two sets of samples drawn from two distributions (N = 2)
• Hypotheses
  - Null hypothesis: m1 = m2
  - Alternative hypothesis: m1 ≠ m2
  where m1 and m2 are the means of the two distributions.

Student's t-test
• The p-value is computed from the t-value and the degrees of freedom (related to the number of samples); assuming normal distributions, it bounds the probability of a type-I error (declaring a difference significant when there is none).
• Dependent (paired) t-test

Permutation (t-)test
• The t-test relies on a parametric distribution assumption (normality). Permutation tests do not depend on such an assumption; examples include the permutation t-test and the Wilcoxon rank-sum test.
• Procedure:
  - Perform the regular t-test to obtain the t-value t0.
  - Randomly permute the N1 + N2 samples, designate the first N1 as group 1 and the rest as group 2, perform the t-test again, and record the t-value t.
  - Over all possible permutations, count how many t-values are larger than t0 and call this number K0. The fraction K0 over the total number of permutations estimates the p-value.

Multiple Classes (N > 2): F-test
• The null hypothesis is that the distribution of gene expression is the same for all classes.
• The alternative hypothesis is that at least one class has a distribution that differs from the others.
• Which class is different cannot be determined by the F-test (ANOVA) itself; it can only be identified post hoc.

Example
• GEO Dataset
• Subgroup effect

Gene Discovery and Multiple T-tests: Controlling False Positives
• p-value cutoff = 0.05 (probability of a false positive, i.e., a type-I error)
• 22,000 probesets
• Expected false discoveries: 22,000 × 0.05 = 1,100
• Focus on those 1,100 genes in a second specimen: expected false discoveries 1,100 × 0.05 = 55

Gene Discovery and Multiple T-tests: Controlling False Positives
• State the set of genes explicitly before the experiment
  - Problem: not always feasible, defeats the purpose of large-scale screening, and could miss important discoveries
• Statistical tests to control the false positives

Gene Discovery and Multiple T-tests: Controlling False Positives
• Statistical tests to control the false positives
  - Controlling for no false positives (very stringent, e.g., Bonferroni methods)
  - Controlling the number of false positives
  - Controlling the proportion of false positives
• Note that in the screening stage a false positive is better than a false negative, since the latter means missing a possibly important discovery.
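To see the scale of the multiple-testing problem, the following minimal MATLAB sketch (hypothetical group sizes of 5 vs. 5 and simulated null data, not the lecture's dataset) runs a t-test on 22,000 probesets with no true differential expression; the number passing an unadjusted p < 0.05 cutoff comes out near the 22,000 × 0.05 ≈ 1,100 figure above.

% Illustration only: simulated data in which no gene is truly differentially expressed.
nGenes = 22000; n1 = 5; n2 = 5;          % hypothetical group sizes
x = randn(n1, nGenes);                   % group 1, one column per probeset
y = randn(n2, nGenes);                   % group 2, drawn from the same distribution
[~, p] = ttest2(x, y);                   % column-wise two-sample t-tests
fprintf('Probesets with p < 0.05: %d (expected about %d)\n', ...
        sum(p < 0.05), round(0.05 * nGenes));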
Gene Discovery and Multiple T-tests: Controlling False Positives
• Statistical tests to control the false positives
  - Controlling for no false positives (very stringent): Bonferroni methods and multivariate permutation methods
• Bonferroni inequality: P(E1 ∪ … ∪ EK) ≤ P(E1) + … + P(EK) (the area of a union is at most the sum of the areas).

Gene Discovery and Multiple T-tests: Bonferroni Methods
• Bonferroni adjustment
  - If Ei is the event of a false-positive discovery for gene i, the Bonferroni inequality bounds the probability of any false positive by K × p0. Conservatively speaking, with p0 = 0.05 this bound exceeds 1 once K > 19, so for large K a false positive is essentially guaranteed.
  - So change the p-value cutoff from p0 to p0/K. This is called the Bonferroni adjustment.
  - If K = 20 and p0 = 0.05, we call gene i significantly differentially expressed if pi < 0.0025.

Gene Discovery and Multiple T-tests: Bonferroni Methods
• Bonferroni adjustment
  - Too conservative: excessive stringency leads to more false negatives (type-II errors).
  - Causes problems for meta-analysis.
• Variation: sequential Bonferroni test (Holm-Bonferroni test)
  - Sort the K p-values from smallest to largest: p1 ≤ p2 ≤ … ≤ pK.
  - Change the p-value cutoff for the i-th smallest p-value to p0/(K − i + 1), i.e., compare p1 to p0/K, p2 to p0/(K − 1), …, pK to p0.
  - If pj ≤ p0/(K − j + 1) for all j ≤ i but p(i+1) > p0/(K − i), declare genes 1 to i significant and retain the null hypotheses (reject the alternatives) for genes i + 1 to K.

Gene Discovery and Multiple T-tests: Controlling False Positives
• Statistical tests to control the false positives
  - Controlling the number of false positives
    - Simple approach: choose a p-value cutoff lower than the usual 0.05 but higher than the Bonferroni-adjusted cutoff.
    - More sophisticated approach: a version of multivariate permutation.

Gene Discovery and Multiple T-tests: Controlling False Positives
• Statistical tests to control the false positives
  - Controlling the proportion of false positives
    - Let g be the proportion (percentage) of false positives among the total discovered genes: g = (false positives) / (total positives).
    - The p-value cutoff pD is chosen so that the estimated proportion g stays at an acceptable level.
    - There are other ways of estimating false positives; details can be found in Tusher et al., PNAS 98:5116-5121.
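The sketch below shows, on a small set of hypothetical p-values, how these control strategies translate into cutoffs: the single Bonferroni cutoff p0/K, the Holm-Bonferroni step-down cutoffs p0/(K − i + 1), and, for controlling the proportion of false positives, the Benjamini-Hochberg rule (one standard choice, used here as a stand-in for the SAM-style estimate of Tusher et al. described above). The p-values and the target proportion q are illustrative only.

% Hypothetical sorted p-values for K genes (not from the lecture data).
p0 = 0.05;
p  = sort([0.0002 0.0009 0.004 0.011 0.03 0.08 0.2 0.6]);
K  = numel(p);

bonf_sig = p < p0 / K;                        % Bonferroni: single cutoff p0/K

holm_cut  = p0 ./ (K - (1:K) + 1);            % Holm: step-down cutoffs p0/(K-i+1)
firstFail = find(p > holm_cut, 1);            % stop at the first violated cutoff
holm_sig  = true(1, K);
if ~isempty(firstFail), holm_sig((1:K) >= firstFail) = false; end

q = 0.1;                                      % target proportion of false positives
bh_cut   = (1:K) / K * q;                     % Benjamini-Hochberg cutoffs i*q/K
lastPass = find(p <= bh_cut, 1, 'last');      % largest index whose p-value passes
bh_sig   = false(1, K);
if ~isempty(lastPass), bh_sig(1:lastPass) = true; end

disp([bonf_sig; holm_sig; bh_sig])            % one row of calls per method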
Outline
• Review of Microarray
• Elements of Statistics and Gene Discovery in Expression Data
• Elements of Machine Learning and Clustering of Gene Expression Profiles

Review of Microarray and Gene Discovery; Clustering and Classification
• Preprocessing
• Distance measures
• Popular algorithms (not necessarily the best ones)
• More sophisticated ones
• Evaluation
• Data mining
  - Clustering or classification?
  - Is training data available?
  - What domain-specific knowledge can be applied?
  - What preprocessing of the data is needed?
    - Log / data scale and numerical stability
    - Filtering / denoising
    - Nonlinear kernel
  - Feature selection (do I need to use all the data?)
  - Is the dimensionality of the data too high?

How do we process microarray data (clustering)?
- Feature selection – genes, transformations of expression levels
  - Genes discovered in the class comparison (t-test). Risk: missing genes.
  - Iterative approach: select genes under different p-value cutoffs, then pick the cutoff that gives good performance under cross-validation.
  - Principal components (pros and cons).
  - Discriminant analysis (e.g., LDA).

Distance Measure (Metric?)
- What do you mean by "similar"?
  - Euclidean
  - Uncentered correlation
  - Pearson correlation

Distance Metric - Euclidean
Expression values of the two probe sets used in the example, across 14 arrays:

  102123_at (Lip1)   160552_at (Ap1s1)
  3189.000           5410.900
  1596.000           1321.300
  4144.400           3162.100
  2040.900           2164.400
  3986.900           4100.900
  1277.000            868.600
  3083.100           4603.200
  4090.500            185.300
  6105.900           6066.200
  1357.600            266.400
  3245.800           5505.800
  1039.200           2527.800
  4468.400           5702.700
  1387.300           7295.000

dE(Lip1, Ap1s1) = 12883

Distance Metric - Pearson Correlation
- Ranges from 1 (r = 1, perfectly correlated) to -1 (r = -1, perfectly anti-correlated).
- For the same two probe sets, dP(Lip1, Ap1s1) = 0.904.
  [Scatter plot of Ap1s1 versus Lip1 expression omitted.]

Distance Metric - Uncentered Correlation
- For the same two probe sets, du(Lip1, Ap1s1) = 0.835, corresponding to an angle θ of about 33.4°.

Distance Metric - Difference between Pearson correlation and uncentered correlation
- Pearson correlation: a baseline expression level is possible (profiles are centered by their means).
- Uncentered correlation: all values are considered signal (no centering).
  [Side-by-side scatter plots of the two probe sets omitted.]

Distance Metric - Difference between Euclidean distance and correlation

Distance Metric - Missing: negative correlation may also mean "close" in a signaling pathway (use 1 − |PCC| or 1 − PCC²).

Review of Microarray and Gene Discovery; Clustering and Classification
• Preprocessing
• Distance measures
• Popular algorithms (not necessarily the best ones)
• More sophisticated ones
• Evaluation
• Data mining

How do we process microarray data (clustering)?
- Unsupervised Learning – Hierarchical Clustering
  - Single linkage: the linking distance is the minimum distance between two clusters.
  - Complete linkage: the linking distance is the maximum distance between two clusters.
  - Average linkage / UPGMA: the linking distance is the average of all pairwise distances between members of the two clusters. Since all genes and samples carry equal weight, this linkage is the Unweighted Pair Group Method with Arithmetic mean (UPGMA).
- Properties
  - Single linkage – prone to chaining and sensitive to noise
  - Complete linkage – tends to produce compact clusters
  - Average linkage – sensitive to the distance metric
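As a concrete sketch of these linkage options, MATLAB's pdist/linkage/dendrogram functions accept the correlation distance and the three linkage rules directly. The matrix expr below is a random stand-in (genes in rows, samples in columns); substitute real, preprocessed expression data in practice.

expr = randn(50, 14);                       % stand-in for a genes-by-samples matrix
D = pdist(expr, 'correlation');             % 1 - Pearson correlation between profiles
% pdist also offers 'euclidean' and 'cosine'; 'cosine' is 1 - uncentered correlation.
for method = {'single', 'complete', 'average'}    % average linkage = UPGMA
    Z = linkage(D, method{1});              % build the hierarchical clustering tree
    figure; dendrogram(Z); title([method{1} ' linkage']);
end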
Unsupervised Learning – Hierarchical Clustering: Dendrograms
• Distance – the height of each horizontal line represents the distance between the two groups it merges.
• Order – open-source R uses the convention that the tighter clusters are on the left. Others have proposed ordering leaves by expression values, loci on chromosomes, and other ranking criteria.

How do we process microarray data (clustering)?
- Unsupervised Learning – K-means
  - Vector quantization
  - K-D trees
  - Need to try different K; sensitive to initialization

- Unsupervised Learning – K-means
  % K-means in MATLAB: 4 is the number of clusters K, 'corr' selects the correlation
  % distance metric, and 'rep', 20 runs 20 replicates from different initializations.
  [cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep', 20);

- Unsupervised Learning – K-means
  - The number of classes K needs to be specified
  - Does not always converge
  - Sensitive to initialization

- Issues
  - Lack of consistency or representative features (5.3 TP53 + 0.8 PTEN doesn't make sense)
  - Data structure is missing
  - Not robust to outliers and noise
  (D'Haeseleer 2005, Nat. Biotechnol. 23(12):1499-501)

- Model-based clustering methods (Han) http://www.cs.umd.edu/~bhhan/research2.html
  (Pan et al., Genome Biology 2002, 3:research0009.1, doi:10.1186/gb-2002-3-2-research0009)

- Structure-based clustering methods

- Supervised Learning – Support vector machines (SVM) and kernels
  - Only a (binary) classifier; no data model
  - Accuracy vs. generality – overfitting
  - Model selection
  [Figure: prediction error vs. model complexity for the training sample and the testing sample, reproduced from Hastie et al.]
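To make the model-selection point concrete, here is a minimal sketch of a binary SVM with cross-validation. The data are random stand-ins (rows = samples, columns = selected genes), the kernel function and scale are illustrative choices, and fitcsvm is the newer Statistics and Machine Learning Toolbox classifier rather than necessarily the tool used in the course; comparing error on the training sample with a cross-validated estimate of testing error is one simple way to watch for overfitting.

% Sketch only: hypothetical data and kernel settings.
X = randn(40, 20);                          % 40 samples, 20 selected gene features
y = [ones(20, 1); -ones(20, 1)];            % two class labels
mdl      = fitcsvm(X, y, 'KernelFunction', 'rbf', 'KernelScale', 1);
trainErr = resubLoss(mdl);                  % error on the training sample
cvErr    = kfoldLoss(crossval(mdl, 'KFold', 5));   % estimate of the testing error
fprintf('training error %.2f, cross-validated error %.2f\n', trainErr, cvErr);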