Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University Outline Quality assessment Normalizing expression arrays Normalizing other array types Selecting genes Identifying functional groups Array Quality Assessment ISMB 2008 Microarray Tutorial Mark Reimers Virginia Commonwealth University Approaches to Quality Control • • • • • • Careful Technique Practice! Testing the RNA quality Examining the chips and images Spot metrics Statistical approaches – Examining deviations as functions of technical variables Simple QA for Spotted Arrays • Spot Measures – Signal/Noise • Foreground / background or – foreground / SD – Uniformity – Spot Area • Global Measures – Qualitative assessments – Averages of spot measures • Inspect images for artifacts – Streaks of dye, scratches etc. • Are there biases in regions? Using Statistics for QA • Significant artifacts may not be obvious from visual inspection or bulk statistics • General approach: plot deviations from average or residuals from fit against any technical variable: – Intensity – CG content or Tm – Probe position relative to 3’ end (for poly-T primed) – Location on chip • Color-code deviations on chip image 4 5 6 7 8 9 Simple Portrait - Boxplots • Boxplot of 16 chips from Cheung et al Nature 2005 Another Portrait - Densities 0.7 Density Plots: before and after Chips 0.1 0.2 0.3 0.4 GSM25540.CEL GSM25541.CEL GSM25542.CEL GSM25543.CEL GSM25548.CEL GSM25549.CEL GSM25550.CEL GSM25551.CEL 0.0 Density 0.5 0.6 GSM25524.CEL GSM25525.CEL GSM25526.CEL GSM25527.CEL GSM25528.CEL GSM25529.CEL GSM25530.CEL GSM25531.CEL 4 6 8 10 log(Signal) 12 14 Ratio vs Intensity Plots: Saturation & Quenching • Saturation – Decreasing rate of binding of RNA at higher occupancies on probe • Quenching: – Light emitted by one dye molecule may be re-absorbed by a nearby dye molecule – Then lost as heat – Effect proportional to square of density Plot of log ratio against average log intensity across chips GSM25377 from the CEPH expression data GSE2552 How Much Variability on R-I? • Ratio-Intensity plots for six arrays at random from Cheung et al Nature (2005) Covariation with Probe Tm • MAQC project • Agilent 44K – Array 1C3 – Performed by Agilent •Plot of log ratios to average against Tm •Bimodal distribution because two samples are very different Covariation with Probe Position • RNA degrades from 5’ end • Intensity should decrease from 3’ end uniformly across chips • affyRNAdeg plots in affy package Plot of average intensity for each probe position across all genes against probe position Spatial Variation Across Chips Red/Green ratios show variation -probably concentrated Ratios of ratios on slide to ratios on standard show consistent biases In House Spotted Arrays Ratio of ratios shows much clearer concentration of red spots on some slides Legend Note non-random but highly irregular concentration of red Background Subtraction (1) • We think that local background contributes to bias • Does subtracting background remove bias? Local off-spot background may not be the best estimate of spot background (nonspecific hyb) Spots BG subtracted Background Subtraction (2) Raw spot ratios show a mild bias relative to average After subtracting a high green bg in the center a red bias results Raw Ratios Background BG-subtracted Other Bias Patterns Processed This spotted oligo array shows strong biases at the beginning and end of each print-tip group The background shows a milder version of this effect Subtracting background compensates for about half this effect Raw Spot Background Local Bias on Affymetrix Chips Image of raw data on a log2 scale shows striations but no obvious artifacts Image of ratios of probes to standard shows a smudge Noncoding probes Images show high values as red, low values as yellow Variation in Affy Chips QC in Bioconductor • Robust Multi-chip Analysis (RMA) – fits a linear model to each probe set – High residuals show regional patterns High residuals in green Available in affyPLM package at www.bioconductor.org Portion of dChip QA image High residuals in pink www.dchip.org Affy QC Metrics in Bioconductor • affyPLM package fits probe level model to Affymetrix raw data • NUSE - Normalized Unscaled Standard Errors – normalized relative to each gene • How many big errors? Conclusions • Technical bias is a significant source of error in microarray studies • Technical bias can be visualized and quantified • Normalization can only partially compensate these problems • It is best to drop chips with extreme technical deviations Are Microarrays Reliable? • Microarray studies have an uneven track record of replication • Huang et al (2003) replicated few of the markers for breast cancer survival identified by van t’Veer et al (2002) • Two Science papers in 2003 on stem cell gene expression describing parallel experiments identify only 2 genes in common out of hundreds • The MAQC project papers in fall 2006 claimed to validate microarrays. Measures of consistency were highly variable across any two platforms • Maybe both are true: – careful microarray studies are accurate Further Questions Quality assessment Normalizing expression arrays Normalizing other array types Selecting genes Identifying functional groups Normalizing Expression Data ISMB 2008 Tutorial QA and Normalization • Technical differences cause changes in measures • QA flags big technical differences in arrays – Often sporadic – e.g. scratches on chip • Normalization attempts to compensate for modest but systematic differences – e.g. intensity-dependent bias (quenching) • Both must be done together Variations in Technique with Broad Consequences for Measures • • • • • • • Temperature of hybridization Amount of RNA Degradation of RNA Yield of conversion to cDNA or cRNA Yield of labeling reaction Strength of ionic buffers Stringency of wash Many Normalization Approaches • One Parameter – Total or median brightness • Two parameter – Variance stabilizing – Lowess for two-color arrays • Non-parametric – Distribution (quantile) matching • Regression on technical covariates One Parameter – Overall Mean • Can only measure relative levels of expression: per mg RNA • Assume: only difference between chips is due to different weights of RNA hybridized N N • Set: red green Cred f i i 1 ; Cgreen f i i 1 • For each chip, normalized values are: fi* fi red / Cred ; fi* fi green / Cgreen • For 2-color cDNA use separate constant for each channel on each chip • More consistent results if use a robust estimator, such as median or 1/3 – trimmed mean (Quackenbush): take mean of middle 2/3 of probes One-Parameter Limitations 0.4 0.3 0.2 0.1 0.0 Density 0.5 0.6 0.7 Density Plots: before and after 4 6 8 10 12 14 A centering transform canlog(Signal) shift the density of log2 data but can’t get around the differences in shape of distributions Intensity Dependent Bias Ratio – Intensity (M-A) plot of raw data: M = log2(R/G) ; A = (log2(R) + log2(G)) / 2 Global (lowess) Normalization Same data set normalized by: Mnorm = M-c(A) where c(A) is an intensity dependent function estimated by local regression. Print-tip Normalization Print-tip layout Separate lowess curves for each of 16 print-tips 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Box plots for log ratios in each of 16 print-tip groups Scaled Print-tip Normalization Box plot after print-tip normalization Box plot after scaled print-tip normalization There still remain apparent technical artifacts Try to fix them by scaling after print-tip lowess: Mp,norm = sp·(Mp-cp(A)); Spatial Effects No normalization Global normalization Print-tip normalization Scaled Print-tip normalization This is too Complex! • We are piling fudge factor upon fudge factor • We don’t know what errors these fudges are introducing… • … and still haven’t removed all artifacts • How about something simpler! Quantile Normalization • Currently most widely used ‘best’ method • Implemented in BioC affy, oligo, limma • Ignores causes of variation and technical covariates • Shoehorns all data into the same shape distribution – matching quantiles Motivation: Probe Intensities in 23 Replicates on Affy chips Densities of intensities from GeneLogic spikein study; black is composite Quantile Normalization (Irizarry et al 2002) 1. Map values in each curve separately to their quantile within the distribution 2. Map quantiles of each distribution to quantiles of the reference curve The mapping by quantile normalization Ratio Intensity Correction? M-A plots of raw data from Affy chip pairs Ratio-Intensity: After Quantile Norm Critiques of Quantile Normalization • Artificially compresses variation of highly expressed genes • Confounds systematic changes due to cross-hybridization with changes in abundance to genes of low expression • Induces artifactual correlations in lowintensity genes Technical Variable Regression • Hypothesis: 1. Most technical variation between chips is caused by a few systemic factors 2. Probes with similar technical characteristics (Tm, position in gene, location on chip, intensity) will be distorted by similar amounts • Normalization: estimate bias due to technical factors by local averaging of deviations from reference profile Further Questions Models for Multiple-Probe Affymetrix and NimbleGen Data Many Probes for One Gene Gene 5´ Sequence 3´ Multiple oligo probes Perfect Match Mismatch How to combine signals from multiple probes into a single gene abundance estimate? Probe Variation • Individual probes don’t agree on fold changes • Probes vary by two orders of magnitude on each chip – CG content is most important factor in signal strength Signal from 16 probes along one gene on one chip Probe Measure Variation •Typical probes are two orders of magnitude different! •CG content is most important factor •RNA target folding also affects hybridization 3x104 0 Many Approaches • Affymetrix MicroArray Suite • dChip - Li and Wong, HSPH • Bioconductor: – RMA - Bolstad, Irizarry, Speed, et al – affyPLM – Bolstad – gcRMA – Wu • Physical chemistry models – Zhang et al Critique of Averaging (MAS5) • Not clear what an average of different probes should mean • Tukey bi-weight can be unstable when data cluster at either end – frequently the conditions here • No ‘learning’ based on cross-chip performance of individual probes Motivation for multi-chip models: Probe level data from spike-in study ( log scale ) note parallel trend of all probes Courtesy of Terry Speed Model for Probe Signal • Each probe signal is proportional to – i) the amount of target sample – a – ii) the affinity of the specific probe sequence to the target – f • NB: High affinity is not the same as Specificity – Probe can give high signal to intended target and also to other transcripts Probes 1 2 3 chip 1 a1 chip 2 a2 f1 f2 f3 Multiplicative Model • For each gene, a set of probes p1,…,pk • Each probe pj binds the gene with efficiency fj • In each sample there is an amount ai. • Probe intensity should be proportional to fjxai • Always some noise! Robust Linear Models • Criterion of fit – Least median squares – Sum of weighted squares – Least squares and throw out outliers • Method for finding fit – High-dimensional search – Iteratively re-weighted least squares – Median Polish Bolstad, Irizarry, Speed – (RMA) • For each probe set, take the log transform of PMij = aifj: log ( PM ij ) log( ai ) log( f j ) • i.e. fit the model: log ( PM ij bg) ai b j ij After normalization! • Fit this additive model by iteratively reweighted least-squares or median polish Critique: Model assumes probe noise is constant (homoschedastic) on log scale Comparing Measures Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA 20 replicate arrays – variance should be small Standard deviations of expression estimates on arrays arranged in four groups of genes Courtesy of Terry Speed by increasing mean expression level Current Multi-chip Approaches • PLIER – Affymetrix’ own multi-chip model • RMA – Bioconductor standard – Affy package – To be replaced by oligo package? • affyPLM – More sophisticated RMA – Careful with more sophisticated models! • gcRMA – Estimates background Quality assessment Normalizing expression arrays Normalizing other array types Selecting genes Identifying functional groups Further Questions Outline of Section • • • • • General Issues Normalizing CGH arrays Normalizing ChIP-chip arrays Normalizing SNP arrays Normalizing methylation (HELP) arrays General Issues • Quantile (and other expression normalizations) assume that overall distribution of values is invariant across samples – Roughly true for most expression sets – Untrue for copy number, DNA methylation, ChiP-chip data • Many choices for expression probes • Tiling, exon, and promoter arrays must work with what is there Array CGH Technology Array CGH Data: Chromosome 8 (241 BAC’s) 10 cell lines and 45 tumor samples CGH Array Pre-processing • Dye bias effects are present but because the ranges are usually smaller they have less impact – Most programs don’t do a dye bias adjustment • Probe bias effects – Most probes show consistent biases in relation to their neighbors – About 50% of total variance is such bias – Few programs adjust for that – See Bilke et al (Bioinformatics 2005) CGH Array Segmentation • Key idea: Most probe targets have same copy number as their next neighbors • Can average over neighbors • Key issue: when is a difference real? • Recommended Programs: • DNACopy – Solid statistical basis; slow • StepGram – Heuristic ; fast Further Questions ChIP-chips Chromatin Immuno-precipitation Tiling Array • One probe every n base pairs over some length of chromosome What the data look like Superimposing channels Giresi et al, Genome Res. 10 Methods and Issues • Normalization – Different enrichment ratios – Different probe thermodynamics – Dye and probe bias • Estimation – Categorical or continuous? – Individual values are noisy: • Where is the peak? TileMap • Ignores normalization • ‘Shrinkage’ estimator of variance – Improves individual scores • Smooths noise by moving average • Alternative HMM MAT • One color array – Needs separate array for comparison • Normalizes probe thermodynamics & enrichment ratio • Estimation by (robust) moving average Further Questions SNP chips Affy SNP chips SNP Chip Probe Design • 10 25-mers overlapping the SNP • Alleles A & B • Sense and Anti-sense – or PM and MM (old) RMA for SNP chips • Early Affy software wasn’t very accurate • Rabbee & Speed (2006) proposed RLMM, an RMA-like method using: – Quantile normalization – Two variables ( A & B signals) – Discriminant analysis (Mahalanobis metric) to tell apart alleles • Much better than Affy software • Has been adopted by Affy Discriminating SNPs SNP A 1753037 SNP A 1698299 Improvements • RLMM still had problems with several hundred SNPs • Also problems comparing across labs • CRLMM addresses these by careful normalization CRLMM Overview • Model intensity on each chip separately by • Summarize qA, qB, qA, qB by median polish: M+ = qA - qB ; M- = qA- qB • Model log ratio bias on each chip by • Estimate log ratio bias using E-M – Zi indexes which SNP state is likely Normalization – Step 1 • Regress (PM) intensity on sequence predictors and fragment length Normalization – Step 1 • Too many hb(t)’s – Impose constraint: • hb(t) is a cubic spline with 5 df on [1,25] • Forces neighboring values of h to be close • Allows variation in smoothness (unlike loess) • Subtract fitted values from signal • BUT: bias still present Summary Method • • • • Median Polish For arrays of numbers Iterative method Subtract medians of each row and each column (and accumulate) until medians converge Normalization Step 2 • Fit bias function: • But what is k? E-M Algorithm • • • • Systematic way to ‘guess and improve’ Start with putative assignments to classes Estimate bias and fit Use residuals from fit to classify again Final Step: Calling • Separation in two-dimensional log-ratio space: Further Questions BREAK Return at 3:45 Exploratory Data Analysis Clustering • Can be useful to detect unsuspected affiliations among clinical samples or genes • For most designed experiments significance testing or multivariate analysis are better choices When is Clustering Useful? • To examine data for artifacts • To try to identify sub-categories in undifferentiated clinical samples • To impute roles / functions for genes by association Caveats for Clustering • Ordering of samples in dendrogram is arbitrary – not meaningful relation Samples 3 & 6 could be on the right • Distance metrics are usually arbitrary • Correlation ‘distance’ usually improves if increase dynamic range Multivariate Exploration Example – Ape Brain Gene Exp. • Enard et al, Science (2002) • Nice separation between species Caveats for PCA • Multivariate methods are sensitive – Pick up subtle patterns – Can easily pick up artifacts in data 1. The clustering does not reflect phylogeny 2. The principal components don’t load heavily onto distinct subsets of genes Further Questions Quality assessment Normalizing expression arrays Normalizing other array types Selecting genes Selecting functional groups Selecting Genes Multiple Comparisons Issues ISMB 2008 Tutorial Outline • Multiple Testing – Family wide error rates – False discovery rates • Application to microarray data – Practical issues – Computing FDR by permutation procedures – Conditioning t-scores Goals • To identify genes most likely to be changed or affected • To prioritize candidates for focused followup studies • To characterize functional changes consequent on changes in gene expression Individual Variation Within Groups • Numerous genes show high levels of interindividual variation • Level of variation depends on tissue also • Donors or experimental animals may be infected or under social stress • Tissues are hypoxic or ischemic for variable times before freezing • Tissues are mixtures of different types of cells – proportions may vary • Inflammatory cells recruited differently Technical Variation • • • • • • • • Temperature of hybridization Amount of RNA Degradation of RNA Yield and lengths of cDNA conversion Strength of ionic buffers Stringency of wash Scratches on chip Random variations in number of molecules Multiple comparisons • Suppose no genes really changed – (as if random samples from same population) • 10,000 genes on a chip • Each gene has a 5% chance of exceeding the threshold at a p-value of .05 – Type I error • The test statistics for 500 genes should exceed .05 threshold ‘by chance’ Distributions of p-values Random Data Real Microarray Data Characterizing False Positives • Family-Wide Error Rate (FWE) – probability of at least one false positive arising from the selection procedure • Strong control of FWE: – Bound on the probability • False Discovery Rate: – Proportion of false positives arising from selection procedure ‘Corrected’ p-Values for FWE • Sidak (exact correction for independent tests) – pi* = 1 – (1 – pi)N – Still conservative if genes are co-regulated (correlated) – pi* = 1 – (1 – Npi + …) gives Bonferroni • Bonferroni correction – pi* = Npi, if Npi < 1, otherwise 1 – Too conservative! Increasing Power by Lower N • If we select a small subset of genes for tests we will have better p-values • Selecting well-measured or biologically more probable genes is reasonable – e.g. high S/N ratio, high variance • Selecting genes by statistical criteria related to the test that will be performed is not reasonable. – e.g. selecting genes obtained by clustering Simes’ Lemma • Suppose we order the p-values from N independent tests using random data: – p(1), p(2), …, p(N) are order statistics from uniform • Pick a target threshold a: – Then: P( p(1) < a /N | p(2) < 2 a /N | p(3) < 3 a /N | … ) = a • Hochberg’s procedure: – If any p(k) < k a /N – Then select genes (1) to (k) False Discovery Rate • At threshold t* what fraction of genes are likely to be true positives? • Illustration: 10,000 independent genes t p #sig E(FP) FDR* 1.96 .05 600 500 87% 2.57 .01 200 100 50% 3.29 .001 40 10 20% In practice use permutation algorithm to compute FDR Benjamini-Hochberg • Can’t know what FDR is for a particular sample • B-H suggest procedure specifying Average FDR – Order the p-values : p(1), p(2), …, p(N) – If any p(k) < k a /N – Then select genes (1) to (k) • q-value: smallest FDR at which the gene becomes ‘significant’ • NB: acceptable FDR may be much larger than acceptable p-value (e.g. 0.10 ) Practical Issues • Actual proportion of false positives varies from data set to data set • Mean FDR could be low but could be high in your data set The Effect of Correlation • If all genes are uncorrelated, Sidak is exact • If all genes were perfectly correlated – p-values for one are p-values for all – No multiple-comparisons correction needed • Typical gene data is highly correlated – First eigenvalue of SVD may be more than half the variance – Distribution of p-values may differ from uniform • True FDR more variable P-Values from Correlated Genes Null distribution from Null distribution from Null distribution from independent genes perfectly correlated genes highly correlated genes .5 .3 .9 .5 .3 .9 .5 .3 .9 .7 .03 .1 .5 .3 .9 .45 .2 .95 .4 .9 .05 .5 .3 .9 .65 .25 .8 .6 .8 .4 .5 .3 .9 .4 .35 .75 .2 .2 .9 .5 .3 .9 .5 .4 .85 Rows: genes; columns: samples; entries: p-values from randomized distribution Re-formulating the Question • Independent: ~5% of genes exceed .05 threshold, all the time • Perfectly correlated: all genes exceed .05 threshold ~5% of the time • Partially correlated: .05 < f1 < 1 of genes exceeds .05 threshold, .05 < f2 < 1 of the cases • New question: for a given f1 and a, how likely is it that a fraction f1 of genes will exceed the a threshold? Symptoms of Correlated Tests Typical P-value histograms with NO real changes P-values from random but correlated data P-values from random but correlated data -- permuted Distributions of numbers of p-values below threshold • 10,000 genes; • 10,000 random drawings • L: Uncorrelated R: Highly correlated Permutation Tests • We don’t know the true distribution of gene expression measures within groups • We simulate the distribution of samples drawn from the same group by pooling the two groups, and selecting randomly two groups of the same size we are testing. • Need at least 5 in each group to do this! Permutation Tests – How To • Suppose samples 1,2,…,10 are in group 1 and samples 11 – 20 are from group 2 • Permute 1,2,…,20: say – 13,4,7,20,9,11,17,3,8,19,2,5,16,14,6,18,12,15,10 – Construct t-scores for each gene based on these groups • Repeat many times to obtain Null distribution of t-scores • This will be a t-distribution original distribution has no outliers Multivariate Permutation Tests • Want a null distribution with same correlation structure as given data but no real differences between groups • Permute group labels among samples • redo tests with pseudo-groups • repeat ad infinitum (10,000 times) Critiques of Permutations • Variances of permuted values for truly changed genes are inflated – artificially low p-values • Permuted t -scores for many genes may be lower than from random samples from the same population Recurrent Aberrations • Many genetic diseases are highly variable • Large numbers of genes show changes: – mutation (Sjoblom et al, Science 2006) – differential expression (many) – amplifications/ deletions (many) – methylation (Figueroa & Reimers, submitted) • Few show consistency! Categorical Analysis: Mutations • Mutations occur randomly – But at different rates in different samples • Null model IID occurrence: Poisson – Estimated parameters: sample mutation rate l • Compare numbers of genes mutated in k samples to prediction Copy Number Patterns • Neighboring loci are dependent because long regions typically get amplified/deleted • Different samples have widely differing rates of genomic aberrations • Null model (like permutation) – same number of aberrant regions and lengths as actually occur in each sample – but occurring at random positions Segmented Data Recurrent Loci Statistical Significance • ‘Permute’ aberrations in all samples • For each n: what fraction of loci end up with > n aberrations in each permutation? • Select n such that fraction exceeding n is less than x% of that observed in y% of the permutations Further Questions Quality assessment Normalizing expression arrays Normalizing other array types Selecting genes Selecting functional groups Goals • Characterize biological meaning of joint changes in gene expression • Organize expression (or other) changes into meaningful ‘chunks’ (themes) • Identify crucial points in process where intervention could make a difference • Why? Biology is Redundant! Often sets of genes doing related functions are changed Gene Sets • Gene Ontology – Biological Process – Molecular Function – Cellular Location • Pathway Databases – KEGG – BioCarta – Broad Institute Other Gene Sets • Transcription factor targets – All the genes regulated by particular TF’s • Protein complex components – Sets of genes whose protein products function together • Ion channel receptors • RNA / DNA Polymerase • Paralogs – Families of genes descended (in eukaryotic times) from a common ancestor Approaches • Univariate: – Derive summary statistics for each gene independently – Group statistics of genes by gene group • Multivariate: – Analyze covariation of genes in groups across individuals – More adaptable to continuous statistics Univariate Approaches • Discrete tests: enrichment for groups in gene lists – Select genes differentially expressed at some cutoff – For each gene group cross-tabulate – Test for significance (Hyper-geometric or Fisher test) • Continuous tests: from gene scores to group scores – Compare distribution of scores within each group to random selections – GSEA (Gene Set Enrichment Analysis) – PAGE (Parametric Analysis of Gene Expression) Multivariate Approaches • Classical multivariate methods – Multi-dimensional Scaling – Hotelling’s T2 • Informativeness – Topological score relative to network – Prediction by machine learning tool • e.g. ‘random forest’ Contingency Table: Category X Significance Signif. NS All Genes Genes Group of Interest k n-k Others K-k Totals K (N-n)- N-n (K-k) N-K N Combinatorial Probability n P= for each table Categorical Analysis • Fisher’s Exact Test – Of all tables with same margins, how many show as much or more enrichment? • Tedious to compute when n or k are large • Approximations – Binomial (when k / n is small) – Chi-square (when expected values > 5 ) – G2 (log-likelihood ratio; compare to c2) • Approx. not needed for modern computers Signif. genes NS Genes All Group k n-k n Others K-k (N-n)-(K-k) N-n Totals K N-K N Issues in Assessing Significance • P-value or FDR? – FDR more powerful but strong correlations • What is appropriate Null Distribution? – Random sets of genes? Or – Random assignments of samples? • If a child category is significant, how to assess significance of parent category? – Include child category – Consider only genes outside child category Critiques of Discrete Approach • Discrete approaches are simple – often good for a first pass at a novel situation • No use of size of change information • Continuous procedures have twice the power of analogous discrete procedures on discretized continuous data • No account of gene covariation – Covariation changes p-values greatly Continuous Scores • Gene Set Enrichment Analysis – Subramanian et al (2005) improves by hack – Mootha et al (2003) uses distributional test • PAGE – Simple z-score test – Adds up changes in all genes in a pathway • T-profiler – T-version of PAGE GSEA • Uses Kolmogorov-Smirnov (K-S) test of distribution equality to compare t-scores for selected gene group with all genes Group Z- or T- Scores • Form z-scores for each gene: – Z = ( mG – m ) / SAll • Each gene’s z-score (zi) is distributed N(0,1) • If z-scores are assigned randomly in groups – Hence the sum over genes in a group G: z iG i / G ~ N (0,1) • Compare group scores to Normal distribution – Remember multiple comparisons! (many groups) Issues for Pathway Methods • How to assess significance? – Null distribution by permutations – Permute genes or samples? • How to handle activators and inhibitors in the same pathway? – Variance Test – Other approaches • Critique: all genes treated as independent – No account of gene covariation – Covariation changes p-values greatly Further Questions Multivariate Approaches to Gene Set Selection Key Multivariate Ideas • PCA (Principal Components Analysis) – Identify major directions of covariation in data • MDS (Multi-dimensional Scaling) – Easy graphical representation of data in terms of many variables • Hotelling T2 – Test for differences between two sample means PCA Three correlated variables PCA1 lies along the direction of maximal correlation; PCA 2 at right angles with the next highest variation. MDS Representation of Expression in Pathways • BAD pathway Normal IBC Other BC • Clear separation between groups • Variation differences Hotelling’s T2 • Compute distance between sample means using (common) metric of covariation • Where • Multidimensional analog of t (actually F) statistic Principles of Kong et al Method • Normal covariation generally acts to preserve homeostasis • The transcription of genes that participate in many processes will be changed • The joint changes in genes will be most distinctive for those genes active in pathways that are working differently Critiques of Hotelling’s T • Small samples: unreliable S estimates –N<p • Estimates of Dx and S not robust to outliers • Assumes same covariance in each sample – S1 = S2 ? Usually not in disease – Kong et al propose analog of Welch t-test – Permutation in samples for significance Making it Stable 1. Insufficient information to capture all relationships – too much correlation! – Power of Hotelling’s method comes from identifying directions of rare variation – Many (spurious) directions of 0 variation 2. Random variation in data leads to random variation in PCA • Regularization strategy: force covariance to be more like independent variables Making it Robust • Microarray data has many outliers • Multivariate methods are very much distorted by outliers • Robust estimates of covariance could give robust PCA • Simple approach: trim outliers Handling Changes of Covariance • Power of Hotelling’s method comes from identifying directions of rare variation • If one group shows little covariation in one direction but the other does – how to test for changes? • If one group is control then its covariance should be taken as standard – Robust measure of means in both groups Summary • All arrays subject to systemic error – Often hidden from view! • Best normalizations for expression arrays • Normalization methods for non-expression arrays very different • Selecting individual genes problematic in presence of correlated variation • Selection by pathways often more powerful but often spurious Additional Information • An Opinionated Guide to Microarray Data Analysis (coming August 2008) • www.people.vcu.edu/~mreimers/OGMDA • BioConductor: www.bioconductor.org