Microarray Design and Analysis: Linear Model Analyses
Edinburgh, 13-15 Oct 2008
Luc Janss, University of Aarhus, luc.janss@djf.au.dk

Course overview & general remarks

Overview:
1. Normalisations (LOWESS, variance, a bit on background). Practicals: getting started with R.
2. Simple analyses (t-tests, clustering), false positives and the use of the False Discovery Rate.
3. More advanced analyses using linear models.

Assumed background:
- (Virtually) none in statistics
- (Virtually) none in the use of R
- Some general biological background on expression, arrays and array experiments

Types of microarrays
- One sample per array (Affymetrix type)
- Two samples per array (e.g. spotted cDNA type)
Here the focus is on the two-sample arrays: they are generally noisier, with more complications in design and analysis. Most of the material also applies (sometimes more simply) to one-sample arrays.

1st morning: Normalisations
- Logging, means and medians
- MA-plot and dye-intensity bias
- Correction using a lowess fit
- Choice of correction basis
- Background correction: effect on variation at low intensities, handling of negative values
- A few remarks on quality control
- Introduction to the statistical package "R"

The raw data: use of means/medians and logging

Raw data
- Spot localisation: size may vary; quality of localisation may vary
- Count of the number of pixels in the spot area and their colour (intensity) values
- Total, mean or median of the colour values of the pixels; use of the median intensity is safest because it is not affected by a single extreme pixel
- Similar for a number of background spots
- And all of this for both green (cy3) and red (cy5)

Gene expression from two channels: plotted cy3 and cy5 raw data
- Expectation: a cloud around the 45° line, because most genes are neutral
- Problem: although the scale goes up to 15,000,
the majority of points lies below 1,500 (here: 92%). Any trends and relationships in this plot say more about the last 8% of the data than about the first 92%.

Gene expressions from two channels after logging
(Plot: logged cy3 vs cy5 intensities; the cloud now spreads evenly around the "equal expression" diagonal, with "more green" on one side and "more red" on the other.)

Logging, means and medians
- So generally we work on the log scale: extreme values pose fewer problems, and the scale is better spread
- On the log scale we can fairly safely use means and linear models
- For quantities obtained before logging (e.g. intensities, background corrections), medians are safer to use
- The median is the middle point of a data range, and is not affected by an extreme value in a tail

The M-A plot and lowess normalisation

A more convenient plot: the MA-plot
- M = difference of the log intensities: M is the log(fold change), with "more green" below 0 and "more red" above 0
- A = average of the log intensities

Logging and M-values
- The difference of logs = log(ratio) of the original values
- So M can be seen as the log(fold change): M < 0 means ratio < 1 (down), M > 0 means ratio > 1 (up)
- Often log base 2 is used; then every unit of M up (or down) represents a doubling (or halving) of the expression ratio

Computations using M and A
- Let C1 = intensity in channel 1 (usually cy3) and C2 = intensity in channel 2 (usually cy5)
- Log to base 2: LC1 = log(C1)/log(2), LC2 = log(C2)/log(2); in R also log2( )
- A = (LC1 + LC2)/2
- M = LC2 - LC1, which corresponds to log2(cy5/cy3)
- Back-transformation from M, A to LC1, LC2: LC1 = A - 0.5M; LC2 = A + 0.5M

Some observations on the MA-plot
- The average M is not 0 but < 0 (ratio < 1). Could it really be that more genes are down, or could the slide simply have too much red?
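The M/A computations above can be written out directly. A minimal Python sketch (the function names are ours, not from the course material):

```python
import math

def ma_from_intensities(c1, c2):
    """Convert channel intensities (C1 = cy3, C2 = cy5) to (M, A) on the log2 scale."""
    lc1 = math.log2(c1)
    lc2 = math.log2(c2)
    a = (lc1 + lc2) / 2   # A: average log intensity
    m = lc2 - lc1         # M: log2(cy5/cy3), the log fold change
    return m, a

def intensities_from_ma(m, a):
    """Back-transform (M, A) to the two log2 intensities (LC1, LC2)."""
    return a - 0.5 * m, a + 0.5 * m

# A twofold excess of cy5 over cy3 gives M = 1 (one doubling):
m, a = ma_from_intensities(500, 1000)
```

The back-transformation shows that (M, A) carries exactly the same information as the two log intensities, just rotated by 45°.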
(A dye swap or controls could tell.)
- As the average intensity increases, spots become more red. Although this could be real, it is more likely an artefact that needs correction.
- The spread of the M values increases at low intensity.

So are these the correct top spots?
(Two MA-plots: "up" and "down" genes selected with a horizontal cut-off on the raw plot, versus the same selection after the dye-intensity trend is removed. Or are the latter the better choice?)

LOWESS (or LOESS) correction: pulls the MA-plot "straight", and also works for curved plots.

LO(W)ESS: background
- Takes a portion of your data; the portion must be given by the user (e.g. 0.3 to 0.5)
- The data need sorting, or the results come out sorted
- Fits a straight line, moves up one data point, repeats
(Illustration: the moving windows using portion = 0.50.)

Effect of the LOWESS portion
- A low portion parameter gives little smoothing; a high portion parameter gives strong smoothing
- For the MA-plot, 0.3-0.5 usually gives good results: the fit follows the main trend but does not chase every aberration

Further issues for the LOWESS fit
LOWESS can be applied to various sets of genes:
- All "real" genes on the array (not controls etc.): assumes there is no excess of either up or down genes (on average "neutral")
- Controls
- A subset of the "real" genes which we believe to be neutral
Of course this choice can crucially affect the genes selected in our top lists.

Scenarios when adjusting lowess to controls or subsets
(Illustration: raw data versus lowess-adjusted data, with genes, controls and the limits for selecting top genes indicated.)

Adjusting to controls or subsets
- We can also do some statistics to see whether the level of the controls differs from that of the genes
- A problem with using controls or subsets is that extrapolation outside the range of the controls is dangerous: it creates loss of data.
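The windowed idea behind lowess (fit a line to a portion of the sorted data, slide the window up one point, repeat) can be sketched as follows. This is a simplified illustration only: real LOWESS also applies tricube distance weights and robustness iterations, which are omitted here.

```python
def windowed_linear_smooth(x, y, portion=0.5):
    """For each point, least-squares fit a straight line to the nearest
    `portion` of the (sorted) data and evaluate the fit at that point.
    A sketch of the lowess window mechanism, without the robust weighting."""
    pts = sorted(zip(x, y))
    n = len(pts)
    k = max(2, int(portion * n))          # window size from the user-given portion
    fitted = []
    for i, (xi, _) in enumerate(pts):
        lo = min(max(0, i - k // 2), n - k)
        win = pts[lo:lo + k]
        xm = sum(p[0] for p in win) / k
        ym = sum(p[1] for p in win) / k
        sxx = sum((p[0] - xm) ** 2 for p in win)
        b = sum((p[0] - xm) * (p[1] - ym) for p in win) / sxx if sxx else 0.0
        fitted.append(ym + b * (xi - xm))  # local line evaluated at xi
    return [p[0] for p in pts], fitted

# On perfectly linear data the moving fit reproduces the line y = 2x:
xs, fit = windowed_linear_smooth([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12])
```

Normalisation then subtracts the fitted trend from each M value, pulling the MA-plot "straight". In R the built-in lowess() function does the real version.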
(Illustration: the lowess fit can only be made within the intensity range covered by the controls; data outside that range, on either side, are lost.)

A bit on background correction

Background and foreground
Example: foreground signals 500 (red) and 300 (green), background hybridisation 100. What is the better assessment of the ratio between red and green?
- 500/300 = 1.67
- (500-100)/(300-100) = 2.00

Problems with background correction

Red   Green   Background   Red/Green
500   300       0            1.67
500   300     100            2.00
500   300     200            3.00
500   300     250            5.00
500   300     299           >200
500   300     310           "-19"

- If the background gets close to one (or both) of the foreground signals, very large ratios can arise; this can dramatically increase the spread of M values in the low-intensity region
- If the background gets above one of the foreground signals, a negative ratio results; negative values cannot be logged, and actually have no meaning
- With background correction, the M values become inaccurate whenever one foreground is close to the background

Solutions for working with background correction
Points which become negative after correction:
- Drop them -> loss of data
- Arbitrarily set them to 1 -> gives an arbitrary ratio
- Arbitrarily reduce the entire background level -> also gives arbitrary ratio values
The safest option remains to drop negative points, but this is still undesirable (good points get lost too, see the following illustration). A background computed on blank/empty spots is often lower and gives less loss of data.

Illustration
(Bar plot: a spot whose green signal lies below the background level.) This spot would be lost after background correction, but surely looks interesting! With the green arbitrarily set at background+1 we would compute a ratio of ~400; without background correction we would compute a ratio of ~5.

Background correction?
Many people now advocate not correcting for background:
- It introduces large variation and extreme ratios
- The handling of signals close to the background remains unsatisfactory
- The safe option would be to drop points below background, but this also drops interesting up/down genes
- Background corrections are applied on the raw (non-logged) intensities, which introduces further problems with extreme values

A bit on quality control

Quality control
Quality control can be done at several levels:
- By the scanner / image-analysis software
- Using statistics within a slide
- Using statistics between slides

Quality control from the image analysis
Can report such features as:
- Standard deviation of the pixel intensities: a large s.d. can indicate non-round spots, bad spot localisation, "dust" particles, etc.
- Number of pixels in the spot: a very small number may actually be a dust particle in an otherwise faint spot

Within-slide analysis for QC
- Having duplicate (or more) measurements per gene helps to filter out bad spots
- Analysis of sector effects (usually each sector is associated with a print tip)
- Analysis of control and empty/blank spots

Between-slide QC
- Slides with large variation or extreme values
- Slides with poor within-slide repeatability for duplicate gene expressions
- Check the repeatability of genes between slides; filter out poor duplicates at the gene level
- Slides with dye effects (bad swap) (see the Wolfinger model tomorrow)

Second half day

Outline afternoon
- The problem of variance heterogeneity
- A look at multiple slides: boxplots, variance normalisation
- A first simple statistic for analysis: the t-test
- Multiple testing problems: false negatives and positives, the False Discovery Rate, a permutation approach (SAM)
- Clustering

Variance heterogeneity
Variance heterogeneity can lead to unsatisfactory choices for
top genes:
- A simple "cut-off" favours high-variance genes
- Statistics (like the t-test) favour low-variance genes
Variance heterogeneity can be a problem between slides, between genes, between sectors within a slide, ...
Note: variance estimates based on small numbers (such as per gene) are very inaccurate -> an extra complication, addressed on Wednesday.

Illustration of variance heterogeneity
(MA-plot sketch: a low-variance gene and a high-variance gene with the same average up-regulation.)
- A simple cut-off would pick up the high-variance gene in some replicates, and never the low-variance gene
- A statistical analysis would sooner pick up the low-variance gene, because the statistics say it has a higher chance of really deviating from M = 0

1. Boxplots and variance normalisation between slides

Variances between slides: boxplots
(Boxplot anatomy: the box runs from Q1 (25% of the data) to Q3 (75%), with the median (Q2) inside; the whiskers extend 1.5 interquartile ranges beyond the box.) The 1.5 × IQ range is expected to contain nearly all your data points; points outside this range are "outliers" and are drawn individually.

Variance standardisation
- A common approach for standardising the yi is: (yi - mean) / (standard deviation)
- This approach is sensitive to outliers and extreme values: maybe those are your up/down genes, and you do not want to penalise them too much!
- More robust / less affected by extreme values:
  - Use the median instead of the mean
  - Use absolute deviations instead of squared deviations (as in the variance and standard deviation)

Scale normalisation by Yang et al. (2002, Nucleic Acids Research 30(4):e15)
- Recommends normalisation of the M values
- Considers, within an array, the absolute deviation of each observation from the median: |yi - median|
- Determines the "scale" (MAD) as the median of all |yi - median|
- You can standardise to an average scale, computed as the geometric mean of the MAD values

Illustration scale normalisation by Yang et al.
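The scale-normalisation recipe above (median-centre each channel, compute its MAD, rescale to the geometric-mean MAD) fits in a few lines. A minimal Python sketch, assuming each channel is given as a list of per-gene values:

```python
from statistics import median

def scale_normalise(channels):
    """Yang et al.-style scale normalisation sketch:
    median-centre each channel, compute its MAD (median absolute
    deviation), then rescale every channel to the geometric mean
    of all the MADs."""
    centred = [[v - median(ch) for v in ch] for ch in channels]
    mads = [median(abs(v) for v in ch) for ch in centred]
    c = 1.0
    for m in mads:
        c *= m
    c **= 1.0 / len(mads)          # geometric mean of the MADs
    return [[v * c / m for v in ch] for ch, m in zip(centred, mads)]

# The four channels (two slides, two dyes) from the worked example:
channels = [[8, 7, 3, 1, 9], [15, 2, 6, 5, 13],
            [9, 7, 5, 2, 6], [13, 15, 8, 9, 11]]
norm = scale_normalise(channels)
```

With these data the MADs come out as 2, 4, 1, 2, so the scaling constant is (2·4·1·2)^(1/4) = 2, matching the worked example on the following slides.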
(Plots by Dan Nettleton, Iowa State Univ.)

A simple example (slides by Dan Nettleton, Iowa State Univ.)

Gene:       1   2   3   4   5
Slide1Cy3:  8   7   3   1   9
Slide1Cy5: 15   2   6   5  13
Slide2Cy3:  9   7   5   2   6
Slide2Cy5: 13  15   8   9  11

Determine the channel medians: 7, 6, 6 and 11. Subtract them (this is the data after median centring), and find the median absolute deviation (MAD) of each channel:

Gene:       1   2   3   4   5   MAD
Slide1Cy3:  1   0  -4  -6   2    2
Slide1Cy5:  9  -4   0  -1   7    4
Slide2Cy3:  3   1  -1  -4   0    1
Slide2Cy5:  2   4  -3  -2   0    2

Find the scaling constant as the geometric mean of the MADs: C = (2 × 4 × 1 × 2)^(1/4) = 2. The scaling factor for each channel is then C/MAD: 2/2, 2/4, 2/1 and 2/2.

Scale normalise the median-centred data (multiply each channel by its scaling factor):

Gene:       1     2     3     4     5
Slide1Cy3:  1     0    -4    -6     2
Slide1Cy5:  4.5  -2.0   0.0  -0.5   3.5
Slide2Cy3:  6     2    -2    -8     0
Slide2Cy5:  2     4    -3    -2     0

This is the data after median centring and scale normalising.

Quantile Normalisation (Bolstad et al.
2003, Bioinformatics 19(2):185-193)
- Commonly applied to Affymetrix array data
- Forces the distributions between slides to be exactly the same by:
  - substituting all biggest values across slides by the average of the biggest values,
  - similarly for the second-biggest values,
  - the third-biggest values,
  - ... down to all smallest values across slides

Illustration quantile normalisation
(Plots by Dan Nettleton, Iowa State Univ.)

Remarks on quantile normalisation
- Quantile normalisation creates statistical difficulties, because observations on different slides are no longer independent: each "observation" is now a mean over multiple slides.
- It looks a bit like non-parametric methods based on the ranks of observations (within slide) - but then I would prefer such a direct rank-based analysis to this semi-non-parametric quantile normalisation.

2. Accounting for variance heterogeneity between genes using the t-test

The t-test (or Student's test)
- The t-test computes a standardised effect accounting for the number of observations (signal/noise): t = effect × √n / st.dev.
- If n is bigger or the st.dev. smaller, the t-statistic is larger
- Significant for roughly > 2 to 3 or < -2 to -3 (depending also on the number of observations used for computing the st.dev.)
- Can be further transformed into a p-value

Example t-test

Gene  Avg effect (on n=4)  Std. dev. (on n=8)  t     p-value
A     +2.00                1.8                 2.22  0.045
B     +2.00                1.3                 3.08  0.011

- The p-values can be computed, for instance, in Excel using TDIST or in R using pt
- Usually "two-sided" testing is used, because you do not know a priori the sign of a gene effect; so for a 0.05 significance level you apply 0.025 in each tail

3.
Multiple testing problems; the False Discovery Rate

Statistical errors (1)
- Statistical errors are inevitable: if you want a zero-risk test, never accept an alternative (skip the experiment). But then you make "false negative" errors...
- "Null hypothesis": the starting point, e.g. "no effect of the gene"
- Test for a "gene effect":
  - If there is really no effect but the test says there is: false positive
  - If there is an effect but the test does not detect it: false negative

Statistical errors (2)
- The statistical significance (or α) level is the fraction of false positives; it can be set by the experimenter. Usually 1-5%: you have to risk some to gain some, and "usually" 1-5% false positives is found acceptable.
- The level of false negatives can vary and may be high (it depends on the power of the experiment: poor power means many true effects remain undetected = false negatives)
- For microarray analysis, 1-5% false positives is still a lot: e.g. 100 out of 10,000 genes are "false" with α = 0.01.

Accounting for false positives: example with 10,000 genes tested

                               α=1%     α=0.1%
Total genes tested            10,000    10,000
Expected false positives         100        10
Actual # significant             150        50
Excess (probable real pos.)       50        40
"False Discovery Rate"           67%       20%

Histogram of p-values showing an excess of positives
With 5% bins, if there is nothing going on you expect 5% of the tests to end up in each bin - and therefore also 5% below 0.05; the excess in the first bin indicates real positives.

What significance level to choose?
- This depends on what you want to do with the results!
- If you will do validation by RT-PCR, further experiments and more validation studies, you might be happy with up to 50% false ones in your set.
- If follow-up is very costly (drug testing, ...), or when filing a patent application, you probably want more scrutiny, e.g. <10% false ones in your set.
- The power of the experiment also sets limits:
  - Poor power: e.g. 10 genes with only 1 false positive
  - Good power: e.g. 50-100 genes with only 1-2 false positives
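The arithmetic behind the 10,000-gene table above is simply: under the null hypothesis you expect α × n false positives, and comparing that expectation with the number actually called significant gives the crude FDR estimate. A one-line Python sketch:

```python
def expected_fdr(alpha, n_tests, n_significant):
    """Crude FDR estimate: under the null you expect alpha * n_tests
    false positives; the FDR is their share among the genes actually
    called significant."""
    expected_false = alpha * n_tests
    return expected_false / n_significant

# With alpha = 1% on 10,000 genes, 100 false positives are expected;
# 150 significant calls then imply an FDR of about 67%:
fdr_1pct = expected_fdr(0.01, 10000, 150)
fdr_01pct = expected_fdr(0.001, 10000, 50)
```

This is only the back-of-envelope version; the ordered-p-value procedure described on the next slide formalises the same idea per gene.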
Approaches to determine the FDR
- Many approaches work on the histogram of p-values to determine the excess. A way to formalise this:
  - Order the n p-values
  - Compute for the i-th p-value the bound i·α/n
  - Set the cut-off just before i·α/n exceeds the i-th p-value
- You can compute 'q' values as an aid to selecting by FDR
- Based on permutation of the data, as in SAM
- This depends on the assumption that most genes are neutral
- We will see more of this in the practicals...

4. Something on clustering

Clustering
- Groups your data. For microarrays it is usually applied to group genes, e.g. genes with a similar profile in time, or genes with similar up/down regulation in condition x
- Cannot replace statistical analysis; it is usually applied after statistical analysis, to cluster only the genes which show an effect. Including all data would add too much "noise".
- There are many methods; some require you to specify a priori the number of groups you want.

Microarray data for clustering
(Slide by Dan Nettleton, Iowa State Univ.: a table of estimated expression levels with genes as the objects (rows) and time points 1..m as the attributes (columns).)

Illustration: 3 clusters for the time profiles of the top 400 genes
(Plots by Dan Nettleton, Iowa State Univ.)

How do we cluster (1)?
Somehow we have to compute whether two rows of the data table look "similar": use e.g. correlation or "distance".

How do we cluster (2)?
- Correlation-based: groups on similarity irrespective of the means; for instance, it groups a constant-high and a constant-low profile together. PCA is also correlation-based.
- Distance-based: the actual data have to be close; this will not group a constant-high and a constant-low profile.

Two main approaches to clustering
- "K-means": groups the records into a predefined number of groups, iteratively, so that the distances within groups are smallest. It starts randomly with one record per group, and then adds the following records.
- Hierarchical (bottom-up): groups the two "closest" records into one group, then repeats to find the next closest records/groups. The groups get bigger, allowing a dendrogram to be made.

Example of k-medoids clustering (slides by Dan Nettleton, Iowa State Univ.)
- Start with K medoids
- Assign each point to the closest medoid
- Find a new medoid for each cluster: the new medoid has the smallest average distance to the other points in its cluster
- Reassign each point to the closest medoid
- Repeat; when no reassignment is needed, the procedure stops

Illustration of hierarchical clustering
(Plot by Dan Nettleton, Iowa State Univ.)
Hierarchical clustering
(Plot by Dan Nettleton, Iowa State Univ.: cutting the dendrogram of 5 observations at different heights yields 2, 3 or 4 clusters.)

3rd half day

Outline
- Linear models to analyse designs with multiple treatments or times
- Further refinement of the handling of variance heterogeneity, using "Bayesian smoothing"

How many comparisons?
In a 2x2 design with day and treatment:
- Day 5 vs Day 7 (within control)
- Day 5 vs Day 7 (within Atten)
- Contr vs Atten (within day 5)
- Contr vs Atten (within day 7)
... but one is redundant (e.g. the 4th can be derived from the 3rd and 2nd).

Comparisons made in a linear model
For the 2x2 design the linear model is commonly used to yield:
- An overall day effect ("main effect"): the difference between day 5 and day 7, pooled over treatments
- An overall treatment "main" effect (similarly pooled over days)
- An interaction, to specify ONE deviation from the main effects, e.g. for treatment Atten within day 7

1. Linear models: estimability, setting up effects

Basics of linear models
- With a linear model we describe an observation as a sum of "constant" effects: Mi = mean + "day 3 effect" + "challenge effect"
- By a "constant" effect is meant that one "challenge effect" is computed irrespective of the day (unless specified otherwise)

Example linear model

           Day 3  Day 5  Day 7  Day 11
Control      x      x      x      x
Challenged   x      x      x      x

The difference between these rows is the overall effect of challenged vs.
control, assumed to be the same at all days.

The linear model, more formally:

           Day 3   Day 5    Day 7    Day 11
Control    µ       µ+d5     µ+d7     µ+d11
Challenged µ+c     µ+d5+c   µ+d7+c   µ+d11+c

The "challenged - control" difference is c at every day, and the "day 5 - day 3" difference is d5 in both rows.

The linear model
- The effects so described need not match the actual data: this is the model that describes the data in the "best" way under the assumptions; there will be some remaining error.
- We do not have a day 3 effect or a control effect: the "day 3, control" cell is the reference, and the model mean µ refers to this reference cell. We can take another reference cell, and therefore another model mean.

The same ("equivalent") model, but with reference cell "day 7, challenged":

           Day 3      Day 5      Day 7   Day 11
Control    µ+ct+d3    µ+ct+d5    µ+ct    µ+ct+d11
Challenged µ+d3       µ+d5       µ       µ+d11

What is now estimated as the "control effect" (ct) is the same as minus the "challenged effect" in the previous model; in this way all estimates can be transformed between the two equivalent models.

General rules for estimable effects
- We have to select, arbitrarily, a reference cell which gets assigned the model mean
- Most statistical software does this automatically, e.g. choosing the cell with all first levels
- For every factor there will be (# levels - 1) additional effects to estimate: 3 effects for 4 days, 1 effect for control vs. challenged

One extension: the interaction
Sometimes the "constant"-effects model is not satisfactory, and we can also include an interaction (this remains a "linear model"):

           Day 3   Day 5
Control    µ       µ+d
Challenged µ+c     µ+c+d+cd

The extra term cd says that something extra is happening on day 5 in the challenged animals.

Setting up the linear model fit
Designs may have multiple factors, e.g. 2 lines, 2 treatments, 3 days. A typical lay-out may then look like:

          On Cy3        On Cy5
Slide 1   L1, T1, D1    L1, T1, D3
Slide 2   L1, T2, D1    L1, T2, D3
etc.
But this is not the way to put it into a linear model...

Linear model set-up
The "factors" and their levels are, for instance: line (1, 2), treatment (1, 2), day (3, 5, 7). A more typical data lay-out for linear models looks like:

             Line  Treatment  Day
Observ 1?    1     1          3
Observ 2?    1     2          5
etc.         2     1          7

Complication: the "observations" mostly used are the "M" values, which do not allow this simple linear model set-up directly. In R some special functions are supplied to do so. Alternative: the "Wolfinger" model discussed later.

Set-up for a reference design
- There is only one "real" experimental measurement per slide
- Set up the linear model table with the combination of levels for that one experimental measurement; the other measurement is the reference
- Add an extra column which only indicates whether the reference was on cy3 or on cy5
- A specially supplied R function makes the final linear model set-up:
  des <- refdesign(~factor(line)+factor(day), ref)

A look at the final set-up for a reference design
- It can be instructive to look at the final design made by the refdesign() function
- This design contains -1's and +1's, depending on the factors available and on whether the slide was "swapped" or not (inverting the contrasts)
- The exact set-up of the contrasts determines, for instance, whether a treatment effect is (T1-T2) or (T2-T1)

Set-up for direct comparisons
Here two "real" experimental measurements are made, so we have two different sets of factor levels for the cy3 and cy5 measurements. The easiest set-up fills one line for each of cy3 and cy5:

Slide  Dye  Line  Treatment  Day
1      3    1     1          3
1      5    2     1          5
2      3    1     2          3
2      5    2     2          5
etc.
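The reference-cell coding described above (a model mean for the reference cell, plus (# levels - 1) dummy effects per factor) can be sketched generically. This is an illustration of the coding rule only, not the course's refdesign() function, and the factor names are made up for the example:

```python
def model_matrix(rows, factors):
    """Reference-cell coding sketch: an intercept column ("mu") plus
    (#levels - 1) dummy columns per factor, with the first level of
    each factor serving as the reference cell."""
    levels = {f: sorted({r[f] for r in rows}) for f in factors}
    cols = ["mu"]
    for f in factors:
        cols += [f"{f}{lv}" for lv in levels[f][1:]]   # drop the reference level
    X = [[1] + [1 if r[f] == lv else 0
                for f in factors for lv in levels[f][1:]]
         for r in rows]
    return cols, X

rows = [{"line": 1, "day": 3}, {"line": 1, "day": 5}, {"line": 2, "day": 3}]
cols, X = model_matrix(rows, ["line", "day"])
# cols -> ['mu', 'line2', 'day5']
```

For an M-value analysis of two-colour data, a real design function additionally flips rows to -1/+1 according to the dye orientation, which is what determines whether a contrast comes out as (T1-T2) or (T2-T1).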
Alternative model: the Wolfinger model
- In this model the data set-up made for direct comparisons (with a "dye" column) is used directly, adding "dye" to the model
- The analysed data are not M-values but the log-intensity values
- It cannot (simply) be applied to a reference design, but it is a convenient model for direct-comparison designs

The "Wolfinger" model
- Advantages:
  - The data double: there are two intensity values instead of one difference (M value)
  - Linear models can be applied directly
  - A "dye effect" can be analysed to detect bad dye swaps
- Disadvantage:
  - It assumes good repeatability of the level of expression across slides; an analysis of M-values only assumes good repeatability of the ratio of expressions across slides

Detection of a "dye effect"
- Sometimes it happens that red spots are also red in a dye swap (when they should have turned green), and vice versa -> a "dye effect"
- You would fit the model (per gene): logExpr ~ mu + dye + day + treatm
- If everything goes well, dye should have no effect; the effects should be taken up by day and treatment
- If dye effects become significant (too often), something is wrong...

A final note on linear models
- From linear models we can obtain several contrasts, e.g. differences between 2 days, between 2 lines, and even an assessment of genes that differ between days within 1 line only
- Obviously, every such contrast relates to a different top list of genes! So we obtain many gene sets: genes doing something between days, between lines, between days within one line only, etc.

2.
Another look at variance heterogeneity: Bayesian variance smoothing

More problems of variance heterogeneity
- Variance estimates (per gene) are based on small numbers, and can therefore accidentally happen to be (very) small or (very) large:
  - An accidentally low variance -> a high t-statistic
  - An accidentally high variance -> a low t-statistic
- We can apply the philosophy of "random effects" to the variances: we do not believe too much in extreme values, certainly not when they are based on small numbers, and we like to pull them towards a common mean

Variance smoothing
- The function eBayes from the limma / Bioconductor package does this variance smoothing
- The effect is especially that accidentally small variances are pulled up, so extreme t-statistics are tempered
- The pull towards the average is stronger when the variance is computed on a small number of observations

Gene variances before and after eBayes smoothing
(Plot: per-gene variances before and after smoothing, showing the shrinkage towards the common value.)
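The core of this shrinkage is a degrees-of-freedom-weighted average of each gene's sample variance with a common prior variance; limma's eBayes uses a moderated variance of this form. A minimal sketch, in which the prior variance and prior degrees of freedom are simply supplied rather than estimated from the data as limma does:

```python
def smooth_variances(s2, df, s2_prior, df_prior):
    """Shrink each per-gene sample variance towards a common prior value,
    weighting by degrees of freedom:
        s2_tilde = (df_prior * s2_prior + df * s2) / (df_prior + df)
    With small per-gene df the pull towards s2_prior is strong; with
    large df the gene's own estimate dominates."""
    return [(df_prior * s2_prior + df * s) / (df_prior + df) for s in s2]

# An accidentally small variance (0.1) is pulled up, a large one (4.0)
# is pulled down, both towards the prior value of 1.0:
smoothed = smooth_variances([0.1, 4.0], df=2, s2_prior=1.0, df_prior=4)
```

The moderated t-statistic then uses the smoothed variance in its denominator, which is exactly why accidental extreme t-statistics are tempered.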