Probe analysis and data preprocessing
1. Affymetrix Probe level analysis
1) Normalization
Constant, Loess, Rank invariant, Quantile normalization
2) Expression measure
MAS 4.0, LI-Wong (dChip), MAS 5.0, RMA
3) Background adjustment
PM-MM, PM only, RMA, GC-RMA
2. Statistical analysis of cDNA array
1) Image analysis
2) Normalization
3) Assess expression level
(A case study with Bayesian hierarchical model)
4) Experimental design
Source of variations; Calibration and replicate; Choice of reference
sample; Design of two-color array
3. Preprocessing
1) Data transformation
2) Filtering (in all platforms)
3) Missing value imputation (in all platforms)
From experiment to down-stream analysis
Statistical Issues in Microarray Analysis
Workflow: experimental design → image analysis → preprocessing
(normalization, filtering, MV imputation) → down-stream analyses:
identification of differentially expressed genes, clustering,
classification, pathway analysis, regulatory network inference,
data visualization, and integrative analysis & meta-analysis.
Data Preprocessing
Preliminary analyses extract and summarize information
from the microarray experiments.
• These steps do not themselves yield biological discoveries,
• but they prepare the data for meaningful down-stream analyses with
targeted biological purposes (e.g. DE gene detection, classification,
pathway analysis, ...)
From scanned images
• Image analysis (extract intensity values from the images)
• Probe analysis (generate the data matrix of expression profiles)
• Preprocessing (data transformation, gene filtering and
missing value imputation)
1. Affymetrix probe level analysis
Overview of the technology
[Figure: hybridization, from Affymetrix Inc.]

Array Design
• 25-mer unique oligos; the mismatch probe differs at the middle nucleotide
• multiple probes (11~16) for each gene
[Figure from Affymetrix Inc.]
Array Probe Level Analysis
Pipeline: background adjustment → normalization → summarization
(some methods instead apply normalization before background adjustment).
• Give an expression measure for each probe set on each array
(how should the information from its 16 probes be pooled?)
• The result will greatly affect subsequent analyses (e.g. clustering
and classification). If not modeled properly:
=> "Garbage in, garbage out"

We will leave the discussion of "background adjustment" to the last
because there are more new, exciting technical advances there.
1.1 Normalization
The need for normalization:
          array1    array2
gene1     3308      4947.5
gene2     2334      3155.5
gene3     2518      3738
gene4     8882.5    18937
gene5     5041      12956.5
gene6     7314.5    19013.5
gene7     3508.5    8164
gene8     2183      5121.5
gene9     4790      8082
gene10    1645.5    1794.5
gene11    1772      1963
gene12    1802.5    2186.5
gene13    14846     35811
gene14    9986      25293
gene15    11640.5   21508
gene16    3860      6530
average   5339.5    11200.09

Intensities on array 2 are intrinsically larger than on array 1
(about two-fold).
1.1. Normalization
Reason:
1. Different labeling efficiency.
2. Different hybridization time or
hybridization condition.
3. Different scanning sensitivity.
4. …..
θ_gs : the underlying expression level
x_gs : the observed intensity

  x_g1 = f_1(θ_g1)   (array 1)
  x_g2 = f_2(θ_g2)   (array 2)
    ...
  x_gS = f_S(θ_gS)   (array S)
Normalization is needed in any microarray platform.
(including Affy & cDNA)
1.1. Normalization
Constant scaling
• Distributions on each array are scaled to have an identical mean.
Suppose array 1 is the reference array:

  x'_gs = (x̄_.1 / x̄_.s) · x_gs

• Applied in MAS 4.0 and MAS 5.0, but they perform the scaling after
computing the expression measure.
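The scaling above can be sketched in a few lines of numpy; the toy matrix below is a made-up example, not data from the slides.

```python
import numpy as np

def constant_scale(X, ref=0):
    """Scale each array (column) so its mean matches the reference array.

    X: genes x arrays intensity matrix; ref: index of the reference array.
    Implements x'_gs = (mean of reference array / mean of array s) * x_gs.
    """
    X = np.asarray(X, dtype=float)
    means = X.mean(axis=0)              # per-array means
    return X * (means[ref] / means)     # broadcast the scale over columns

# Toy example: array 2 is roughly twice as bright as array 1.
X = np.array([[100.0, 210.0],
              [200.0, 390.0],
              [300.0, 600.0]])
Xn = constant_scale(X, ref=0)
# After scaling, every column has the same mean as the reference column.
```
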
1.1. Normalization
Constant scaling: Underlying reasoning
Assumption:  x_g1 = f_1(θ_g1) = ν_1 θ_g1   (array 1)
             x_g2 = f_2(θ_g2) = ν_2 θ_g2   (array 2)
               ...
             x_gS = f_S(θ_gS) = ν_S θ_gS   (array S)

Suppose array 1 is the reference array:

  x'_gs = (x̄_.1 / x̄_.s) · x_gs ≈ ν_1 θ_gs,   s = 2, ..., S

ν_1 is not estimable, but ν_1/ν_s can be estimated as x̄_.1 / x̄_.s
if the overall expression levels are roughly the same (θ̄_.1 ≈ θ̄_.s).
1.1 Normalization
M-A plot (each point is a gene; the M = 0 line is the reference):

  M = log(x_gs / x_g1)
  A = log(x_gs · x_g1) / 2
1.1 Normalization
For non-differentially expressed genes, θ_gs = θ_g1, so

  M = log(x_gs / x_g1) = log(ν_s θ_gs / (ν_1 θ_g1)) = log(ν_s / ν_1) = constant
M-A plot shows the need for non-linear normalization.
The normalization factor is a function of the expression
level.
1.1 Normalization
Fit M̂ = f̂(A) by the 'lowess' function in S-Plus.

Replicate arrays: the same pool of sample is applied.

Normalized log-ratio:  M̃ = M - M̂
1.1 Normalization
Non-linear scaling: Underlying reasoning
Assumption:  x_g1 = f_1(θ_g1) = ν_1(θ_g1) θ_g1   (array 1)
             x_g2 = f_2(θ_g2) = ν_2(θ_g2) θ_g2   (array 2)
               ...
             x_gS = f_S(θ_gS) = ν_S(θ_gS) θ_gS   (array S)

Log relative expression level:

  h_gs = log(θ_gs / θ_g1)
       = log[ x_gs ν_1(θ_g1) / (x_g1 ν_s(θ_gs)) ]
       = log(x_gs / x_g1) + g(θ_g1, θ_gs)

where g(θ_g1, θ_gs) = log[ ν_1(θ_g1) / ν_s(θ_gs) ].
1.1 Normalization
Non-linear scaling: Underlying reasoning (cont’d)
Suppose we know the green genes are non-differentially expressed
genes (θ_gs = θ_g1). For those genes,

  g(θ_g1, θ_gs) = log[ ν_1(θ_g1) / ν_s(θ_gs) ] = log[ x_g1(θ_g1) / x_gs(θ_gs) ]

so the curve can be estimated at each average intensity A:

  ĝ(θ_g1, θ_gs) = log[ x_g1(A) / x_gs(A) ]

  ĥ_gs = log(θ_gs / θ_g1) = log(x_gs / x_g1) + ĝ(θ_g1, θ_gs)
The problem is: we usually don’t know which
genes are constantly expressed!!
1.1. Normalization
Loess (Yang et al., 2002)
• Uses all genes to fit a non-linear normalization curve on the M-A
plot scale (believing that most genes are constantly expressed).
• Performs normalization between arrays pairwise.
• Has been extended to perform normalization globally without selecting
a baseline array, but that extension is time-consuming.
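The intensity-dependent normalization above can be sketched as follows. A simple running-mean smoother stands in for lowess here to keep the sketch dependency-free; in practice one would fit M̂ = f̂(A) with a real lowess implementation (e.g. `statsmodels.nonparametric.lowess`). The toy data are synthetic.

```python
import numpy as np

def ma_normalize(x_ref, x_trt, window=101):
    """Intensity-dependent (lowess-style) normalization of one array
    against a reference array, on the M-A scale.

    Fits a smooth curve Mhat as a function of A (here: a running mean
    of M over genes sorted by A) and returns Mtilde = M - Mhat.
    """
    M = np.log2(x_trt / x_ref)
    A = 0.5 * (np.log2(x_trt) + np.log2(x_ref))
    order = np.argsort(A)               # genes sorted by average intensity
    Msorted = M[order]
    Mhat = np.empty_like(M)
    half = window // 2
    for i in range(len(M)):             # smooth M along increasing A
        lo, hi = max(0, i - half), min(len(M), i + half + 1)
        Mhat[order[i]] = Msorted[lo:hi].mean()
    return M - Mhat

# Toy example: the treatment array is a constant 2-fold brighter,
# so all the log-ratios should be removed by normalization.
rng = np.random.default_rng(0)
ref = rng.uniform(100, 10000, size=2000)
trt = 2.0 * ref
Mtilde = ma_normalize(ref, trt)
```
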
1.1. Normalization
Invariant set (dChip)
• Select a baseline array (the default is the one with median average
intensity).
• For each "treatment" array, identify a set of genes whose ranks are
conserved between the baseline and treatment array. This set of
rank-invariant genes is considered non-differentially expressed.
• Each array is normalized against the baseline array by fitting a
non-linear normalization curve through the invariant gene set:

  G = { g : |rank(x_gs) - rank(x_g1)| < d  and  l < rank[(x_gs + x_g1)/2] < G - l }

Tseng et al., 2001
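A minimal sketch of rank-invariant gene selection. The thresholds `d` and `l` below are illustrative choices, not the dChip defaults.

```python
import numpy as np

def invariant_set(x_base, x_trt, d=50, l=25):
    """Select rank-invariant genes between a baseline and treatment array.

    A gene is kept when its ranks on the two arrays differ by less than
    d, and its average-intensity rank avoids the l most extreme
    positions at either end.  Returns a boolean mask over genes.
    """
    G = len(x_base)
    r_base = np.argsort(np.argsort(x_base))   # 0-based ranks
    r_trt = np.argsort(np.argsort(x_trt))
    r_avg = np.argsort(np.argsort((x_base + x_trt) / 2.0))
    return (np.abs(r_base - r_trt) < d) & (r_avg >= l) & (r_avg < G - l)

# Toy example: a pure scaling preserves every rank, so the mask keeps
# all genes except the 2*l trimmed at the intensity extremes.
rng = np.random.default_rng(1)
base = rng.uniform(100, 10000, size=1000)
trt = 1.5 * base
mask = invariant_set(base, trt)
```
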
1.1. Normalization
Invariant set (dChip)
Advantage:
More robust than fitting with all genes as in loess, especially when
the expression distributions of the arrays are very different.
Disadvantage:
The selection of the baseline array is important.
1.1. Normalization
Quantile normalization (RMA) (Irizarry2003)
1. Given n arrays of length p, form X of dimension p×n where each
array is a column.
2. Sort each column of X to give X_sort.
3. Take the means across rows of X_sort and assign this mean to each
element in the row to get X'_sort.
4. Get X_normalized by rearranging each column of X'_sort to have the
same ordering as the original X.

     X           X_sort        X'_sort           X_normalized
  237  283      237  198     217.5  217.5      217.5  306
  341  397      329  283     306    306        338    399
  401  198      341  335     338    338        399    217.5
  329  335      401  397     399    399        306    338
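The four steps can be implemented directly in numpy; the code below reproduces the 4×2 worked example (ties are not handled specially in this sketch).

```python
import numpy as np

def quantile_normalize(X):
    """Quantile normalization: sort each column, average across rows of
    the sorted matrix, then put the row means back in each column's
    original order."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=0)              # sort permutation per column
    row_means = np.sort(X, axis=0).mean(axis=1)
    Xn = np.empty_like(X)
    for j in range(X.shape[1]):
        Xn[order[:, j], j] = row_means         # undo the sort per column
    return Xn

# The 4x2 worked example from the slide.
X = np.array([[237.0, 283.0],
              [341.0, 397.0],
              [401.0, 198.0],
              [329.0, 335.0]])
Xn = quantile_normalize(X)
```

After normalization each column holds exactly the same set of values (217.5, 306, 338, 399), arranged in each column's original order.
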
1.1. Normalization
Bolstad, B.M., Irizarry RA, Astrand, M, and Speed, TP
(2003), A Comparison of Normalization Methods for High
Density Oligonucleotide Array Data Based on Bias and
Variance Bioinformatics. 19(2):185-193
A careful comparison of different normalization methods, concluding
that quantile normalization generally performs the best.
1.2. Summarize Expression Index
There’re multiple probes for one gene (11 PM and 11 MM) in U133.
How do we summarize the 24 intensity values to a meaningful
expression intensity for the target gene?
23
1.2. Summarize Expression Index
• MAS 4.0
For each probe set (I: # of arrays, J: # of probes):

  PM_ij - MM_ij = θ_i + ε_ij,   i = 1,...,I,  j = 1,...,J

θ_i is estimated by the average difference.
1. Negative expression values can occur.
2. Noisy for low-expressed genes.
3. Does not account for probe affinity.
1.2. Summarize Expression Index
• dChip (DNA chips)
For each probe set (I: # of arrays, J: # of probes):

  PM_ij = ν_j + θ_i α_j + θ_i φ_j + ε_ij
  MM_ij = ν_j + θ_i α_j + ε_ij
  PM_ij - MM_ij = θ_i φ_j + ε_ij,   i = 1,...,I,  j = 1,...,J
  Σ_j φ_j² = J,   ε_ij ~ N(0, σ²)

1. Accounts for the probe affinity effect, φ_j.
2. Outlier detection through multi-chip analysis.
3. Recommended for more than 10 arrays.
Li and Wong (PNAS, 2001)
Multiplicative model: PM_ij - MM_ij = θ_i φ_j + ε_ij (better)
Additive model: PM_ij - MM_ij = θ_i + φ_j + ε_ij
1.2. Summarize Expression Index
• MAS 5.0
For each probe set (I: # of arrays, J: # of probes):

  log(PM_ij - CT_ij) = log(θ_i) + ε_ij,   i = 1,...,I,  j = 1,...,J

  CT_ij = MM_ij if MM_ij < PM_ij; a value less than PM_ij if MM_ij ≥ PM_ij.

θ_i is estimated by a robust average (Tukey biweight).
1. No more negative expression values.
2. Taking the log adjusts for the dependence of the variance on the mean.
1.2. Summarize Expression Index
• RMA (Robust Multi-array Analysis)
For each probe set (I: # of arrays, J: # of probes):

  log(T(PM_ij)) = θ_i + φ_j + ε_ij,   i = 1,...,I,  j = 1,...,J

T is the transformation for background correction and normalization;
ε_ij ~ N(0, σ²).
1. Log-scale additive model.
2. Suggests not using MM.
3. Fits the linear model robustly (median polish).
Irizarry et al. (NAR, 2003)
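The robust fit in the last step is Tukey's median polish, which alternately sweeps out row and column medians. A minimal sketch (on a noise-free toy probe set, not real chip data):

```python
import numpy as np

def median_polish(Y, n_iter=10):
    """Tukey's median polish on a matrix Y (arrays x probes), fitting
    Y_ij ~ overall + row_i + col_j robustly.
    Returns (overall, row effects, column effects, residuals)."""
    Y = np.asarray(Y, dtype=float).copy()
    overall = 0.0
    row = np.zeros(Y.shape[0])
    col = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        rmed = np.median(Y, axis=1)        # sweep out row medians
        Y -= rmed[:, None]
        row += rmed
        m = np.median(row)                 # keep row effects centered
        overall += m
        row -= m
        cmed = np.median(Y, axis=0)        # sweep out column medians
        Y -= cmed[None, :]
        col += cmed
        m = np.median(col)
        overall += m
        col -= m
    return overall, row, col, Y

# Toy probe set: 3 arrays x 4 probes, exactly additive with no noise.
theta = np.array([8.0, 9.0, 10.0])         # array (expression) effects
phi = np.array([-0.5, 0.0, 0.2, 0.3])      # probe affinity effects
Y = theta[:, None] + phi[None, :]
mu, t_hat, p_hat, resid = median_polish(Y)
# mu + t_hat recovers each array's expression up to a constant shift.
```
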
[Figure: Affymetrix Latin square data (20 vs. 1.25 spike-in levels);
three panels with R² = 0.85, 0.95 and 0.97.
From Irizarry et al. (NAR, 2003)]
[Figure: Affymetrix Latin square data, from Irizarry et al. (NAR, 2003)]
1.3. Background Adjustment
• Direct subtraction: PM - MM (MAS 4.0, dChip, MAS 5.0)
Assume the following deterministic model:

  PM = O + N + S   (O: optical noise, N: non-specific binding, S: signal)
  MM = O + N
  => PM - MM = S > 0

Is it true?
MM does not measure the background noise of PM
• Yeast sample hybridized to a human chip: if MM measured the
non-specific binding of PM well, we would see PM ≈ MM; but R² is only
0.5, and many MM > PM.
• 86 HG-U95A human chips (human blood extracts): a two-fork phenomenon
at high abundance; 1/3 of probes have MM > PM.
1.3. Background Adjustment
Reasons MM should not be used:
1. MM contains non-specific binding information but also includes
signal information and noise.
2. The non-specific binding mechanism is not well-studied.
3. MM is costly (it takes up half the space of the array).

• Ignore MM
dChip has an option for a PM-only model. In general, PM-only is
preferred for both the dChip and RMA methods.
1.3. Background Adjustment
Consider sequence information (Naef & Magnasco, 2003):
1. 95% of probes with MM > PM have a purine (A, G) in the middle base.
2. In the current protocol, only pyrimidines (C, T) carry the
biotin-labeled fluorescence.
1.3. Background Adjustment
Fit a simple linear model (α: probe affinity):
1. C > G ≈ T > A
2. Boundary effect
Naef & Magnasco, 2003
1.3. Background Adjustment
Some chemical explanation of the result (see next page):

  PM middle base             C        G        T        A
  MM middle base             G        C        A        T
  Labeling                   Yes (+)  No (-)   Yes (+)  No (-)
  Labeling impedes binding   Yes (-)  No       Yes (-)  No
  Hydrogen bonds             3 (+)    3 (+)    2        2
  Sequence-specific
  brightness                 High     average  average  Low
1.3. Background Adjustment
Double strand
Remember from the first lecture:
• G-C has three hydrogen
bonds. (stronger)
• A-T has two hydrogen
bonds. (weaker)
From: Lodish et al. Fig 4-4
1.3. Background Adjustment
• GC-RMA (Wu et al., 2004 JASA)

  PM = O + N_PM + S,   MM = O + N_MM

O: optical noise, log-normally distributed.
N: non-specific binding, with

  ( log(N_PM) )        ( ( μ_PM )      ( 1  ρ ) )
  ( log(N_MM) )  ~  N₂ ( ( μ_MM ) , σ² ( ρ  1 ) )

  μ_PM = h(α_PM),   μ_MM = h(α_MM)

h: a smooth (almost linear) function.
α: the sequence-information weight computed from the simple linear model.
Criteria to compare different methods
• Accuracy
– In well-controlled experiments with spike-in genes (such as the
Latin square data), the accuracy of the estimated log-fold changes
relative to the underlying true log-fold changes is assessed.
(Only available in data with spike-in genes.)
• Precision
– In data with replicates, the reproducibility (SD) of the same gene
across replicates is assessed. (Available in data with replicates.)
[Figures: accuracy and precision comparisons, including GC-RMA]
  Software      Fee         GUI   Flexibility for      Expression measure   Audience
                                  programming/mining
  MAS 4.0       Commercial  Yes   No                   Average difference   Manufacturer default
  dChip         Free        Yes   Some extra tools     Li-Wong model        Biologists
  MAS 5.0       Commercial  Yes   No                   Robust average of    Manufacturer default
                                                       log differences
  RMAExpress    Free        Yes   No                   RMA                  Biologists
  Bioconductor  Free        Some  Best                 All of the above     Statisticians, programmers
  ArrayAssist   Commercial  Yes   No                   RMA, GC-RMA          Biologists
Probe level analysis in Bioconductor
(affy package)
  Background methods:     none, rma/rma2, mas
  Normalization methods:  quantiles, loess, contrasts, constant,
                          invariantset, qspline
  PM correction methods:  mas, pmonly, subtractmm
  Summarization methods:  avgdiff, liwong, mas, medianpolish, playerout
A Simple Case Study
Latin Square Data
59 HG-U95A arrays
14 spike-in genes in 14 experimental groups
Experiment groups: A-L, plus two replicate groups M/N/O/P and Q/R/S/T.
Transcript groups: 1-14. The spike-in concentrations (pM) cycle through

  0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024

in a 14×14 cyclic Latin square: each transcript group's concentration
series across the 14 experiment groups is a one-position shift of the
previous group's series.
M, N, O, P are one set of replicates and Q, R, S, T are another.
http://www.affymetrix.com/analysis/download_center2.affx
A Simple Case Study
• Take the following two replicate groups:

  M: 1521m99hpp_av06.CEL    Q: 1521q99hpp_av06.CEL
  N: 1521n99hpp_av06.CEL    R: 1521r99hpp_av06.CEL
  O: 1521o99hpp_av06.CEL    S: 1521s99hpp_av06.CEL
  P: 1521p99hpp_av06.CEL    T: 1521t99hpp_av06.CEL

• Use Bioconductor to perform a simple evaluation of the different
probe analysis algorithms.
Note: This is only a simple demonstration. The
evaluation result in this presentation is not
conclusive.
A Simple Case Study
[Figures: average log intensity vs. SD of log intensities across
replicates (M, N, O, P) for MAS 5.0, dChip (PM/MM), dChip (PM only),
RMA, GC-RMA (PM only) and GC-RMA (PM/MM)]
A Simple Case Study
[Figures: average log intensity vs. SD of log intensities across
replicates (Q, R, S, T) for the same six methods]
A Simple Case Study
Average pair-wise correlations between replicates:

                      M, N, O, P    Q, R, S, T
  MAS5                0.8930        0.9002
  dChip (PM/MM)       0.9604        0.9621
  dChip (PM-only)     0.9940        0.9966
  RMA                 0.9978        0.9978
  GC-RMA (PM/MM)      0.9988        0.9990
  GC-RMA (PM-only)    0.9993        0.9994

Replicate correlation performance:
GC-RMA (PM-only) > GC-RMA (PM/MM) > RMA > dChip (PM-only)
>> dChip (PM/MM) >> MAS5
A Simple Case Study
• RMA greatly improves on dChip (PM/MM), but the dChip (PM-only) model
generally seems a little better than RMA.
• The average replicate correlations of RMA (0.9978) are a little
better than those of dChip (PM-only) (0.9940 & 0.9966).
• dChip (PM-only) suffers from a number of outlying genes in the model.

Outlying genes that do not fit the Li-Wong model
Conclusion:
1. Technological advances have been made toward smaller probe sizes
and better sequence selection algorithms that reduce the number of
probes per probe set. This enables more biologically meaningful genes
on a slide and reduces the cost.
2. Recent analysis advances have focused on understanding and
modelling hybridization mechanisms. This will allow better use of the
MM probes, or eventually suggest removing MMs from the array.
3. The probe analysis is relatively settled in the field. In the
second lab session next Friday, we will introduce dChip
and RMAexpress for Affymetrix probe analysis.
2. cDNA probe level analysis
cDNA Microarray Review
From Y. Chen et al. (1997)
cDNA Microarray Review
Probe (array) printing
1. 48 grids in a 12x4
pattern.
2. Each grid has 12x16
features.
3. Total 9216 features.
4. Each pin prints 3
grids.
cDNA Microarray Review
Probe design and printing
cDNA Microarray Review
From Y. Chen et al. (1997)
cDNA Microarray Review
Comparison of cDNA array and GeneChip:

                    cDNA                            GeneChip
  Probe             Probes are cDNA fragments,      Probes are short oligos
  preparation       usually amplified by PCR        synthesized using a
                    and spotted by robot.           photolithographic approach.
  Colors            Two-color (measures             One-color (measures
                    relative intensity)             absolute intensity)
  Gene              One probe per gene              11-16 probe pairs per gene
  representation
  Probe length      Long, varying lengths           25-mers
                    (hundreds to 1K bp)
  Density           Maximum of ~15,000 probes       38,500 genes × 11 probes
                                                    = 423,500 probes
Advantage and disadvantage of
cDNA array and GeneChip
  cDNA microarray                          Affymetrix GeneChip
  The data can be noisy and of             Specific and sensitive; results
  variable quality.                        very reproducible.
  Cross (non-specific) hybridization       Hybridization more specific.
  can often happen.
  May need an RNA amplification            Can use a small amount of RNA.
  procedure.
  More difficulty in image analysis.       Image analysis and intensity
                                           extraction are easier.
  Need to search databases for gene        More widely used; better quality
  annotation.                              of gene annotation.
  Cheap (both initial and per-slide        Expensive (~$400 per array +
  cost).                                   labeling and hybridization).
  Can be custom made for special           Only several popular species
  species.                                 are available.
  Do not need to know the exact DNA        Need the DNA sequence for probe
  sequence.                                selection.
2.1. Image Analysis
• Identify spot area:
1. Each spot contains around 100×100 pixels.
2. The spot image may not be uniformly and roundly distributed.
3. Some software (like ScanAlyze or ImaGene) has algorithms to "help"
place the grids and identify spot and background areas locally.
4. Still semi-automatic: a very time-consuming job.
• Extract intensities (data reduction):
1. Aim to extract the minimum most informative statistics for further
analysis. Usually use the median signal minus the median background.
2. Some spot quality indexes (e.g. stdev or CV) will also be computed.
2.1. Image Analysis
[Screenshot: ScanAlyze]
2.1. Image Analysis
1. Input the number of rows and columns in each
sector; input the approximate location and
distances between spots.
2. May need to tilt the grids
3. Some local adjustments may be needed.
4. Once the spot grids are close enough to the real
spot physical location, computer image algorithms
will help to find the optimal spot area (spherical or
irregular shapes) and background area.
May take 10~30 minutes for an array. Usually the
biologists will do it.
2.1. Image Analysis
http://www.techfak.uni-bielefeld.de/ags/ai/projects/microarray/
2.1. Image Analysis
Result file from image analysis
Summarized intensities for further analysis:
median(spot intensities) - median(background intensities)
2.2. Normalization
• Affymetrix
– Normalization done across arrays
– After normalization, the expression data matrix
shows absolute expression intensities.
• cDNA
– Normalization between the two colors within an array.
– After normalization, the expression data matrix shows comparative
expression intensities (log-ratios).
2.2. Normalization
• The same sample on both dyes.
• Each point is a gene.
• Orange is one array and purple is another array.

Calibration: apply the same samples to both dyes (E. coli grown in
glucose). Purple and orange represent two replicate slides.

  A = [log(Cy5) + log(Cy3)] / 2
  M = log(Cy5 / Cy3)
2.2. Normalization
Normalization, general idea:
• Dye effect: Cy5 is usually more bleached than Cy3.
• Slide effect.
• The normalization factor is slide dependent.
• Usually need to assume that most genes are not differentially
expressed, or that up- and down-regulated genes roughly cancel out
the expression effect.
2.2. Normalization
Normalization:
Current popular methods:
• House-keeping genes: select a set of non-differentially expressed
genes according to experience, then use these genes to normalize.
• Constant normalization factor:
  – Use the mean or median of each dye to normalize.
  – ANOVA model (Churchill's group)
• Average-intensity-dependent normalization:
  – Robust nonlinear regression (lowess) applied to the whole genome.
    (Speed's group)
  – Select invariant genes computationally (rank-invariant method),
    then apply lowess. (Wong's group)
2.2. Normalization
Loess Normalization:
Pin-wise normalization using all the genes. It requires the assumption
that up- and down-regulated genes with similar average intensities
(denoted A) roughly cancel out, or that most genes otherwise remain
unchanged.

[M-A plot with pin-wise lowess curves]
From Dudoit et al. 2000
2.2. Normalization
Rank Invariant Normalization:
Rank-invariant method (Schadt et al. 2001, Tseng et al. 2001).
A simple version:

  G = { g : |rank(Cy3_g) - rank(Cy5_g)| < 5 }

Iterative selection:

  S_0 = { g : |Rank(Cy5_g) - Rank(Cy3_g)| < p·G
              and  l < Rank[(Cy5_g + Cy3_g)/2] < G - l }
  S_i = { g : g ∈ S_{i-1}
              and  |Rank_{S_{i-1}}(Cy5_g) - Rank_{S_{i-1}}(Cy3_g)| < p·|S_{i-1}| }

Idea:
• If a particular gene is up- or down-regulated, its Cy5 rank in the
whole genome will differ significantly from its Cy3 rank.
• Iterative selection helps to select a more conserved invariant set
when the number of genes is large.
2.2. Normalization
Rank Invariant Normalization:
Data: E. Coli. Chip, ~4000 genes, from Liao lab.
Blue points are invariant genes selected by the rank-invariant method.
Red curves are estimated by lowess and extrapolation.
Data Truncation
• In cDNA microarrays, the intensity value lies between 0 and
2^16 = 65536.
• Measurements of low-intensity genes are not stable.
• Extremely highly expressed genes can saturate.
• For example, we can truncate genes with intensity smaller than 200
or larger than 65000.
2.3. Assess Expression Level
Approaches to assess expression level:
Single slide:
1. Normal model (Chen et al. 1997)
2. Gamma model with an empirical Bayes approach (Newton et al. 2001)
With replicate slides:
• Traditional t-test
• ANOVA model (Kerr et al. 2000)
• Permutation t-test (Dudoit et al. 2000)
Hierarchical structure:
• Linear hierarchical model (Tseng et al. 2001)
2.3. Assess Expression Level
Single slide analysis:
1. Chen et al. (1997)

  Cy5 ~ N(μ_R, σ_R²),  Cy3 ~ N(μ_G, σ_G²),
  with a constant coefficient of variation c = σ_R/μ_R = σ_G/μ_G.

Infer μ_R/μ_G from Cy5/Cy3 at known house-keeping genes in the
experiment.

2. Newton et al. (1999)
Assume gamma distributions to avoid negative values. Suggest inferring
the ratio conditional on the absolute intensities.
2.3. Assess Expression Level
Case study (Tseng et al. 2001):

125-gene project: each gene is spotted four times.
  Calibration: E. coli grown in acetate vs. acetate (C1S1~2);
  E. coli grown in glucose vs. glucose (C2S1~4, C3S1~2, C4S1~3).
  Comparative: E. coli grown in acetate vs. glucose (R1S1~2, R2S1~2).

4129-gene project: each gene is singly spotted.
  Calibration: E. coli grown in acetate vs. acetate (C1S1~2, C2S1~2).
  Comparative: E. coli grown in acetate vs. glucose (R1S1~2, R2S1~2).
2.3. Assess Expression Level
Hierarchical structure in the experiment design:

  Original mRNA pool
    → (reverse transcription & labeling) → C1, C2
    → (hybridize onto different slides) → C1S1, C1S2, C2S1, C2S2
2.3. Assess Expression Level
2.3. Assess Expression Level
Basics of Bayesian Analysis
  x_1, ..., x_n ~ f(x | θ)

θ: the unknown underlying parameter of interest
x_1, ..., x_n: observed data

Bayes rule:

  g(θ | x_1, ..., x_n) = f(x_1, ..., x_n | θ) h(θ)
                         / ∫ f(x_1, ..., x_n | θ) h(θ) dθ

h(θ) is a prior distribution (knowledge) of θ.
g(θ | x_1, ..., x_n) is called the posterior distribution of θ,
meaning how much we can say about θ given the data.
2.3. Assess Expression Level
Bayesian Hierarchical Model

  x_seg ~ N(θ_eg, σ_g²),    σ_g² ~ k σ̃_g² χ_k⁻²
  θ_eg ~ N(μ_g, τ_g²),      τ_g² ~ h τ̃_g² χ_h⁻²
  p(μ_g) ∝ 1

x: normalized log-ratios, log(Cy5/Cy3); e: experiment, s: slide, g: gene.
μ: underlying true expression level.
τ²: experimental (cultural) variation.
σ²: slide variation. h and k are adjustable parameters.

  Original mRNA pool (μ) -τ²-> C1, C2 -σ²-> C1S1, C1S2, C2S1, C2S2

Note only the x's are observed, e.g. ((0.75, 0.67), (0.45, 0.51)).
2.3. Assess Expression Level
~
g
g
2
 eg
2
g
h
~
2
g
g
2
k
x seg
xseg ~ N ( eg , g )
2
 g 2 ~ k~g 2  k 2
eg ~ N ( g ,  g 2 )
 g 2 ~ h~g 2  h 2
p  g   1
78
2.3. Assess Expression Level
How to specify the prior?
Empirical Bayes: use the empirical data to help specify the
hyperparameters (σ̃_g², τ̃_g²). A common version of EB is achieved by
integrating out the intermediate parameters and maximizing the
resulting marginal likelihood:

  max_{σ̃², τ̃²} p(σ̃², τ̃² | X)
    = max_{σ̃², τ̃²} ∫ p(σ̃², τ̃², μ, θ, σ², τ² | X) dμ dθ dσ² dτ²

This is hard to implement in this three-layer hierarchical model.
2.3. Assess Expression Level
Another version of EB: estimate the parameters in the prior from the
empirical data.

  σ̃² = Σ_{g,s,e} (y_gse - ȳ_g·e)² / [G(S-1)E]   (between-slide variation)
  τ̃² = Σ_{g,e} (ȳ_g·e - ȳ_g··)² / [G(E-1)]      (between-experiment variation)

Note:
• Since there are thousands of genes, the common EB problem of reusing
the data and getting an over-confident prior is alleviated.
• The estimation of τ̃² is more conservative.

  Original mRNA pool -τ²-> C1, C2 -σ²-> C1S1, C1S2, C2S1, C2S2
2.3. Assess Expression Level
Getting the prior distribution (when we have calibration experiments):

Calibration (normal vs normal) and comparative (cancer vs normal)
experiments share the same hierarchical structure (original mRNA pool
→ C1, C2 → slides). Gene-specific variance estimates are shrunk toward
array-wide estimates:

  σ̃_g² = [(S-1)·E·σ̂_g² + σ̂_A²] / [(S-1)·E + 1]
  τ̃_g² = [E·τ̂_g² + τ̂_A²] / [E + 1]

where

  σ̂_g² = Σ_{s,e} (x_gse - x̄_g·e)² / [(S-1)E]
  τ̂_g² = Σ_e x̄_g·e² / E
  σ̂_A² = Σ_{g,s,e} (x_gse - x̄_g·e)² / [G(S-1)E]
  τ̂_A² = Σ_{g,e} x̄_g·e² / (GE)

e.g. ((0.75, 0.67), (0.45, 0.51))
2.3. Assess Expression Level
MCMC (Gibbs sampling) for the hierarchical model:

1. Initialize (θ_eg)⁽⁰⁾ = x̄_·eg.
2. Draw τ_g² | θ_eg, μ_g ~ [Σ_e (θ_eg - μ_g)² + h τ̃_g²] χ⁻²_{E+h}.
3. Draw μ_g | θ_eg, τ_g² ~ N(θ̄_·g, τ_g² / E).
4. Draw σ_g² | θ_eg, x_seg ~ [Σ_{s,e} (x_seg - θ_eg)² + k σ̃_g²] χ⁻²_{SE+k}.
5. Draw θ_eg | x_seg, μ_g, σ_g², τ_g² ~
     N( (τ_g² Σ_s x_seg + σ_g² μ_g) / (S τ_g² + σ_g²),
        σ_g² τ_g² / (S τ_g² + σ_g²) ).

Iterate steps 2-5 until convergence.
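The Gibbs steps above can be sketched numerically for a single gene. The degrees of freedom in the variance updates and the default hyperparameter values below are illustrative simplifications, not the paper's exact formulas; the toy data are the ((0.75, 0.67), (0.45, 0.51)) example from the model slide.

```python
import numpy as np

def gibbs_one_gene(x, sig2_tilde=0.01, tau2_tilde=0.01, h=5.0, k=5.0,
                   n_iter=2000, burn=500, seed=0):
    """Gibbs sampler for one gene under the hierarchical model:
    x_se ~ N(theta_e, sigma2), theta_e ~ N(mu, tau2), flat prior on mu,
    with scaled inverse-chi-square priors on sigma2 and tau2.
    x: S x E matrix (slides x experiments) of normalized log-ratios.
    Returns post-burn-in posterior draws of mu."""
    rng = np.random.default_rng(seed)
    S, E = x.shape
    theta = x.mean(axis=0)               # step 1: start at slide means
    tau2, sig2 = tau2_tilde, sig2_tilde
    mu_draws = []
    for _ in range(n_iter):
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / E))
        tau2 = (np.sum((theta - mu) ** 2) + h * tau2_tilde) \
            / rng.chisquare(E + h)
        sig2 = (np.sum((x - theta) ** 2) + k * sig2_tilde) \
            / rng.chisquare(S * E + k)
        post_var = sig2 * tau2 / (S * tau2 + sig2)
        post_mean = (tau2 * x.sum(axis=0) + sig2 * mu) / (S * tau2 + sig2)
        theta = rng.normal(post_mean, np.sqrt(post_var))
        mu_draws.append(mu)
    return np.array(mu_draws[burn:])

# One gene, E = 2 experiments x S = 2 slides.
x = np.array([[0.75, 0.45],
              [0.67, 0.51]])
mu_post = gibbs_one_gene(x)
# The posterior for mu concentrates near the grand mean of the data.
```
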
2.3. Assess Expression Level
95% probability
interval of the
posterior distribution
of the underlying
expression level.
2.4. Experimental design
• Biological variation
Technical variations:
• Amplification
• Labeling
• Hybridization
• Pin effect
• Scanning
2.4.1 Calibration and replicate
(i) Calibration:
• Use the same sample on both dyes for hybridization.
• Calibration experiments help to validate experiment quality and
gene-specific variability.

  Comparative: Tumor vs Ref
  Calibration: Ref vs Ref
2.4.2. Calibration and replicate
(ii) Replicates (replicate spots, slides):
• Multiple-spotting helps to identify locally contaminated spots but
reduces the number of genes in the study.
• Multi-stage strategy: use single-spotting to include as many genes
as possible in a pilot study; identify a subset of interesting genes
and then use multiple-spotting.
• Replicate spots and slides help to verify reproducibility at the
spot and slide level.
2.4.2. Calibration and replicate
Biological replicate
From Yang, UCSF
2.4.2. Calibration and replicate
Technical replicate
From Yang, UCSF
2.4.2. Calibration and replicate
(iii) Reverse labelling:
Sample A
Sample B
Advantage:
• Cancels out linear normalization scaling and simplifies the
analysis. However, the linearity assumption is often not true.
• Helps to cancel out gene-label interactions if they exist.
2.4.3. Choice of reference sample
Different choices of reference sample:
a) Normal patient or time 0 sample in time course
study
b) Pool all samples or all normal samples
c) Embryonic cells
d) Commercial kit
Ideally we want all genes expressed at a constant
moderate level in reference sample.
2.4.4. Design issue
From Yang, UCSF
2.4.4. Design issue
Design issues:
(a) Reference design: the reference sample is redundantly measured
many times.
(b) Loop design
(c) Balance design
(v samples with v+2 experiments; v samples with 2v experiments)
See Kerr et al. 2001
Conclusion of cDNA array
1. The Affymetrix GeneChip is preferred when available.
2. Unlike GeneChip data, cDNA array data are usually noisier, and
careful quality control (replicates and calibration) is important.
But occasionally custom arrays are needed for specific research.
3. Analysis of cDNA microarrays also applies to other two-color
technologies such as array CGH and similar two-color oligo arrays.
4. The conservative "reference design" is usually more robust,
although it is not statistically the most efficient.
3. Data preprocessing
3.1. Data Truncation and Transformation
Transformation
1. Logarithmic transformation (most commonly used)
-- tend to get an approximately normal distribution
-- should avoid negative or 0 intensity before transformation
2. Square root transformation
-- a variance-stabilizing transformation under Poisson model.
3. Box-Cox transformation family
4. Affine transformation
5. Generalized-log transformation
For details, see chapter 6.1 in Lee's book; log10 or log2
transformation is the most common practice.
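Two of the transformations above can be sketched as follows. The flooring constant and the glog tuning constant `c` are arbitrary illustrative choices (in practice `c` is estimated from the data).

```python
import numpy as np

def log2_transform(x, floor=1.0):
    """Log2 after flooring, avoiding negative or zero intensities."""
    return np.log2(np.maximum(x, floor))

def glog(x, c=100.0):
    """Generalized log: behaves like log2 for large x, but remains
    defined (and roughly linear) near 0."""
    return np.log2((x + np.sqrt(x ** 2 + c ** 2)) / 2.0)

x = np.array([0.0, 50.0, 200.0, 6400.0])
y_log = log2_transform(x)
y_glog = glog(x)
# For large intensities glog and plain log2 agree closely.
```
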
3.2. Filtering
Filtering is an important step in microarray analysis:
1. Without filtering, many genes are irrelevant to the biological
investigation and add noise to the analysis. (Among the ~30,000 genes
in the human genome, usually only around 6,000 are expressed and vary
in a given experiment.)
2. But filtering out too many genes runs the risk of eliminating
important biomarkers.
3. Three common targets of filtering:
  1. Genes of bad experimental quality.
  2. Genes that are not expressed.
  3. Genes that do not fluctuate across experiments.
3.2. Filtering
Filter out genes with bad quality in cDNA array: Outputs
from imaging analysis usually have a quality index or flag
to identify genes with bad quality image.
Three common sources of bad quality probes:
1. Problematic probes: probes with non-uniform intensities.
2. Low-intensity probes: genes with low intensities are
known to have bad reproducibility and hard to verify by
RT-PCR. Normally genes with intensities less than 100 or
200 are filtered.
3. Saturated probes: genes with intensities reaching
scanner limit (saturation) should also be filtered.
For Affymetrix and other platforms, each probe (set) also has a
detection p-value, quality flag or present/absent call.
3.2. Filtering
Filtering by quality index:
Different array platforms and image analysis software use different
formats.
[Figure: example quality-index output flagging low-intensity spots]
3.2. Filtering
Filtering by quality index:
             Array 1    Array 2    ...    Array S
  Gene 1     342.061    267.247    ...    NA
  Gene 2     72.2798    54.2583    ...    49.4225
  Gene 3     69.6987    73.8338    ...    58.7938
  Gene 4     163.73     197.419    ...    196.236
  Gene 5     136.412    140.536    ...    146.344
  Gene 6     131.405    96.128     ...    93.5549
  :          :          :                 :
  Gene G-2   763.445    936.445    ...    768.63
  Gene G-1   NA         34.7477    ...    30.3535
  Gene G     12.5406    13.648     ...    15.9003

NA: not available. Missing values are due to bad quality, low
intensity or saturation.
3.2. Filtering
Filter genes with low information content:
1. Small standard deviation (stdev)
2. Small coefficient of variation (CV = stdev/mean)

[Figure: two genes measured over four samples, both with stdev = 6.45;
gene 1 (intensities around 115-130) has CV = 0.053, while gene 2
(intensities around 15-30) has CV = 0.29.]

Note: CV is more reasonable for original intensities, but for
log-transformed intensities stdev is enough. Why?
3.2. Filtering
Gene filtering
A simple gene filtering routine (I usually use) before down-stream
analyses:
1. Take the log (base 2) transformation.
2. Delete genes with more than 20% missing values among all samples.
3. Delete genes with average expression level less than, say, α = 7
(2^7 = 128). For Affymetrix and most other platforms, intensities less
than 100-200 are simply noise.
4. Delete genes with standard deviation smaller than, say, β = 0.4
(2^0.4 = 1.32, i.e. a 32% fold change).
5. Possibly adjust β so that the number of remaining genes is
computationally manageable in down-stream analysis (e.g. around
~5,000 genes).
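The routine above can be sketched directly in numpy; the toy matrix below is fabricated so that each failure mode appears once.

```python
import numpy as np

def filter_genes(X, max_missing=0.2, min_mean=7.0, min_sd=0.4):
    """Gene filtering on a genes x samples matrix of log2 intensities
    (NaN = missing).  Keeps genes with at most max_missing fraction of
    missing values, average expression >= min_mean (2^7 = 128 on the
    raw scale) and standard deviation >= min_sd."""
    miss_frac = np.isnan(X).mean(axis=1)
    mean = np.nanmean(X, axis=1)
    sd = np.nanstd(X, axis=1, ddof=1)
    return (miss_frac <= max_missing) & (mean >= min_mean) & (sd >= min_sd)

# Toy matrix: gene 0 passes; gene 1 is too dim; gene 2 is too flat;
# gene 3 has too many missing values.
X = np.log2(np.array([
    [4000.0, 1000.0, 2000.0, 8000.0],
    [40.0, 10.0, 20.0, 80.0],
    [4000.0, 4100.0, 3900.0, 4000.0],
    [4000.0, np.nan, np.nan, 8000.0],
]))
keep = filter_genes(X)
```
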
3.2. Filtering
Sample filtering (detecting problematic slides)
Compute correlation matrix of the samples:
1. Arrays of replicates should have high correlation. (m,n,o,p are
replicates and q,r,s,t are another set of replicates)
2. A problematic array is often found to have low correlation
with all the other arrays.
3. Heatmap is usually plotted for better visualization.
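The correlation check above can be sketched as follows; the 0.7 cutoff and the simulated data are illustrative assumptions, not values from the slides.

```python
import numpy as np

def flag_problematic(X, threshold=0.7):
    """Flag samples (columns of a genes x samples matrix) whose median
    correlation with the other samples falls below threshold; such an
    array is often problematic."""
    C = np.corrcoef(X, rowvar=False)       # sample-by-sample correlations
    np.fill_diagonal(C, np.nan)            # ignore self-correlations
    return np.nanmedian(C, axis=0) < threshold

# Simulated data: four replicate arrays sharing a common signal, plus
# one unrelated (problematic) array.
rng = np.random.default_rng(2)
signal = rng.normal(size=(500, 1))
X = np.hstack([signal + 0.1 * rng.normal(size=(500, 4)),
               rng.normal(size=(500, 1))])
flags = flag_problematic(X)
# Only the last, unrelated array is flagged.
```
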
3.2. Filtering
Diagnostic plot by correlation matrix

[Figure: heatmap of the sample correlation matrix for replicate groups m,n,o,p and q,r,s,t. White: high correlation; dark gray: low correlation.]
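The replicate check above can be simulated directly (the data and the mean-correlation flagging rule here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(8.0, 1.0, size=500)        # shared expression signal (500 genes)

# Four replicate arrays = signal + small noise, plus one problematic array
arrays = [signal + rng.normal(0.0, 0.1, size=500) for _ in range(4)]
arrays.append(rng.normal(8.0, 1.0, size=500))  # unrelated to the signal

corr = np.corrcoef(np.array(arrays))                    # sample x sample correlations
mean_corr = (corr.sum(axis=1) - 1) / (len(arrays) - 1)  # mean correlation with the others
suspect = int(mean_corr.argmin())
print(suspect)   # 4: the problematic array has low correlation with every other array
```

The replicates correlate near 1 with each other, while the problematic array's mean correlation is near 0, so it stands out immediately.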
3.3. Missing Value Imputation
Reasons for missing values in microarrays:
• spotting problems (cDNA)
• dust
• fingerprints
• poor hybridization
• inadequate resolution
• fabrication errors (e.g. scratches)
• image corruption

Many down-stream analyses require complete data, so "imputation" is usually helpful.
3.3. Missing Value Imputation

           Array 1    Array 2    ...    Array S
Gene 1     342.061    267.247    ...    NA
Gene 2     72.2798    54.2583    ...    49.4225
Gene 3     69.6987    73.8338    ...    58.7938
Gene 4     163.73     197.419    ...    196.236
Gene 5     136.412    140.536    ...    146.344
Gene 6     131.405    96.128     ...    93.5549
:          :          :          :      :
Gene G-2   763.445    936.445    ...    768.63
Gene G-1   NA         34.7477    ...    30.3535
Gene G     12.5406    13.648     ...    15.9003

It is common to have ~5% MVs in a study:
5,000 genes × 50 arrays × 5% = 12,500 missing values.
3.3. Missing Value Imputation
Existing methods
• Naïve approaches
  – Missing values = 0 or 1 (arbitrary signal)
  – Missing values = row (gene) average
• Smarter approaches have been proposed:
  – K-nearest neighbors (KNN)
  – Regression-based methods (OLS)
  – Singular value decomposition (SVD)
  – Local SVD (LSVD)
  – Partial least squares (PLS)
  – More: Bayesian principal component analysis (BPCA), least squares adaptive (LSA), local least squares (LLS)

Underlying assumption: genes work cooperatively in groups, so genes with similar patterns provide information for MV imputation.
3.3. Missing Value Imputation
KNN.e & KNN.c
• Choose the k genes that are most "similar" to the gene with the missing value (MV).
• Estimate the MV as the weighted mean of the neighbors.
• Considerations:
  – number of neighbors (k)
  – distance metric
  – normalization step

[Figure: expression profiles across arrays, with one randomly missing datum marked "?".]

3.3. Missing Value Imputation
KNN.e & KNN.c
• Parameter k
  – 10 usually works (5-15)
• Distance metric
  – euclidean distance (KNN.e)
  – correlation-based distance (KNN.c)
• Normalization?
  – not necessary for euclidean neighbors
  – required for correlation neighbors
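A minimal KNN.e sketch of the scheme above, using euclidean neighbors and inverse-distance weights (function name, simplifications, and toy matrix are mine, not from the slides; for simplicity, neighbor candidates must be observed on the relevant arrays):

```python
import numpy as np

def knn_impute(X, k=2):
    """KNN.e sketch for a genes x arrays matrix X (np.nan marks MVs):
    impute each MV as the inverse-distance-weighted mean of the k genes
    closest (euclidean distance over the observed arrays) to its gene."""
    X = X.copy()
    for g, a in np.argwhere(np.isnan(X)):
        obs = ~np.isnan(X[g])                   # arrays observed for gene g
        # candidates: genes observed on those arrays and at array a
        cand = [h for h in range(X.shape[0])
                if h != g and not np.isnan(X[h, a])
                and not np.isnan(X[h][obs]).any()]
        dist = np.array([np.linalg.norm(X[g, obs] - X[h, obs]) for h in cand])
        nn = np.argsort(dist)[:k]               # the k nearest neighbors
        w = 1.0 / (dist[nn] + 1e-9)             # inverse-distance weights
        X[g, a] = np.average([X[cand[i], a] for i in nn], weights=w)
    return X

# Gene 0 is missing its value on array 2; genes 1 and 2 are its neighbors
X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.1],
              [0.9, 1.9, 2.9],
              [9.0, 1.0, 5.0]])
print(knn_impute(X, k=2)[0, 2])   # ~3.0 (weighted mean of 3.1 and 2.9)
```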
3.3. Missing Value Imputation
OLS.e & OLS.c
• Regression-based approach: KNN + OLS.
• Algorithm:
  – Choose k neighbors (euclidean or correlation; normalized or not).
  – The gene with the MV is regressed on each neighbor gene, one at a time (simple regression).
  – For each neighbor, the MV is predicted from the regression model.
  – The MV is imputed as the weighted average of the k predictions.

[Figure: expression profiles across arrays with a randomly missing datum marked "?", and two neighbor regressions
y1 = a1 + b1·x1
y2 = a2 + b2·x2,
with the missing datum imputed as y = w1·y1 + w2·y2.]
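The regression step above can be sketched as follows (names, simplifications, and the toy matrix are mine; as in the KNN sketch, only completely observed genes serve as neighbor candidates):

```python
import numpy as np

def ols_impute_one(X, g, a, k=2):
    """OLS.e sketch: regress gene g on each of its k euclidean neighbors
    (one simple regression per neighbor, over the arrays where gene g is
    observed) and return the weighted average of the k predictions."""
    obs = ~np.isnan(X[g])                          # arrays observed for gene g
    cand = [h for h in range(X.shape[0])
            if h != g and not np.isnan(X[h]).any()]
    dist = np.array([np.linalg.norm(X[g, obs] - X[h, obs]) for h in cand])
    preds, weights = [], []
    for i in np.argsort(dist)[:k]:
        h = cand[i]
        b, a0 = np.polyfit(X[h, obs], X[g, obs], 1)  # fit y_g = a0 + b * y_h
        preds.append(a0 + b * X[h, a])               # predict the MV from neighbor h
        weights.append(1.0 / (dist[i] + 1e-9))       # inverse-distance weight
    return np.average(preds, weights=weights)

X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.1],
              [0.9, 1.9, 2.9],
              [9.0, 1.0, 5.0]])
print(ols_impute_one(X, g=0, a=2))   # ~3.0
```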
3.3. Missing Value Imputation
SVD
• Algorithm
  – Set MVs to the row average (need a starting point).
  – Decompose the expression matrix into orthogonal components, "eigengenes".
  – Use the proportion, p, of eigengenes corresponding to the largest eigenvalues to reconstruct the MVs from the original matrix (i.e. improve the estimate).
  – Use an EM approach to iteratively improve the estimates of the MVs until convergence.
• Assumption:
  – The complete expression matrix can be well decomposed by a small number of principal components.
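The SVD/EM loop above can be sketched in a single function (a simplified version; the function name, convergence rule, and toy example are mine):

```python
import numpy as np

def svd_impute(X, n_eigen=2, n_iter=200):
    """SVD-imputation sketch: start from row averages, then repeatedly
    reconstruct the matrix from its top n_eigen eigengenes and overwrite
    only the missing entries (EM-style) until the MVs stop changing."""
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=1, keepdims=True), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :n_eigen] * s[:n_eigen]) @ Vt[:n_eigen]  # rank-p reconstruction
        new = np.where(miss, low_rank, X)          # update only the MVs
        if np.allclose(new, filled, atol=1e-10):
            break
        filled = new
    return filled

# Toy rank-1 matrix with one entry removed; the true value (12.0) is recovered
base = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0])
X = base.copy()
X[2, 3] = np.nan
print(svd_impute(X, n_eigen=1)[2, 3])   # ~12.0
```

The toy example illustrates the assumption on the slide: because the data are well described by one eigengene, the rank-1 reconstruction pins down the missing entry.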
3.3. Missing Value Imputation
LSVD.e & LSVD.c
• KNN + SVD:
  – Choose k neighbors (euclidean or correlation; normalized or not).
  – Perform SVD on the k nearest neighbors and get a prediction of the missing value.
3.3. Missing Value Imputation
PLS
• PLS: select linear combinations of genes (PLS components) exhibiting high covariance with the gene having the MV.
  – The first linear combination of genes has the highest correlation with the target gene.
  – The second linear combination has the greatest correlation with the target gene in the space orthogonal to the first.
• MVs are then imputed by regressing the target gene onto the PLS components.
3.3. Missing Value Imputation
Types of missing mechanism:
1. Missing completely at random (MCAR)
   Missingness is independent of both the observed values and the unobserved values themselves.
   Examples: a spot missing due to mis-printing or a dust particle; a spot missing due to scratches.
2. Missing at random (MAR)
   Missingness is independent of the unobserved data but depends on the observed data.
3. Missing not at random (MNAR)
   Missingness depends on the unobserved data.
   Example: spots missing due to saturation or low expression.

Current imputation methods only work for MCAR, not MNAR.
Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes.
Guy N. Brock, John R. Shaffer, Richard E. Blakesley, Meredith J. Lotz, George C. Tseng.
BMC Bioinformatics, 2008.
3.3. MV imputation comparative study

Data set                Full Dim.    Used Dim.    Category                          Organism       Expression Profiles
Alizadeh (ALI)          13412 x 40   5635 x 40    multiple exposure                 H. sapiens     diffuse large B-cell lymphoma
Alon (ALO)              2000 x 62    2000 x 62    multiple exposure                 H. sapiens     colon cancer and normal colon tissue
Baldwin (BAL)           16814 x 39   6838 x 39    time series, non-cyclic           H. sapiens     epithelial cellular response to L. monocytogenes
Causton (CAU)           4682 x 45    4616 x 45    multiple exposure x time series   S. cerevisiae  response to changes in extracellular environment
Gasch (GAS)             6152 x 174   2986 x 155   multiple exposure x time series   S. cerevisiae  cellular response to DNA-damaging agents
Golub (GOL)             7129 x 72    1994 x 72    multiple exposure                 H. sapiens     acute lymphoblastic leukemia
Ross (ROS)              9706 x 60    2266 x 60    multiple exposure                 H. sapiens     NCI60 cancer cell lines
Spellman, AFA (SP.AFA)  7681 x 18    4480 x 18    time series, cyclic               S. cerevisiae  cell-cycle genes
Spellman, ELU (SP.ELU)  7681 x 14    5766 x 14    time series, cyclic               S. cerevisiae  cell-cycle genes

9 data sets: multiple exposure, time series, or both.
7 methods were compared: KNN, OLS, LSA, LLS, PLS, SVD, BPCA.
3.3. MV imputation comparative study
Global-based methods (PLS, SVD, BPCA):
estimate the global structure of the data to impute MVs.
Neighbor-based methods (KNN, OLS, LSA, LLS):
borrow information from correlated genes (neighbors).

Intuitively, global-based methods require that dimension reduction of the data can be performed effectively. We define an entropy measure for a given data set D to determine how well the data can be dimension-reduced (the λ_i are the eigenvalues):

e(D) = -Σ_{i=1}^{k} p_i log(p_i) / log(k),  where p_i = λ_i / Σ_{l=1}^{k} λ_l

Low entropy: the first few eigenvalues dominate and the data can be effectively reduced to low dimension.
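The entropy measure can be computed directly from a singular value decomposition. A sketch, taking λ_i as the squared singular values of the row-centered data matrix (one reasonable reading of the slide; the function name and the simulated low/high-entropy data are mine):

```python
import numpy as np

def entropy(X):
    """Entropy e(D) from the slide: p_i = lambda_i / sum(lambda),
    e(D) = -sum(p_i log p_i) / log(k), with lambda_i taken as the
    squared singular values of the row-centered data matrix."""
    s = np.linalg.svd(X - X.mean(axis=1, keepdims=True), compute_uv=False)
    lam = s ** 2
    k = len(lam)
    p = lam / lam.sum()
    p = p[p > 0]                                   # 0 * log 0 = 0 by convention
    return float(-(p * np.log(p)).sum() / np.log(k))

rng = np.random.default_rng(0)
low = np.outer(rng.normal(size=200), rng.normal(size=20))  # near rank-1 data
low = low + 0.01 * rng.normal(size=low.shape)
high = rng.normal(size=(200, 20))                          # full-rank noise
print(entropy(low), entropy(high))   # low entropy << high entropy
```

The near rank-1 matrix has one dominant eigenvalue and entropy near 0; the full-rank noise matrix spreads its eigenvalues and has entropy near 1.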
3.3. MV imputation comparative study
LRMSE is the performance measure; the lower the better.

KNN, OLS, LSA, and LLS are neighbor-based methods and work better in low-entropy data sets. PLS and SVD are global-based methods and work better in high-entropy data sets.

LRMSE(D̂_j^{(M_i,k)}, D_j) = β_{0i} + β_i · e(D_j) + γ_j + ε_{ijk}
3.3. MV imputation comparative study

Data set   Entropy   Optimal method (number of simulations, out of 50)
BAL        0.819     LSA (38), LLS (12)
CAU        0.838     LLS (45), LSA (5)
ALO        0.872     LSA (50)
GOL        0.876     LSA (50)
SP.ELU     0.909     LLS (41), BPCA (9)
GAS        0.911     LSA (50)
SP.AFA     0.940     LLS (40), BPCA (10)
ROS        0.944     LSA (50)
ALI        0.947     LSA (50)

[Table continues: for Simulations II and III, the methods selected by the EBS and STS schemes (mostly LSA, LLS, or BPCA) and their per-data-set selection accuracies, ranging from 0% to 100%.]

Three methods (LSA, LLS, BPCA) performed best, but none dominated. Two selection schemes (the entropy-based scheme, EBS, and the self-training scheme, STS) were proposed to select the best imputation method.