GeneExpression II:
1. Transcription Factor Binding Sites
2. Microarrays
26th May, 2010
Karsten Hokamp
Genetics Department
GeneExpression II BI2010

TFBS prediction - Overview
• Introduction
• Methods
• Implementations
• Analyse 2 kb upstream of eve

TFBS prediction - Introduction
• TFBS (transcription factor binding sites) = DNA motifs
  = 5 – 20 bp long
  = variable
  = multiple occurrences/sites per gene
  = combination of activators and repressors
• cis-regulatory regions = clusters of TFBS
  = located between -20 kb and the first intron

TFBS prediction - Introduction
Example: MSE2 strip for eve (D. melanogaster) (Janssens et al., 2006)
• understand transcriptional regulation
• infer regulatory networks

TFBS prediction - Methods
• De novo motif prediction (overrepresentation)
• Searching for known motifs
• Phylogenetic Footprinting/Shadowing
• Clustering of TFBSs
• Integration of external data sources (co-expression, structure)

TFBS prediction - Overview
[Figure: overview of TFBS prediction; Hannenhalli (2008, Bioinformatics)]

De novo motif prediction
• Search for over-represented motifs
• Frequency count
• Works well for yeast and prokaryotes
• Not so successful in higher organisms

Using motif databases
• Search for known motifs
• Position specific scoring matrix (PSSM) or Position weight matrix (PWM)
• Databases:
  – Transfac
  – JASPAR

Phylogenetic-based methods
• Search for islands of highly conserved regions
• Footprinting: elements conserved across distant species
• Shadowing: elements conserved between closely related species
• Pros: increases specificity
• Cons: conservation is neither sufficient nor necessary

Practical:
• Try some tools on 2 kb upstream sequence of D. melanogaster eve and compare with published results.
  – Alibaba (de novo)
  – Match (Transfac)
  – Meme (de novo)
  – Promo (Transfac)
  – WeederH (phylogenetic footprinting)

Other tools:
• Many more tools available for download:
  – Sombrero
  – FootPrinter
  – PhyloGibbs
• Other Web-tools for groups of co-regulated genes:
  – RSAT
  – NestedMICA
  – WebMOTIFS

TFBS prediction - Conclusion:
• No single tool gives accurate results
• Combination of predictions from multiple tools might increase specificity
• Incorporate additional information for greater precision

Microarrays - Overview
• Introduction
• Data Generation
• Data Characteristics
• Diagnostic Plots
• Preprocessing
• Statistical Analysis

What is a microarray?
• A solid support onto which the sequences from thousands of different genes are immobilized
• Different array supports
  - glass slide
  - nylon membrane
  - silicon chip
• Different probe types
  - short oligonucleotides
  - long oligonucleotides
  - cDNA
• Each probe measures the expression of a single transcript

Microarrays – How do they work?
Affymetrix Arrays: single colour
[Diagram: uninfected cells + infected cells; RNA; reverse transcription; label with dye; cDNA; hybridize; Slide A, Slide B]

Microarrays – How do they work?
Spotted Arrays: two colour
[Diagram: prepare sample (uninfected cells + infected cells); prepare microarray; hybridize target to microarray]

Microarray: Subgrids
• One pin per subgrid (printTip group, stratus)

Microarrays – Data Extraction
• How to get data from the slides into the computer?

Data Extraction – Scanning
Slide → Scanner → Images (TIFF)
Scanner settings:
- laser power
- sensitivity
- focus
[Images labelled PRMS02-001-S100 CF010: channel 1 (green), channel 2 (red), composite (green, yellow, red)]

Data Extraction – Quantification
• align grid, tag unreliable spots
• foreground (FG), background (BG)
• Software:
  - ImaGene
  - GenePix
  - ScanAlyze
  ...
• program assigns numbers representing intensity of spot

Data File:
Spot ID   FG CH1   BG CH1   FG CH2   BG CH2   FL
GFP         1241      671     6707      713    1
PA0080       570      495      599      384    0
PA0080       691      632      667      651    0
PA0122       703      610      653      619    0
PA0122       708      598      695      602    0
..             …        …        …        …    …

Quantification: Intensity Range
- area composed of pixels
- value range: 0 – 2^16 - 1 (= 0 – 65535)
- saturation possible
- low intensities = noise

Data Generation – Summary
• RNA labelling and hybridization
• Array Scanning
• One image per channel
• Load into quantification software
• Flag flawed spots
• Extract values
• Text file with FG and BG intensities (per probe)

Microarrays – Sources of Variation
[Diagram (source: www.tigr.org): Sample1 mRNA → RT with Cy3 → Cy3-cDNA and Sample2 mRNA → RT with Cy5 → Cy5-cDNA, hybridized to a cDNA array, giving .tiff image files and a raw data file of Cy3 and Cy5 intensities. Labelled sources of systematic experimental error: uneven hybridization, gel, print-tip variations, wavelength-dependent and intensity-dependent effects, background variations, algorithm-dependent image processing.]

Microarrays – Sources of Variation
• Technical:
  – labelling
  – hybridization
  – slide quality
  – scanning
  – print-tip effect
  – quantification
  – experimenter
• Biological:
  – individual/strain/sample
  – environment
  – time point

Microarrays – Data Characteristics
• Intensities vs. ratios
• Natural scale vs. log scale

Intensities vs. Ratios
• Intensities:
        ch1     ch2
gene1   517     2100
gene2   3200    13000
gene3   3200    800
gene4   12000   3000

Intensities vs. Ratios
• Ratios: ratio = ch2 / ch1
  ratio > 0
  ratio = 1 if ch1 = ch2
        ch1     ch2     ratio
gene1   517     2100    4.06
gene2   3200    13000   4.06
gene3   3200    800     0.25
gene4   12000   3000    0.25

Intensities vs. Ratios
• Ratios
  – convey expression changes
  – hide base level differences
• But: absolute changes can be important, too!

Graphical Representation: Signal Scatter Plot
[Scatter plot: X = CH1 (Cy3), Y = CH2 (Cy5), axes marked at 3000 and 18000, diagonal line marks ratio = 1; plotted spots:]
        ch1     ch2
spot1   517     2100
spot2   3200    13000
spot3   3200    800
spot4   12000   3000

Graphical Representation: Signal Scatter Plot
[Scatter plot: CH1 (Cy3) vs. CH2 (Cy5) with ratio = 1 line; annotation "~ 10x"]

Graphical Representation: Histogram
[Histogram: frequency of ratios, ratio = 1 marked]

Raw vs. Log ratios
• Log transformation: x = base^y, here x = 2^y
  8 = 2^3
  0.125 = 2^-3
• y undefined for x <= 0

ratios
raw    log
0.1    -3.3
0.5    -1
1      0
2      1
10     3.3

Log ratios: scatter plot
[Scatter plots: raw scale (CH1: Cy3 vs. CH2: Cy5, diagonal ratio = 1) and log scale (CH1: log2(Cy3) vs. CH2: log2(Cy5), diagonal log-ratio = 0)]

Log ratios: histogram
[Histograms: frequency of ratios (ratio = 1 marked) and of log-ratios]

Microarrays – Data Characteristics
• ratios vs. intensities
  – convey expression changes
  – hide base level differences
• log ratios vs. raw ratios
  – reduce spread
  – provide symmetry

Diagnostic plots
• histogram
• scatter plot
• box plot
• MA plot
• chip visualization

Diagnostic plots – Histogram
[Figure: histograms of frequency vs. log(CH1) and log(CH2); one good and one bad example]

Diagnostic plots – Scatter plot
[Figure: two scatter plots, one labelled "o.k." and one labelled "bad"]

Diagnostic plots – MA plot
• Rotate scatter plot by ~45 degrees
• Mathematically:
  M (Minus) = log2(R) – log2(G)
  A (Addition) = 0.5 * ( log2(R) + log2(G) )

Diagnostic plots – MA plot
[Figure: MA plot, M on the vertical axis, A on the horizontal axis]

2-fold cut-off
[Figures: MA plots with a 2-fold cut-off]

Dye Swap
[MA plot: M = log(R/G) vs. A = ½ log(RG)]
• Unequal labeling efficiency of Cy3 (Cy3-cDNA) and Cy5 (Cy5-cDNA)
• Strong bias towards Cy3!

Dye Swap
[Diagram: two hybridizations with the dyes swapped (Cy5 vs Cy3 and Cy3 vs Cy5) between cDNA from uninfected cells and cDNA from infected cells; results combined into a merged data set]

Dye Swap
[MA plots (M = log(R/G) vs. A = ½ log(RG)) illustrating unequal labeling efficiency of Cy3 (Cy3-cDNA) and Cy5 (Cy5-cDNA)]

Diagnostic plots – Box plot
[Figure: annotated box plot showing outliers, whiskers (1.5 times interquartile range), inter-quartile range, upper quartile, median and lower quartile]

Diagnostic plots – Box plot
[Figure: two box plots, one labelled "o.k." and one labelled "bad"]

Diagnostic plots – Box plot (printtip)
[Figure: box plots per print-tip group]

Diagnostic plots – Chip visualization
[Figure: chip images, one good and one bad example]

Diagnostic plots: Summary
• histogram – data distribution (intensities, ratios)
• scatter plot – dye effect, print-tip effect
• box plot – equal average ratio and distribution, print-tip effect
• MA plot – dye effect and intensity-dependent ratio
• chip visualization – spatial bias, scratches, bubbles, smears

Microarrays – Preprocessing
• Flagging
• Background correction
• Normalization
• Flawed slides: Discard and repeat

Microarrays – Flagging
• Skip or keep (but warn)
• e.g. skip low intensities and saturated spots

Microarrays – Background correction
• Subtract background measurements from foreground intensities
• Brings intensities closer to zero, increases ratios:
  example spot with five-fold upregulation: 500 / 100 = 5
  subtract background (50) from both channels: 450 / 50 = 9
• Additional source of variance!

Microarrays – Normalization
• Remove effects of intensity, dye bias, spatial bias or print-tip variations:
  – Global mean, median
  – Loess, lowess
  – Print-tip loess
  – 2D loess
  – Variance stabilization (VSN)

Microarrays – Normalization
[Figure: MA plots (M vs. A) of raw data and after global mean, LOESS and printTip LOESS normalization]

Microarrays – Normalization
[Figure: comparison of raw data with global mean, LOESS and printTip LOESS normalization]

Microarrays – Discard and repeat
• Some slides turn out to be uncorrectable and need to be repeated (unless a sufficient number of replicates remains).
• Remember: bad data in = bad data out!

Microarrays – Statistical Analysis
• Replicates
• Variation
• t-tests
• multiple-testing correction
• gene lists

Statistical Analysis – Replicates
• Two types of repeats
• Technical:
  – multiple copies of probes on array
  – multiple repeats of hybridization (same RNA)
• Biological:
  – multiple hybridizations with RNA from multiple extractions
Need replicates to measure variation!

Statistical Analysis – Variation
• Biological variation different from technical
• Statistically incorrect to mix
• Important consideration for repeats: High confidence in results for
  a) one sample/patient/colony
  b) group of samples/patients/colonies
Prioritise biological repeats!
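
The background-correction arithmetic, the M and A definitions and global median normalization from the preprocessing slides can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of any tool named in these slides: the function names are hypothetical, spots at or below background are simply dropped, and the spot values are the example intensities from the ratio slides (ch1 = Cy3 = green, ch2 = Cy5 = red).

```python
import math
from statistics import median


def background_correct(foreground, background):
    """Subtract background; return None if nothing is left above background."""
    corrected = foreground - background
    return corrected if corrected > 0 else None


def ma_values(red, green):
    """M = log2(R) - log2(G); A = 0.5 * (log2(R) + log2(G))."""
    m = math.log2(red) - math.log2(green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a


def median_normalize(m_values):
    """Global median normalization: shift all M values so their median is 0."""
    centre = median(m_values)
    return [m - centre for m in m_values]


# Background example from the slide: 500 / 100 = 5 becomes 450 / 50 = 9.
raw_ratio = 500 / 100
corrected_ratio = background_correct(500, 50) / background_correct(100, 50)

# Example spots from the ratio slides as (ch1, ch2) = (green, red).
spots = [(517, 2100), (3200, 13000), (3200, 800), (12000, 3000)]
ma = [ma_values(red=ch2, green=ch1) for ch1, ch2 in spots]
normalized_m = median_normalize([m for m, _ in ma])

print(raw_ratio, corrected_ratio)
for (m, a), m_norm in zip(ma, normalized_m):
    print(f"M = {m:+.2f}  A = {a:.2f}  M after median centring = {m_norm:+.2f}")
```

Median centring only removes a constant dye bias; the intensity-dependent and print-tip effects listed on the normalization slide need loess-type methods, which are not shown here.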

Statistical Analysis – t-tests
Different classes of samples:
- find genes that are affected by a treatment
- p-value = degree of evidence
- H0: expression does not change
- t-test requires at least 2 replicates
- provides p-value for each gene

Statistical Analysis – multiple-testing correction
Carrying out t-tests on 10,000 genes: on average 500 will have a p-value <= 0.05 by chance alone
Methods for multiple testing:
- Bonferroni (very strict)
- Benjamini-Hochberg false-discovery rate (FDR)

Statistical Analysis – Gene lists
• List of good candidate genes to follow up
• FP vs FN (false positives vs false negatives)
• Fold-change vs p-value
Choice depends on downstream analysis
Input for downstream analysis: Clustering, pathway analysis, enrichment, etc.

Analysis tools
• Stand-alone tools:
  – R
  – BioConductor
  – ArrayNorm
  – TM4
  – GeneSpring (commercial)
• Web-based tools
  – ArrayPipe
  – ExpressYourself
  – GenePublisher
  – GEPAS
  – GeneTraffic (commercial)

Public Repositories
• ArrayExpress
  – EBI, MIAME-compliant
• Gene Expression Omnibus (GEO)
  – NCBI
  – "world's first write-only database"

Summary
• Many sources of variance
• Large numbers of replicates required for reliable results
• Data: be aware of flaws/bias
• Flagging/discarding results in data loss
• Correction often possible but can introduce artifacts
• However: Microarrays can still help to make great discoveries!

END
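
As a closing illustration of the multiple-testing slide, the two corrections named there can be sketched in plain Python. This is a hedged sketch rather than the implementation used by any of the listed tools: the function names are hypothetical, the four p-values are invented purely for illustration, and adjusted p-values are computed in the usual way (Bonferroni multiplies by the number of tests; Benjamini-Hochberg scales each p-value by n / rank and enforces monotonicity).

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg FDR adjustment, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the rank decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Slide example: 10,000 tests at a 0.05 threshold, assuming H0 holds for every gene.
genes_tested = 10_000
alpha = 0.05
expected_false_positives = genes_tested * alpha

# Hypothetical p-values for four genes (illustration only).
p_values = [0.01, 0.04, 0.03, 0.005]
print(expected_false_positives)
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

With these made-up values Bonferroni leaves fewer genes below 0.05 than Benjamini-Hochberg, matching the slide's note that Bonferroni is very strict.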