Microarray Informatics Donald Dunbar MSc Seminar 26th February 2011 Aims To give a biologist’s view of microarray experiments To explain the technologies involved To describe typical microarray experiments To show how to get the most from and experiment To show where the field is going February 26th 2011 MSc Seminar: Donald Dunbar Introduction Part 1 Microarrays in biological research A typical microarray experiment Experiment design, data pre-processing Part 2 Data analysis and mining Microarray standards and resources Recent advances February 26th 2011 MSc Seminar: Donald Dunbar Microarray Informatics Part 1 February 26th 2011 MSc Seminar: Donald Dunbar Biological research Using a wide range of experimental and computational methods to answer biological questions Genetics, physiology, molecular biology… Biology and informatics bioinformatics Genomic revolution What can we measure? February 26th 2011 MSc Seminar: Donald Dunbar The central dogma promoter exon intron exon intron intron exon 30k Gene: DNA 90k Transcript: mRNA 100+k Protein kinase, protease, structural receptor, ion channel… February 26th 2011 MSc Seminar: Donald Dunbar Measuring RNA and proteins Proteins Western blot ELISA Enzyme assay mRNA Northern blot RT-PCR February 26th 2011 MSc Seminar: Donald Dunbar Measuring RNA and proteins Protein levels would be best no real high throughput method mRNA levels will do genome-wide physical microarrays other ‘array-like’ technologies sequencing (see later) February 26th 2011 MSc Seminar: Donald Dunbar Measuring transcripts Genome level sequencing New miniaturisation technologies Better bioinformatics microarrays February 26th 2011 MSc Seminar: Donald Dunbar Microarrays: wish list Include all genes in the genome Include all splice variants Give reliable estimates of expression Easy to analyse bioinformatics tools available Cost effective February 26th 2011 MSc Seminar: Donald Dunbar Microarray technologies - 1 Oligonucleotides - Affymetrix One chip all genes Chips for many species Several oligos per transcript Use of control, mismatch sequences One sample per chip February 26th 2011 ‘absolute quantification’ Well established in research Expensive MSc Seminar: Donald Dunbar Microarray technologies - 1 February 26th 2011 MSc Seminar: Donald Dunbar Microarray technologies - 2 Illumina BeadChip Oligos on beads Hybridise in wells Compared to Affy Higher throughput Less RNA needed Cheaper February 26th 2011 MSc Seminar: Donald Dunbar Microarrays: wish list Include all genes in the genome Include all splice variants Give reliable estimates of expression Easy to analyse bioinformatics tools available Cost effective February 26th 2011 MSc Seminar: Donald Dunbar Problems with transcriptomics The gene might not be on the chip Can’t differentiate splice variants well The gene might be below detection limit Can’t differentiate RNA synthesis and degradation Can’t tell us about post translational events Bioinformatics can be difficult Relatively expensive February 26th 2011 MSc Seminar: Donald Dunbar History of Microarrays Developed in early 1990s after larger macro-arrays (100-1000 genes) Microarrays were spotted on glass slides Labs spotted their own (Southern, Brown) Then companies started (Affymetrix, Agilent) Some early papers: Int J Immunopathol Pharmacol. 1990 19(4):905-914. Raloxifene covalently bonded to titanium implants by interfacing with (3-aminopropyl)-triethoxysilane affects osteoblast-like cell gene expression. Bambini et al Nature 1993 364(6437): 555-6 Multiplexed biochemical assays with biological chips. Fodor SP, et al Science 1995 Oct 20;270(5235):467-70 Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Schena M, et al February 26th 2011 MSc Seminar: Donald Dunbar Microarray publications ‘hypertension’ and ‘microarray’ February 26th 2011 MSc Seminar: Donald Dunbar Types of experiment Usually control v test(s) Placebo Drug treatment Wild-type Knockout Healthy Patient Normal tissue Cancerous tissue Time = 0 Time = 1 February 26th 2011 Drug 2… Time = 2… MSc Seminar: Donald Dunbar Types of experiment Usually control v test(s) But also test v test(s) Comparison: placebo v drug treatment drug 1 v drug 2 tissue 1 v tissue 2 v tissue 3 (pairwise) time 0 v time 1, time 0 v time 2, time 0 v time 3 time 0 v time 1, time 1 v time 2, time 2 v time 3 February 26th 2011 MSc Seminar: Donald Dunbar A typical experiment experiment design February 26th 2011 MSc Seminar: Donald Dunbar Experiment design: system What is your model? animal, cell, tissue, drug, time… What comparison? What platform microarray? oligo, cDNA? Record all information: see “standards” February 26th 2011 MSc Seminar: Donald Dunbar Experiment design: replicates Microarrays are noisy: need extra confidence in the measurements We usually don’t want to know about a specific individual Biological replicates needed eg not an individual mouse, but the strain although sometimes we do (eg people) independent biological samples number depends on variability and required detection Technical replicates (same sample, different chip) usually not needed February 26th 2011 MSc Seminar: Donald Dunbar A typical experiment experiment design collect samples prepare RNA raw data chip process February 26th 2011 MSc Seminar: Donald Dunbar Raw data Affymetrix GeneChip process generates: DAT CEL CDF image file raw data file chip definition file Processing then involves CEL and CDF All platforms have different data formats… Will use Bioconductor February 26th 2011 MSc Seminar: Donald Dunbar Bioconductor (BioC) http://www.bioconductor.org/ “Bioconductor is an open source software project for the analysis and comprehension of genomic data” Started 2001, developed by expert volunteers Built on statistical programming environment Provides a wide range of powerful statistical and graphical tools Use BioC for most microarray processing and analysis Most platforms now have BioC packages Make experiment design file and import data February 26th 2011 MSc Seminar: Donald Dunbar Quality control (QC) Affymetrix gives data on QC the microarray team will record these for you scaling factor, % present, spiked probes, internal controls Bioconductor offers: boxplots and histograms of raw and normalised data RNA degradation plots specialised quality control routines (eg arrayQualityMetrics) February 26th 2011 MSc Seminar: Donald Dunbar Pre-processing: background Signal corresponds to expression… plus a non-specific component (noise) Non specific binding of labelled target Need to exclude this background Several methods exist eg Affy: PM-MM but many complications eg RMA PM=B+S (don’t use MM) February 26th 2011 MSc Seminar: Donald Dunbar Pre-processing: normalisation In addition to background corrections Make use of statistics combined control genes with probe set summary: total (assumptions not always appropriate) getintensity an expression value for the gene But seems to be non-linear dependency on intensity chip, probe, spatial, intra and inter-chip variation need to remove to get at real expression differences additive and multiplicative errors Quantile normalisation often used Normalisation more complicated for 2-colour arrays Try to remove most noise at lab stage (ie control things well statistically) February 26th 2011 MSc Seminar: Donald Dunbar A typical experiment experiment design collect samples processed data February 26th 2011 prepare RNA raw data chip process MSc Seminar: Donald Dunbar Part 1 Summary Microarrays in biological research Two types of microarray A typical microarray experiment Experiment design Data pre-processing February 26th 2011 MSc Seminar: Donald Dunbar Microarray Informatics Part 2 February 26th 2011 MSc Seminar: Donald Dunbar A typical experiment experiment design collect samples processed data February 26th 2011 prepare RNA raw data chip process MSc Seminar: Donald Dunbar Data analysis Identifying differential expression Compare control and test(s) t-test ANOVA SAM (FDR) Limma Rank Products Time series control treated v 0 1 v February 26th 2011 2 v MSc Seminar: Donald Dunbar 3 v Multiple testing Problem: statistical testing of 30,000 genes at α = 0.05 1500 genes Need to correct this Multiply p-value by number of observations • Bonferroni, too conservative False discovery • defines a q value: expected false positive rate • Less conservative, but higher chance of type I error • Benjamini and Hochberg Then regard genes as differentially expressed Depends on follow-up procedure! February 26th 2011 MSc Seminar: Donald Dunbar Hierarchical clustering Look for structure within dataset similarities between genes Compare gene expression profiles Euclidian distance Correlation Cosine correlation Calculate with distance matrix Combine closest, recalculate, combine closest… (or split!) Draw dendrogram and heatmap February 26th 2011 MSc Seminar: Donald Dunbar Hierarchical clustering Samples Genes February 26th 2011 MSc Seminar: Donald Dunbar Hierarchical clustering Heatmaps for microarray data February 26th 2011 MSc Seminar: Donald Dunbar Hierarchical clustering Predicting association of known and novel genes Class discovery in samples: new subtypes Visualising structure in data (sample outliers) Classifying groups of genes Identifying trends and rhythms in gene expression Caveat: you will always see clusters, even when they are not particularly meaningful February 26th 2011 MSc Seminar: Donald Dunbar Sample classification Supervised or non-supervised Non-supervised like hierarchical clustering of samples Supervised have training (known) and test (unknown) datasets use training sets to define robust classifier apply to test set to classify new samples February 26th 2011 MSc Seminar: Donald Dunbar Sample classification good prognosis drug treatment bad prognosis surgery Gene selection, training, cross validation classifier: gene x * 0.5 gene y * 0.25 gene z … ? February 26th 2011 ? ? ? ? MSc Seminar: Donald Dunbar Sample classification good prognosis drug treatment bad prognosis surgery Apply classifier February 26th 2011 MSc Seminar: Donald Dunbar Sample classification Class prediction for new samples cancer prognosis pharmacogenomics (predict drug efficacy) Need to watch for overfitting using too many parameters (genes) to classify classifier loses predictive power February 26th 2011 MSc Seminar: Donald Dunbar Annotation Big problem for microarrays Genome-wide chips need genome-wide annotation Good bioinformatics essential use several resources (Affymetrix, Ensembl) keep up to date (as annotation changes) genes have many attributes • name, symbol, gene ontology, pathway… February 26th 2011 MSc Seminar: Donald Dunbar Data-mining Microarrays are a waste of time …unless you do something with the data February 26th 2011 MSc Seminar: Donald Dunbar Data-mining Once data are statistically analysed: pull out genes of interest pull out pathways of interest mine data based on annotation • what are the expression patterns of these genes • what are the expression patterns in this pathway mine genes based on expression pattern • what types of genes are up-regulated … • fold change, p-value, expression level, correlation Should be driven by the biological question February 26th 2011 MSc Seminar: Donald Dunbar February 26th 2011 MSc Seminar: Donald Dunbar February 26th 2011 MSc Seminar: Donald Dunbar February 26th 2011 MSc Seminar: Donald Dunbar Further data-mining Other tools available using gene ontology (GO) February 26th 2011 MSc Seminar: Donald Dunbar Further data-mining Other tools available using gene ontology (GO) biological pathways (eg KEGG) February 26th 2011 MSc Seminar: Donald Dunbar Further data-mining Other tools available using gene ontology (GO) biological pathways (eg KEGG) genomic localisation (Ensembl) February 26th 2011 MSc Seminar: Donald Dunbar Further data-mining Other tools available using gene ontology (GO) biological pathways (eg KEGG) Gene set genomic localisation Enriched regulatory (Ensembl) sequences Functional significance? regulatory sequence data (Toucan, BioProspector) February 26th 2011 MSc Seminar: Donald Dunbar Further data-mining Other tools available using gene ontology (GO) biological pathways (eg KEGG) genomic localisation (Ensembl) regulatory sequence data (Toucan, BioProspector) February 26th 2011 MSc Seminar: Donald Dunbar Further data-mining Other tools available using gene ontology (GO) biological pathways (eg KEGG) genomic localisation (Ensembl) regulatory sequence data (Toucan, BioProspector) literature (eg Pubmatrix, Ingenuity…) February 26th 2011 MSc Seminar: Donald Dunbar Further data-mining Other tools available using gene ontology (GO) biological pathways (eg KEGG) genomic localisation (Ensembl) regulatory sequence data (Toucan, BioProspector) literature (eg Pubmatrix, Ingenuity…) … to make sense of the data February 26th 2011 MSc Seminar: Donald Dunbar Microarray Resources Microarray data repositories Array express (EBI, UK) Gene Expression Omnibus (NCBI, USA) CIBEX (Japan) Annotation NetAffx, Ensembl, TIGR, Stanford… BioConductor February 26th 2011 MSc Seminar: Donald Dunbar Microarray Standards MIAME Minimum annotation about a microarray experiment Comprehensive description of experiment Models experiments well, and allows replication • chips, samples, treatments, settings, comparisons Required for most publications now MAGE-ML Microarray gene expression markup language Describes experiment (MIAME) and data Tools available for processing February 26th 2011 MSc Seminar: Donald Dunbar Recent advances: Exon chips Affymetrix now have chips that allow us to measure expression of splice variants 0.66 (down moderately) 1.4 (up slightly) 3 (up strongly) New chips will give us much more information February 26th 2011 MSc Seminar: Donald Dunbar Recent advances: Genotyping chips All discussion on EXPRESSION chips Also can get chips looking at genotype Tell us the sequence for genome-wide markers Test 300,000 markers with one chip Look for association with disease, prognosis, trait… Combined with expression chips to generate EXPRESSION QUANTITATIVE TRAIT LOCUS (eQTL) Overlap of expression and genetic differences (cis) Correlation at different locus (trans) February 26th 2011 MSc Seminar: Donald Dunbar Next Generation Sequencing Sequence rather than hybridisation Gene expression, genotyping, epigenetics New technologies: much cheaper than before Open ended (no previous knowledge required) Will take over in 2 years: the end of microarrays? February 26th 2011 MSc Seminar: Donald Dunbar Part 2 Summary Data analysis Data Mining Microarray Resources Microarray Standards Recent & future advances February 26th 2011 MSc Seminar: Donald Dunbar Seminar Summary Part 1 Microarrays in biological research A typical microarray experiment Part 2 Data analysis and mining Recent & future advances February 26th 2011 MSc Seminar: Donald Dunbar Contact Donald Dunbar Cardiovascular Bioinformatics donald.dunbar@ed.ac.uk 0131 242 6700 Room W3.01, QMRI, Little France www.bioinf.mvm.ed.ac.uk February 26th 2011 MSc Seminar: Donald Dunbar