MSc Seminar: Donald Dunbar

advertisement
Microarray Informatics
Donald Dunbar
MSc Seminar
26th February 2011
Aims

To give a biologist’s view of microarray experiments

To explain the technologies involved

To describe typical microarray experiments

To show how to get the most from and experiment

To show where the field is going
February 26th 2011
MSc Seminar: Donald Dunbar
Introduction

Part 1
Microarrays in biological research
 A typical microarray experiment
 Experiment design, data pre-processing


Part 2
Data analysis and mining
 Microarray standards and resources
 Recent advances

February 26th 2011
MSc Seminar: Donald Dunbar
Microarray Informatics
Part 1
February 26th 2011
MSc Seminar: Donald Dunbar
Biological research
Using a wide range of experimental and
computational methods to answer biological
questions
 Genetics, physiology, molecular biology…
 Biology and informatics  bioinformatics
 Genomic revolution
 What can we measure?

February 26th 2011
MSc Seminar: Donald Dunbar
The central dogma
promoter
exon
intron
exon
intron
intron
exon
30k
Gene:
DNA
90k
Transcript:
mRNA
100+k
Protein
kinase, protease, structural
receptor, ion channel…
February 26th 2011
MSc Seminar: Donald Dunbar
Measuring RNA and proteins

Proteins
Western blot
 ELISA
 Enzyme assay


mRNA
Northern blot
 RT-PCR

February 26th 2011
MSc Seminar: Donald Dunbar
Measuring RNA and proteins

Protein levels would be best


no real high throughput method
mRNA levels will do
genome-wide physical microarrays
 other ‘array-like’ technologies
 sequencing (see later)

February 26th 2011
MSc Seminar: Donald Dunbar
Measuring transcripts
Genome level sequencing
 New miniaturisation technologies
 Better bioinformatics

microarrays
February 26th 2011
MSc Seminar: Donald Dunbar
Microarrays: wish list
Include all genes in the genome
 Include all splice variants
 Give reliable estimates of expression
 Easy to analyse



bioinformatics tools available
Cost effective
February 26th 2011
MSc Seminar: Donald Dunbar
Microarray technologies - 1






Oligonucleotides - Affymetrix
One chip all genes
Chips for many species
Several oligos per transcript
Use of control, mismatch sequences
One sample per chip



February 26th 2011
‘absolute quantification’
Well established in research
Expensive
MSc Seminar: Donald Dunbar
Microarray technologies - 1
February 26th 2011
MSc Seminar: Donald Dunbar
Microarray technologies - 2




Illumina BeadChip
Oligos on beads
Hybridise in wells
Compared to Affy



Higher throughput
Less RNA needed
Cheaper
February 26th 2011
MSc Seminar: Donald Dunbar
Microarrays: wish list
Include all genes in the genome
 Include all splice variants
 Give reliable estimates of expression
 Easy to analyse



bioinformatics tools available
Cost effective
February 26th 2011
MSc Seminar: Donald Dunbar
Problems with transcriptomics







The gene might not be on the chip
Can’t differentiate splice variants well
The gene might be below detection limit
Can’t differentiate RNA synthesis and
degradation
Can’t tell us about post translational events
Bioinformatics can be difficult
Relatively expensive
February 26th 2011
MSc Seminar: Donald Dunbar
History of Microarrays





Developed in early 1990s after larger macro-arrays (100-1000 genes)
Microarrays were spotted on glass slides
Labs spotted their own (Southern, Brown)
Then companies started (Affymetrix, Agilent)
Some early papers:



Int J Immunopathol Pharmacol. 1990 19(4):905-914. Raloxifene covalently
bonded to titanium implants by interfacing with (3-aminopropyl)-triethoxysilane
affects osteoblast-like cell gene expression. Bambini et al
Nature 1993 364(6437): 555-6 Multiplexed biochemical assays with biological
chips. Fodor SP, et al
Science 1995 Oct 20;270(5235):467-70 Quantitative monitoring of gene
expression patterns with a complementary DNA microarray. Schena M, et al
February 26th 2011
MSc Seminar: Donald Dunbar
Microarray publications
‘hypertension’ and ‘microarray’
February 26th 2011
MSc Seminar: Donald Dunbar
Types of experiment

Usually control v test(s)
Placebo
Drug treatment
Wild-type
Knockout
Healthy
Patient
Normal tissue
Cancerous tissue
Time = 0
Time = 1
February 26th 2011
Drug 2…
Time = 2…
MSc Seminar: Donald Dunbar
Types of experiment



Usually control v test(s)
But also test v test(s)
Comparison:





placebo v drug treatment
drug 1 v drug 2
tissue 1 v tissue 2 v tissue 3 (pairwise)
time 0 v time 1, time 0 v time 2, time 0 v time 3
time 0 v time 1, time 1 v time 2, time 2 v time 3
February 26th 2011
MSc Seminar: Donald Dunbar
A typical experiment
experiment
design
February 26th 2011
MSc Seminar: Donald Dunbar
Experiment design: system

What is your model?

animal, cell, tissue, drug, time…
What comparison?
 What platform



microarray? oligo, cDNA?
Record all information: see “standards”
February 26th 2011
MSc Seminar: Donald Dunbar
Experiment design: replicates


Microarrays are noisy: need extra confidence in the
measurements
We usually don’t want to know about a specific
individual



Biological replicates needed



eg not an individual mouse, but the strain
although sometimes we do (eg people)
independent biological samples
number depends on variability and required detection
Technical replicates (same sample, different chip)
usually not needed
February 26th 2011
MSc Seminar: Donald Dunbar
A typical experiment
experiment
design
collect samples
prepare RNA
raw data
chip process
February 26th 2011
MSc Seminar: Donald Dunbar
Raw data

Affymetrix GeneChip process generates:
DAT
 CEL
 CDF

image file
raw data file
chip definition file
Processing then involves CEL and CDF
 All platforms have different data formats…
 Will use Bioconductor

February 26th 2011
MSc Seminar: Donald Dunbar
Bioconductor (BioC)








http://www.bioconductor.org/
“Bioconductor is an open source software project for the
analysis and comprehension of genomic data”
Started 2001, developed by expert volunteers
Built on statistical programming environment
Provides a wide range of powerful statistical and
graphical tools
Use BioC for most microarray processing and analysis
Most platforms now have BioC packages
Make experiment design file and import data
February 26th 2011
MSc Seminar: Donald Dunbar
Quality control (QC)

Affymetrix gives data on QC



the microarray team will record these for you
scaling factor, % present, spiked probes, internal controls
Bioconductor offers:



boxplots and histograms of raw and normalised data
RNA degradation plots
specialised quality control routines (eg arrayQualityMetrics)
February 26th 2011
MSc Seminar: Donald Dunbar
Pre-processing: background

Signal corresponds to expression…

plus a non-specific component (noise)
Non specific binding of labelled target
 Need to exclude this background
 Several methods exist

eg Affy: PM-MM but many complications
 eg RMA PM=B+S (don’t use MM)

February 26th 2011
MSc Seminar: Donald Dunbar
Pre-processing: normalisation

In addition to background corrections



Make use of






statistics
combined
control
genes with probe set summary:
total
(assumptions
not always
appropriate)
getintensity
an expression
value
for the
gene
But seems to be non-linear dependency on intensity


chip, probe, spatial, intra and inter-chip variation
need to remove to get at real expression differences
additive and multiplicative errors
Quantile normalisation often used
Normalisation more complicated for 2-colour arrays
Try to remove most noise at lab stage (ie control things well
statistically)
February 26th 2011
MSc Seminar: Donald Dunbar
A typical experiment
experiment
design
collect samples
processed
data
February 26th 2011
prepare RNA
raw data
chip process
MSc Seminar: Donald Dunbar
Part 1 Summary

Microarrays in biological research

Two types of microarray

A typical microarray experiment

Experiment design

Data pre-processing
February 26th 2011
MSc Seminar: Donald Dunbar
Microarray Informatics
Part 2
February 26th 2011
MSc Seminar: Donald Dunbar
A typical experiment
experiment
design
collect samples
processed
data
February 26th 2011
prepare RNA
raw data
chip process
MSc Seminar: Donald Dunbar
Data analysis
Identifying differential expression
 Compare control and test(s)







t-test
ANOVA
SAM (FDR)
Limma
Rank Products
Time series
control
treated
v
0
1
v
February 26th 2011
2
v
MSc Seminar: Donald Dunbar
3
v
Multiple testing

Problem:



statistical testing of 30,000 genes
at α = 0.05  1500 genes
Need to correct this

Multiply p-value by number of observations
• Bonferroni, too conservative

False discovery
• defines a q value: expected false positive rate
• Less conservative, but higher chance of type I error
• Benjamini and Hochberg


Then regard genes as differentially expressed
Depends on follow-up procedure!
February 26th 2011
MSc Seminar: Donald Dunbar
Hierarchical clustering

Look for structure within dataset


similarities between genes
Compare gene expression profiles



Euclidian distance
Correlation
Cosine correlation

Calculate with distance matrix

Combine closest, recalculate, combine closest… (or split!)

Draw dendrogram and heatmap
February 26th 2011
MSc Seminar: Donald Dunbar
Hierarchical clustering
Samples
Genes
February 26th 2011
MSc Seminar: Donald Dunbar
Hierarchical clustering

Heatmaps for microarray data
February 26th 2011
MSc Seminar: Donald Dunbar
Hierarchical clustering

Predicting association of known and novel genes

Class discovery in samples: new subtypes

Visualising structure in data (sample outliers)

Classifying groups of genes

Identifying trends and rhythms in gene expression

Caveat: you will always see clusters, even when they
are not particularly meaningful
February 26th 2011
MSc Seminar: Donald Dunbar
Sample classification
Supervised or non-supervised
 Non-supervised



like hierarchical clustering of samples
Supervised
have training (known) and test (unknown) datasets
 use training sets to define robust classifier
 apply to test set to classify new samples

February 26th 2011
MSc Seminar: Donald Dunbar
Sample classification
good prognosis
 drug treatment
bad prognosis
 surgery
Gene selection, training, cross validation 
classifier: gene x * 0.5 gene y * 0.25 gene z …
?
February 26th 2011
?
?
?
?
MSc Seminar: Donald Dunbar
Sample classification
good prognosis
 drug treatment
bad prognosis
 surgery
Apply classifier
February 26th 2011
MSc Seminar: Donald Dunbar
Sample classification

Class prediction for new samples
cancer prognosis
 pharmacogenomics (predict drug efficacy)


Need to watch for overfitting
using too many parameters (genes) to classify
 classifier loses predictive power

February 26th 2011
MSc Seminar: Donald Dunbar
Annotation
Big problem for microarrays
 Genome-wide chips need genome-wide
annotation
 Good bioinformatics essential

use several resources (Affymetrix, Ensembl)
 keep up to date (as annotation changes)
 genes have many attributes

• name, symbol, gene ontology, pathway…
February 26th 2011
MSc Seminar: Donald Dunbar
Data-mining
Microarrays are a waste
of time
…unless you do
something with the data
February 26th 2011
MSc Seminar: Donald Dunbar
Data-mining

Once data are statistically analysed:
pull out genes of interest
 pull out pathways of interest
 mine data based on annotation

• what are the expression patterns of these genes
• what are the expression patterns in this pathway


mine genes based on expression pattern
• what types of genes are up-regulated …
• fold change, p-value, expression level, correlation
Should be driven by the biological question
February 26th 2011
MSc Seminar: Donald Dunbar
February 26th 2011
MSc Seminar: Donald Dunbar
February 26th 2011
MSc Seminar: Donald Dunbar
February 26th 2011
MSc Seminar: Donald Dunbar
Further data-mining

Other tools available using

gene ontology (GO)
February 26th 2011
MSc Seminar: Donald Dunbar
Further data-mining

Other tools available using
gene ontology (GO)
 biological pathways (eg KEGG)

February 26th 2011
MSc Seminar: Donald Dunbar
Further data-mining

Other tools available using
gene ontology (GO)
 biological pathways (eg KEGG)
 genomic localisation (Ensembl)

February 26th 2011
MSc Seminar: Donald Dunbar
Further data-mining

Other tools available using
gene ontology (GO)
 biological pathways (eg KEGG)
Gene set
 genomic
localisation
Enriched
regulatory (Ensembl)
sequences
Functional
significance?
 regulatory
sequence
data (Toucan, BioProspector)

February 26th 2011
MSc Seminar: Donald Dunbar
Further data-mining

Other tools available using
gene ontology (GO)
 biological pathways (eg KEGG)
 genomic localisation (Ensembl)
 regulatory sequence data (Toucan, BioProspector)

February 26th 2011
MSc Seminar: Donald Dunbar
Further data-mining

Other tools available using
gene ontology (GO)
 biological pathways (eg KEGG)
 genomic localisation (Ensembl)
 regulatory sequence data (Toucan, BioProspector)
 literature (eg Pubmatrix, Ingenuity…)

February 26th 2011
MSc Seminar: Donald Dunbar
Further data-mining

Other tools available using
gene ontology (GO)
 biological pathways (eg KEGG)
 genomic localisation (Ensembl)
 regulatory sequence data (Toucan, BioProspector)
 literature (eg Pubmatrix, Ingenuity…)


… to make sense of the data
February 26th 2011
MSc Seminar: Donald Dunbar
Microarray Resources


Microarray data repositories

Array express (EBI, UK)

Gene Expression Omnibus (NCBI, USA)

CIBEX (Japan)
Annotation

NetAffx, Ensembl, TIGR, Stanford… BioConductor
February 26th 2011
MSc Seminar: Donald Dunbar
Microarray Standards

MIAME



Minimum annotation about a microarray experiment
Comprehensive description of experiment
Models experiments well, and allows replication
• chips, samples, treatments, settings, comparisons


Required for most publications now
MAGE-ML



Microarray gene expression markup language
Describes experiment (MIAME) and data
Tools available for processing
February 26th 2011
MSc Seminar: Donald Dunbar
Recent advances: Exon chips

Affymetrix now have chips that allow us to
measure expression of splice variants
0.66 (down moderately)
1.4 (up slightly)
3 (up strongly)
New chips will give us much more information
February 26th 2011
MSc Seminar: Donald Dunbar
Recent advances: Genotyping chips






All discussion on EXPRESSION chips
Also can get chips looking at genotype
Tell us the sequence for genome-wide markers
Test 300,000 markers with one chip
Look for association with disease, prognosis, trait…
Combined with expression chips to generate

EXPRESSION QUANTITATIVE TRAIT LOCUS (eQTL)

Overlap of expression and genetic differences (cis)
Correlation at different locus (trans)

February 26th 2011
MSc Seminar: Donald Dunbar
Next Generation Sequencing

Sequence rather than hybridisation

Gene expression, genotyping, epigenetics

New technologies: much cheaper than before

Open ended (no previous knowledge required)

Will take over in 2 years: the end of microarrays?
February 26th 2011
MSc Seminar: Donald Dunbar
Part 2 Summary
 Data
analysis
 Data Mining
 Microarray Resources
 Microarray Standards
 Recent & future advances
February 26th 2011
MSc Seminar: Donald Dunbar
Seminar Summary
 Part
1
 Microarrays
in biological research
 A typical microarray experiment
 Part
2
 Data
analysis and mining
 Recent & future advances
February 26th 2011
MSc Seminar: Donald Dunbar
Contact
Donald Dunbar
 Cardiovascular Bioinformatics
 donald.dunbar@ed.ac.uk
 0131 242 6700
 Room W3.01, QMRI, Little France
 www.bioinf.mvm.ed.ac.uk

February 26th 2011
MSc Seminar: Donald Dunbar
Download