GeneExpression II

GeneExpression II:
1. Transcription Factor Binding Sites
2. Microarrays
26th May, 2010
Karsten Hokamp
Genetics Department
GeneExpression II
BI2010
1
TFBS prediction - Overview
• Introduction
• Methods
• Implementations
• Analyse 2kb upstream of eve
TFBS prediction - Introduction
• TFBS = DNA motifs
– 5 – 20 bp long
– variable in sequence
– multiple occurrences/sites per gene
– bound by a combination of activators and repressors
• cis-regulatory regions
– clusters of TFBS, located from about 20 kb upstream to the first intron
TFBS prediction - Introduction
Example: MSE2 stripe enhancer for eve (D. melanogaster):
(Janssens et al., 2006)
→ understand transcriptional regulation
→ infer regulatory networks
TFBS prediction - Methods
• De novo motif prediction (overrepresentation)
• Searching for known motifs
• Phylogenetic Footprinting/Shadowing
• Clustering of TFBSs
• Integration of external data sources
(co-expression, structure)
TFBS prediction - Overview
(Figure: overview of TFBS prediction approaches. Source: Hannenhalli, 2008, Bioinformatics)
De novo motif prediction
• Search for over-represented motifs
• Frequency count
• Works well for yeast and prokaryotes
• Not so successful in higher organisms
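
The frequency-count idea can be sketched in a few lines: count every k-mer in a set of upstream sequences and compare its frequency with that in a background set. This is only an illustration under stated assumptions; the toy sequences, the function names `count_kmers` and `enrichment`, and the pseudocount are made up here and are not taken from any of the tools named in these slides.

```python
from collections import Counter

def count_kmers(seqs, k):
    """Count all overlapping k-mers in a list of DNA sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def enrichment(foreground, background, k, pseudocount=1.0):
    """Ratio of foreground to background frequency for each foreground k-mer."""
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for kmer, n in fg.items():
        fg_freq = n / fg_total
        # The pseudocount (an assumption) avoids division by zero for unseen k-mers
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount)
        scores[kmer] = fg_freq / bg_freq
    return scores

# Toy, made-up sequences (not real promoter regions)
foreground = ["ACGTTTAATCCGA", "GGTTAATCCTAC", "TTAATCCAGGCA"]
background = ["ACGTACGTACGTAC", "GGCATGCATGCAAT", "CTAGCTAGGATCGA"]

scores = enrichment(foreground, background, k=7)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Real de novo tools use statistical models rather than raw ratios, which is one reason simple counting works better on small genomes than on higher organisms.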
Using motif databases
• Search for known motifs
• Position-specific scoring matrix (PSSM) or
position weight matrix (PWM)
• Databases:
– TRANSFAC
– JASPAR
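
As a rough illustration of how a PWM is used to search for a known motif, the sketch below converts a count matrix into log2-odds weights and scores every window of a sequence. The count matrix, the uniform background of 0.25, the pseudocount, and the function names `to_pwm` and `scan` are all assumptions made for this example; they are not taken from TRANSFAC or JASPAR.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per position; made-up numbers)
counts = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 0, "C": 1, "G": 9, "T": 0},
    {"A": 1, "C": 0, "G": 0, "T": 9},
]

def to_pwm(counts, background=0.25, pseudocount=0.5):
    """Convert per-position base counts into log2-odds weights."""
    pwm = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        pwm.append({b: math.log2((col[b] + pseudocount) / total / background)
                    for b in "ACGT"})
    return pwm

def scan(seq, pwm):
    """Score every window of the sequence; returns a list of (position, score)."""
    w = len(pwm)
    return [(i, sum(pwm[j][seq[i + j]] for j in range(w)))
            for i in range(len(seq) - w + 1)]

pwm = to_pwm(counts)
hits = scan("TTACGTGGACGA", pwm)
best_pos, best_score = max(hits, key=lambda h: h[1])
print(best_pos, round(best_score, 2))
```

In practice a score threshold decides which windows are reported as predicted sites, and the choice of threshold trades sensitivity against specificity.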
Phylogenetic-based methods
• Search for islands of highly conserved regions
• Footprinting: elements conserved across
distant species
• Shadowing: elements conserved between
closely related species
• Pros: increases specificity
• Cons: conservation is neither sufficient nor
necessary
Practical:
• Try some tools on the 2 kb upstream sequence of
D. melanogaster eve and compare with
published results.
– Alibaba (de novo)
– Match (TRANSFAC)
– MEME (de novo)
– PROMO (TRANSFAC)
– WeederH (phylogenetic footprinting)
Other tools:
• Many more tools available for download:
– Sombrero
– FootPrinter
– PhyloGibbs
• Other Web-tools for groups of co-regulated
genes:
– RSAT
– NestedMICA
– WebMOTIFS
TFBS prediction - Conclusion:
• No single tool gives accurate results
• Combination of predictions from multiple
tools might increase specificity
• Incorporate additional information for greater
precision
Microarrays - Overview
• Introduction
• Data Generation
• Data Characteristics
• Diagnostic Plots
• Preprocessing
• Statistical Analysis
What is a microarray?
• A solid support onto which the sequences
from thousands of different genes are
immobilized
• Different array supports
- glass slide
- nylon membrane
- silicon chip
• Different probe types
- short oligonucleotides
- long oligonucleotides
- cDNA
• Each probe measures the expression of a single transcript
Microarrays – How do they work?
Affymetrix arrays: single colour
(Figure: RNA from uninfected and from infected cells is reverse-transcribed to cDNA, labelled with dye, and hybridized to separate arrays, Slide A and Slide B.)
Microarrays – How do they work?
Spotted arrays: two colour
(Figure: prepare the sample from uninfected and infected cells, prepare the microarray, then hybridize the target to the microarray.)
Microarray: Subgrids
• One pin per subgrid (printTip group, stratus)
Microarrays – Data Extraction
• How to get data from the slides into the
computer?
Data Extraction – Scanning
Slide → Scanner → Images (TIFF)
Scanner settings:
- laser power
- sensitivity
- focus
Images:
- channel 1 (green)
- channel 2 (red)
- composite (green, yellow, red)
Data Extraction – Quantification
• Align grid, tag unreliable spots
• Software:
- ImaGene
- GenePix
- ScanAlyze
...
• Program assigns numbers representing the intensity of each spot: foreground (FG) and background (BG)

Data File:
Spot ID   FG CH1   BG CH1   FG CH2   BG CH2   FL
GFP         1241      671     6707      713    1
PA0080       570      495      599      384    0
PA0080       691      632      667      651    0
PA0122       703      610      653      619    0
PA0122       708      598      695      602    0
…              …        …        …        …    …
Quantification: Intensity Range
- spot area composed of pixels
- value range: 0 – 2^16 - 1 (= 0 – 65535)
- saturation possible
- low intensities = noise
Data Generation – Summary
• RNA labelling and hybridization
• Array scanning
• One image per channel
• Load into quantification software
• Flag flawed spots
• Extract values
• Text file with FG and BG intensities (per probe)
Microarrays – Sources of Variation
(Figure: sources of variation along the two-colour workflow. Sample1 mRNA and Sample2 mRNA are reverse-transcribed (RT) and labelled as Cy3-cDNA and Cy5-cDNA, hybridized to the cDNA array, scanned to .tiff image files, and quantified as Cy3 and Cy5 intensities in the raw data file. Annotated sources of variation: systematic experimental error, uneven hybridization, print-tip variations, wavelength-dependent and intensity-dependent effects, background variations, algorithm-dependent image processing. Source: www.tigr.org)
Microarrays – Sources of Variation
• Technical:
– labelling
– hybridization
– slide quality
– scanning
– print-tip effect
– quantification
– experimenter
• Biological:
– individual/strain/sample
– environment
– time point
Microarrays – Data Characteristics
• Intensities vs. ratios
• Natural scale vs. log scale
Intensities vs. Ratios
• Intensities:
ratio = ch2 / ch1

         ch1     ch2
gene1    517    2100
gene2   3200   13000
gene3   3200     800
gene4  12000    3000
Intensities vs. Ratios
• Ratios:
ratio = ch2 / ch1
ratio > 0
ratio = 1 if ch1 = ch2

         ch1     ch2   ratio
gene1    517    2100    4.06
gene2   3200   13000    4.06
gene3   3200     800    0.25
gene4  12000    3000    0.25
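
The ratio column can be reproduced with a short sketch like the one below. The intensity values are the four example genes from the slide; the variable names are chosen here only for illustration.

```python
# (ch1, ch2) intensities for the four example genes from the slide
spots = {
    "gene1": (517, 2100),
    "gene2": (3200, 13000),
    "gene3": (3200, 800),
    "gene4": (12000, 3000),
}

# ratio = ch2 / ch1
ratios = {gene: ch2 / ch1 for gene, (ch1, ch2) in spots.items()}

for gene, ratio in ratios.items():
    print(f"{gene}: {ratio:.2f}")
```

Note that gene1 and gene2 end up with the same ratio although their absolute intensities differ roughly six-fold, which is the base-level information a ratio hides.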
Intensities vs. Ratios
• Ratios
– convey expression changes
– hide base level differences
• But: absolute changes can be important,
too!
Graphical Representation: Signal Scatter Plot
(Figure: scatter plot of CH2: Cy5 (y-axis) against CH1: Cy3 (x-axis), both axes running to 18000, with the diagonal marking ratio = 1 and the four example spots below.)

         ch1     ch2
spot1    517    2100
spot2   3200   13000
spot3   3200     800
spot4  12000    3000
Graphical Representation: Signal Scatter Plot
(Figure: scatter plot of CH2: Cy5 against CH1: Cy3 for a full array, with the ratio = 1 diagonal and an annotation of ~10x.)
Graphical Representation: Histogram
(Figure: histogram of ratios; x-axis: ratios, with ratio = 1 marked; y-axis: frequency.)
Raw vs. Log ratios
• Log transformation
x = base^y
x = 2^y
8 = 2^3
0.125 = 2^-3

ratios:
raw     log2
0.1     -3.3
0.5     -1
1        0
2        1
10       3.3

y undefined for x <= 0
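
A minimal sketch of the log transformation, assuming base 2 as in the examples above. The function name `log2_ratio` is made up for this illustration; the explicit check mirrors the point that the logarithm is undefined for values <= 0.

```python
import math

def log2_ratio(ch1, ch2):
    """Return log2(ch2 / ch1); undefined if either intensity is <= 0."""
    if ch1 <= 0 or ch2 <= 0:
        raise ValueError("log ratio is undefined for intensities <= 0")
    return math.log2(ch2 / ch1)

# Raw ratios from the table and their log2 values
for raw in (0.1, 0.5, 1, 2, 10):
    print(raw, round(math.log2(raw), 1))

# Symmetry: 8-fold up and 8-fold down are equally far from 0
print(log2_ratio(1, 8), log2_ratio(8, 1))
```

On the raw scale, down-regulation is squeezed between 0 and 1 while up-regulation is unbounded; on the log scale both directions are symmetric around 0.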
Log ratios: scatter plot
(Figure: two scatter plots. Raw scale: CH2: Cy5 against CH1: Cy3, with the diagonal ratio = 1. Log scale: CH2: log2(Cy5) against CH1: log2(Cy3), with the diagonal log-ratio = 0.)
Log ratios: histogram
(Figure: histograms of raw ratios, with ratio = 1 marked, and of log-ratios; y-axis: frequency.)
Microarrays – Data Characteristics
• ratios vs. intensities
– convey expression changes
– hide base level differences
• log ratios vs. raw ratios
– reduce spread
– provide symmetry
Diagnostic plots
• histogram
• scatter plot
• box plot
• MA plot
• chip visualization
Diagnostic plots – Histogram
(Figure: histograms of log(CH1) and log(CH2) intensities, y-axis: frequency, for a good and a bad slide.)
Diagnostic plots – Scatter plot
(Figure: scatter plots of an o.k. slide and a bad slide.)
Diagnostic plots – MA plot
• Rotate the scatter plot by ~45 degrees
Diagnostic plots – MA plot
• Mathematically:
M (Minus) = log2(R) – log2(G)
A (Addition) = 0.5 * ( log2(R) + log2(G) )
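
The two definitions translate directly into code. This is a small sketch; the function name `ma_values` and the example intensities are assumptions for illustration, with R taken as the red (Cy5) and G as the green (Cy3) intensity.

```python
import math

def ma_values(r, g):
    """Return (M, A) for red intensity r and green intensity g."""
    m = math.log2(r) - math.log2(g)           # M: log ratio ("Minus")
    a = 0.5 * (math.log2(r) + math.log2(g))   # A: average log intensity ("Addition")
    return m, a

# Made-up example: R = 4096, G = 1024
m, a = ma_values(4096, 1024)
print(m, a)
```

Plotting M against A for every spot gives the MA plot: equal expression lies on the horizontal line M = 0, and any trend of M with A points to an intensity-dependent bias.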
Diagnostic plots – MA plot
(Figure: MA plot with M on the y-axis and A on the x-axis.)
2-fold cut-off
(Figure: MA plot with 2-fold cut-off lines, i.e. M = +1 and M = -1 on the log2 scale, built up over three slides.)
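
A fold-change cut-off can be applied as a simple filter on the log ratio. The sketch below assumes log2 ratios, so a 2-fold change corresponds to |M| >= 1; the function name `passes_fold_cutoff` and the example values are illustrative only.

```python
import math

def passes_fold_cutoff(r, g, fold=2.0):
    """True if the spot is at least `fold`-fold up or down (R relative to G)."""
    return abs(math.log2(r / g)) >= math.log2(fold)

print(passes_fold_cutoff(2100, 517))   # about 4-fold up
print(passes_fold_cutoff(800, 3200))   # 4-fold down
print(passes_fold_cutoff(1000, 600))   # less than 2-fold
```

A fixed cut-off ignores how variable each spot is, which is why it is usually combined with replicates and a statistical test.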
Dye Swap
• Unequal labelling efficiency of Cy3 and Cy5
(Figure: MA plot, M = log(R/G) against A = ½ log(RG), for Cy3-cDNA and Cy5-cDNA.)
• Strong bias towards Cy3!
Dye Swap
(Figure: dye-swap design. cDNA from uninfected and infected cells is hybridized twice, once as Cy5 vs Cy3 and once with the dyes swapped, Cy3 vs Cy5; the two hybridizations are combined into a merged data set.)
Dye Swap
• Unequal labelling efficiency
(Figure: MA plots, M = log(R/G) against A = ½ log(RG), for Cy3-cDNA and Cy5-cDNA.)
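
The slides do not say how the dye-swap pair is merged. One common approach, shown here purely as an assumption, is to average the forward log ratio with the negated log ratio of the swapped slide, so that a dye bias that adds the same amount to both slides cancels out. The function name and the numbers are made up.

```python
def merge_dye_swap(m_forward, m_swapped):
    """Combine per-spot log ratios from a dye-swap pair: (M_fwd - M_swap) / 2."""
    return [(mf - ms) / 2 for mf, ms in zip(m_forward, m_swapped)]

# Made-up example: true log ratios plus a constant dye bias of 0.5
true_m = [1.0, -2.0, 0.0]
bias = 0.5
m_forward = [t + bias for t in true_m]    # sample labelled with the first dye
m_swapped = [-t + bias for t in true_m]   # dyes swapped: sign of true effect flips

merged = merge_dye_swap(m_forward, m_swapped)
print(merged)
```

In this toy case the bias cancels exactly; real dye effects are often intensity- and gene-dependent, so a dye swap reduces the bias rather than removing it completely.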
Diagnostic plots – Box plot
• outliers: points beyond the whiskers
• whiskers: extend up to 1.5 times the interquartile range
• upper quartile
• median
• lower quartile
• interquartile range: from lower quartile to upper quartile
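
The box plot elements can be computed with the Python standard library, as sketched below. The data values are made up, the function name `box_plot_stats` is hypothetical, and the quartile method ("inclusive") is one of several conventions; plotting packages may use slightly different ones.

```python
import statistics

def box_plot_stats(values):
    """Quartiles, interquartile range and outliers by the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}

# Made-up values with one obvious outlier
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Comparing these summaries across slides or print-tip groups shows whether the distributions are centred and spread alike.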
Diagnostic plots – Box plot
(Figure: box plots of an o.k. data set and a bad data set.)
Diagnostic plots – Box plot (printtip)
Diagnostic plots – Chip visualization
(Figure: chip visualizations of a good and a bad slide.)
Diagnostic plots: Summary
• histogram
– data distribution (intensities, ratios)
• scatter plot
– dye effect, print-tip effect
• box plot
– equal average ratio and distribution, print-tip effect
• MA plot
– dye effect and intensity-dependent ratio
• chip visualization
– spatial bias, scratches, bubbles, smears
Microarrays – Preprocessing
• Flagging
• Background correction
• Normalization
• Flawed slides: discard and repeat
Microarrays – Flagging
• Skip or keep (but warn)
• e.g. skip low intensities
and saturated spots
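
A flagging rule of this kind might look like the sketch below. The saturation limit assumes 16-bit images (maximum value 65535); the low-signal threshold of 100 and the function name `flag_spot` are arbitrary assumptions for illustration, and the example values are made up.

```python
SATURATION = 2**16 - 1  # 65535, top of the 16-bit intensity range

def flag_spot(fg_ch1, fg_ch2, bg_ch1, bg_ch2, min_signal=100):
    """Return 'saturated', 'low' or 'ok' for one spot (illustrative thresholds)."""
    if fg_ch1 >= SATURATION or fg_ch2 >= SATURATION:
        return "saturated"
    # Low: neither channel rises clearly above its local background
    if fg_ch1 - bg_ch1 < min_signal and fg_ch2 - bg_ch2 < min_signal:
        return "low"
    return "ok"

print(flag_spot(65535, 12000, 500, 480))  # saturated in channel 1
print(flag_spot(560, 540, 500, 510))      # barely above background
print(flag_spot(3200, 800, 500, 480))     # usable
```

Whether flagged spots are skipped or kept with a warning is a separate decision; the flag only records the problem.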
Microarrays – Background correction
• Subtract background measurements from
foreground intensities
• Brings intensities closer to zero and inflates
ratios.
Example: spot with five-fold upregulation:
500 / 100 = 5
Subtract background (50) from both channels:
450 / 50 = 9
• Additional source of variance!
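
The worked example can be checked with a few lines of code. The function name `bg_corrected_ratio` is made up; returning `None` when a corrected intensity is not positive is a design choice for this sketch, not something the slides prescribe.

```python
def bg_corrected_ratio(fg_ch2, fg_ch1, bg_ch2, bg_ch1):
    """Ratio of background-subtracted intensities, or None if not computable."""
    num = fg_ch2 - bg_ch2
    den = fg_ch1 - bg_ch1
    if num <= 0 or den <= 0:
        return None  # background is as high as the signal; spot should be flagged
    return num / den

raw_ratio = 500 / 100
corrected = bg_corrected_ratio(500, 100, 50, 50)
print(raw_ratio, corrected)
```

The closer the foreground is to the background, the more a small error in the background estimate changes the ratio, which is how the correction adds variance.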
Microarrays – Normalization
• Remove intensity-dependent effects, dye bias,
spatial bias or print-tip variations:
– Global mean, median
– Loess, lowess
– Print-tip loess
– 2D loess
– Variance stabilization (VSN)
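
The simplest of these methods, global median normalization, is sketched below: the median log ratio of a slide is subtracted from every spot so that the distribution is centred on zero. The function name and the M values are made up. Loess-based methods instead fit a curve of M against A and subtract it, which this sketch does not attempt.

```python
import statistics

def global_median_normalize(m_values):
    """Shift log ratios so that their median becomes 0."""
    shift = statistics.median(m_values)
    return [m - shift for m in m_values]

# Made-up log ratios with an overall shift of 0.5
m_raw = [0.5, 1.5, -0.5, 0.5, 2.5]
m_norm = global_median_normalize(m_raw)
print(m_norm)
```

This rests on the assumption that most genes do not change between the two samples, so the typical log ratio should be zero.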
Microarrays – Normalization
(Figure: MA plots (M against A) comparing raw data with global mean, LOESS, and print-tip LOESS normalization.)
Microarrays – Normalization
(Figure: comparison of raw, global mean, LOESS, and print-tip LOESS normalized data.)
Microarrays – Discard and repeat
• Some slides turn out to be uncorrectable
and need to be repeated (unless a sufficient
number of replicates remains).
• Remember: bad data in = bad data out!
Microarrays – Statistical Analysis
• Replicates
• Variation
• t-tests
• multiple-testing correction
• gene lists
Statistical Analysis – Replicates
• Two types of repeats
• Technical:
– multiple copies of probes on array
– multiple repeats of hybridization (same RNA)
• Biological:
– multiple hybridizations with RNA from multiple
extractions
→ Need replicates to measure variation!
Statistical Analysis – Variation
• Biological variation differs from technical variation
• Statistically incorrect to mix the two
• Important consideration for repeats:
is high confidence in results needed for
a) one sample/patient/colony, or
b) a group of samples/patients/colonies?
→ Prioritise biological repeats!
Statistical Analysis – t-tests
Different classes of samples:
- find genes that are affected by a treatment
- H0: expression does not change
- p-value = degree of evidence against H0
- t-test requires at least 2 replicates per class and
provides a p-value for each gene
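
As a sketch of what the test computes for a single gene, the code below calculates a two-sample t statistic (Welch's unequal-variance form, chosen here as an assumption) using only the standard library. The replicate values and the function name `welch_t` are made up. Turning the statistic into a p-value requires the t distribution, which a statistics package would provide and which is left out here.

```python
import math
import statistics

def welch_t(treated, control):
    """Two-sample t statistic; needs at least 2 replicates per class."""
    var_t = statistics.variance(treated)   # sample variance
    var_c = statistics.variance(control)
    se = math.sqrt(var_t / len(treated) + var_c / len(control))
    return (statistics.mean(treated) - statistics.mean(control)) / se

# Made-up log2 intensities for one gene, three replicates per class
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
t = welch_t(treated, control)
print(round(t, 3))
```

With a single replicate per class the variance cannot be estimated, which is why at least two replicates are required.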
Statistical Analysis – multiple-testing correction
Carrying out t-tests on 10,000 genes
→ on average 500 will have p-value <= 0.05 by chance alone, even if no gene changes
Methods for multiple-testing correction:
Bonferroni (very strict)
Benjamini-Hochberg → false-discovery rate (FDR)
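
Both corrections are short enough to sketch directly. The p-values below are made up, and the function names are chosen for this illustration; statistical packages provide tested implementations that should be preferred in practice.

```python
def bonferroni(pvalues, alpha=0.05):
    """Significant if p <= alpha / number of tests."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up procedure controlling the false-discovery rate at alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff_rank = rank  # largest rank that meets its threshold
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[i] = True
    return significant

# Expected number of p-values <= 0.05 among 10,000 tests under H0
expected_false_positives = round(10_000 * 0.05)

# Made-up p-values for ten genes
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(expected_false_positives, n_bonferroni, n_bh)
```

In this toy example five p-values are below 0.05 before correction; Bonferroni keeps one and Benjamini-Hochberg keeps two, which illustrates why Bonferroni is described as very strict.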
Statistical Analysis – Gene lists
• List of good candidate genes to follow up
• False positives (FP) vs false negatives (FN)
• Fold-change vs p-value
Choice depends on downstream analysis
→ Input for downstream analysis:
clustering, pathway analysis, enrichment, etc.
Analysis tools
• Stand-alone tools:
– R
– Bioconductor
– ArrayNorm
– TM4
– GeneSpring (commercial)
• Web-based tools:
– ArrayPipe
– ExpressYourself
– GenePublisher
– GEPAS
– GeneTraffic (commercial)
Public Repositories
• ArrayExpress
– EBI, MIAME-compliant
• Gene Expression Omnibus (GEO)
– NCBI
– "world's first write-only database"
Summary
• Many sources of variance
• Large numbers of replicates required for reliable
results
• Data: be aware of flaws/bias
• Flagging/discarding results in data loss
• Correction is often possible but can introduce artifacts
• However:
Microarrays can still help to make great discoveries!
END