CS 6293 Advanced Topics: Transcriptional Bioinformatics Introduction to Gene Expression Data Analysis

Outline
• Biological background
• Microarray
– Basic categories of microarray
– Computational and statistical methods involved in
microarray
• Pre-processing
• Differentially expressed gene identification
• Clustering and classification
• Network / pathway modeling
• RNA-seq
Genome is fixed – Cells are
dynamic
• A genome is static
– (almost) Every cell in our body has a copy of
the same genome
• A cell is dynamic
– Responds to internal/external conditions
– Most cells follow a cell cycle of division
– Cells differentiate during development
Gene regulation
• … is responsible for the dynamic cell
• Gene expression (production of protein) varies
according to:
– Cell type
– Cell cycle
– External conditions
– Location
– Etc.
Where gene regulation takes place
• Opening of chromatin
• Transcription
• Translation
• Protein stability
• Protein modifications
Gene expression
• Reverse transcription (in the lab) converts mRNA into DNA; the product is called cDNA
• Genes have different activities at different
times and in different environments
• DNA Microarrays
– Measure gene transcription (amount of mRNA) in
a high-throughput fashion
– A surrogate of gene activity
Transcriptional regulation of genes
[Figure sequence, four frames: (1) a transcription factor (TF, a protein), RNA polymerase (a protein), and DNA with a promoter upstream of a gene; (2) the promoter region contains a TF binding site (cis-regulatory element); (3) and (4) the transcription factor and RNA polymerase at the TF binding site, with the gene giving rise to a new protein]
The cell as a regulatory network
• Gene D (inputs A, B, C; output: make D)
– If C then D
– If B then NOT D
– If A and B then D
• Gene B (inputs D, C; output: make B)
– If D then B
Northern Blot
(an old technique for measuring mRNA expression)
1. mRNA extracted and purified.
2. mRNA loaded for electrophoresis.
Lane 1: size standards.
Lane 2: RNA to be tested.
3. The gel is charged (- to +) and RNA “swims” through the gel according to size (molecular weight).
4. mRNA is transferred from the gel to a membrane.
5. Hybridization: a labeled probe specific for the RNA fragment is incubated with the blot, so the RNA of interest can be detected.
Needs a relatively large amount of mRNA.
http://www.escience.ws/b572/L13/north.html
RT-PCR (reverse transcription-polymerase chain reaction)
1. RNA is reverse transcribed to DNA.
2. PCR procedures can be used to amplify DNA at an exponential
rate.
3. Gel quantification for the amplified product.
---- a semi-quantitative method. Smaller amount of sample
needed.
See animation of RT-PCR:
http://www.bio.davidson.edu/courses/Immunology/Flash/RT_PCR.html
real-time RT-PCR
1. The PCR amplification can be monitored by fluorescence
in “real time”.
2. The fluorescence values recorded in each cycle represent
the amount of amplified product.
---- a quantitative method. Currently the most advanced and
accurate analysis for mRNA abundance. Often used to
validate microarray results.
http://www.ambion.com/techlib/basics/rtpcr/
Limitation of the old techniques
1. Labor intensive
2. Can only detect up to dozens of genes.
(gene-by-gene analysis)
What is a microarray
• A 2D array of DNA sequences
from thousands of genes
• Each spot has many copies of
same gene (probe)
• Allow mRNAs from a sample to
hybridize
– Form RNA-DNA double-strand
• Measure number of
hybridizations per spot
What is a Microarray (2)
Conceptually similar to a (reverse) Northern blot
(Many) probes, rather than mRNAs, are fixed on some
surface, in an ordered way
Microarray categories
• cDNA microarray
– Each probe is the cDNA of a gene (length: hundreds
to thousands of nucleotides)
– Stanford, Brown Lab
• Oligonucleotide microarray
– Each probe is a synthesized short DNA (uniquely
corresponding to a substring of a gene)
– Affymetrix: ~ 25mers
– Agilent: ~ 60 mers
• Others
Spotted cDNA microarray
Array Manufacturing
Each tube contains cDNAs corresponding to a unique
gene. Pre-amplified, and spotted onto a glass slide
Experiment
cy3
cy5
Data acquisition
Computer programs are used to process the image into digital signals.
• Segmentation: determine the boundary between signal and background
• Results: gene expression ratios between two samples
Affymetrix GeneChip®
Array Design
25-mer unique oligo (perfect match, PM)
paired with a copy carrying a mismatch (MM) at the middle
nucleotide
multiple probes (11~16) for each gene
from Affymetrix Inc.
Array Manufacturing
Technology adapted from semiconductor industry.
(photolithography and combinatorial chemistry)
In situ synthesis of oligonucleotides
from Affymetrix Inc.
GeneChip® Probe Arrays
[Figure: a GeneChip probe array (1.28 cm) carries >200,000 different complementary probes; each hybridized probe cell (24 µm) holds millions of copies of a specific oligonucleotide probe, to which single stranded, labeled RNA target binds]
Image of Hybridized Probe Array
Overview of the Affymetrix GeneChip technology
Each probe set combines to give an
absolute expression level.
Image segmentation is relatively easy.
But how to use MM signal is debatable
from Affymetrix Inc.
Comparison of cDNA array and GeneChip
• Probe preparation
– cDNA: probes are cDNA fragments, usually amplified by PCR and spotted by robot.
– GeneChip: probes are short oligos synthesized using a photolithographic approach.
• Colors
– cDNA: two-color (measures relative intensity)
– GeneChip: one-color (measures absolute intensity)
• Gene representation
– cDNA: one probe per gene
– GeneChip: 11-16 probe pairs per gene
• Probe length
– cDNA: long, varying lengths (hundreds to 1K bp)
– GeneChip: 25-mers
• Density
– cDNA: maximum of ~15000 probes.
– GeneChip: 38500 genes * 11 probes = 423500 probes
Affymetrix GeneChip: one color design
cDNA microarray: two color design
Why the difference?
• Affymetrix GeneChip: photolithography (the amount of oligos on a probe is well controlled)
• cDNA microarray: robotic spotting (the amount of cDNA spotted on a probe may vary greatly)
Advantages and disadvantages of
cDNA array and GeneChip
cDNA microarray | Affymetrix GeneChip
The data can be noisy and of variable quality. | Specific and sensitive. Results very reproducible.
Cross (non-specific) hybridization can often happen. | Hybridization more specific.
May need an RNA amplification procedure. | Can use small amount of RNA.
More difficult image analysis. | Image analysis and intensity extraction are easier.
Need to search the database for gene annotation. | More widely used. Better quality of gene annotation.
Cheap (both initial cost and per slide cost). | Expensive (~$400 per array + labeling and hybridization).
Can be custom made for special species. | Only several popular species are available.
Do not need to know the exact DNA sequence. | Need the DNA sequence for probe selection.
Typical Microarray Analysis
[Figure: workflow diagram connecting Raw data, Preprocess, Normalize, Filter (Present/Absent, Minimum value, Fold change), Significance, Clustering, Classification, Function (Gene Ontology), and Regulation (Motif finding)]
Raw data (each cell: VALUE followed by ABS_CALL; P = present, M = marginal, A = absent)
ID_REF | normal | tumor | tumor | normal | normal | tumor
AFFX-BioB-5_at | 210.6 P | 234.6 P | 362.5 P | 389 P | 305.6 P | 330.5 P
AFFX-BioB-M_at | 393 P | 327.8 P | 501.4 P | 816.5 P | 542 P | 440.8 P
AFFX-BioB-3_at | 264.9 P | 164.6 P | 244.7 P | 379.7 P | 261.3 P | 303.7 P
AFFX-BioC-5_at | 738.6 P | 676.1 P | 737.6 P | 1191.2 P | 917 P | 767.9 P
AFFX-BioC-3_at | 356.3 P | 365.9 P | 423.4 P | 711.6 P | 560.3 P | 484.9 P
AFFX-BioDn-5_at | 566.3 P | 442.2 P | 649.7 P | 834.3 P | 599.1 P | 606.9 P
AFFX-BioDn-3_at | 3911.8 P | 3703.7 P | 4680.9 P | 6037.7 P | 4653.7 P | 4232 P
AFFX-CreX-5_at | 6433.3 P | 5980 P | 7734.7 P | 10591 P | 8162.1 P | 8428 P
AFFX-CreX-3_at | 11917.8 P | 9376.7 P | 11509.3 P | 16814.4 P | 13861.8 P | 13653.4 P
AFFX-DapX-5_at | 12.2 A | 44.3 M | 31.2 A | 37.7 P | 33.3 A | 12.8 A
AFFX-DapX-M_at | 57.8 M | 42.5 A | 79 M | 48.8 P | 39.5 A | 39.2 A
AFFX-DapX-3_at | 29.8 A | 6.2 A | 23.4 A | 28.4 A | 3.2 A | 7.6 A
AFFX-LysX-5_at | 15.3 A | 16.2 A | 15.6 A | 16.7 A | 3.1 A | 3.9 A
AFFX-LysX-M_at | 33.2 A | 12 A | 17.7 A | 37.3 A | 49.2 A | 9.1 A
AFFX-LysX-3_at | 40.7 M | 10.7 A | 36.2 A | 22.1 A | 22.8 A | 28.2 A
AFFX-PheX-5_at | 7.8 A | 3 A | 7.6 A | 5.6 A | 5 A | 6.4 A
AFFX-PheX-M_at | 4.2 A | 4.8 A | 6.8 A | 6.1 A | 3.7 A | 5.5 A
AFFX-PheX-3_at | 54.2 A | 39.6 A | 19.4 A | 16.1 A | 44.7 A | 31.2 A
AFFX-ThrX-5_at | 8.2 A | 11.2 A | 13.2 A | 9.5 A | 8.5 A | 7.5 A
AFFX-ThrX-M_at | 38.1 A | 30.6 A | 37.6 A | 7.2 A | 26.9 A | 36.3 A
AFFX-ThrX-3_at | 15.2 A | 5 A | 15 A | 8.3 A | 36.8 A | 11.5 A
AFFX-TrpnX-5_at | 11.2 A | 11.8 A | 22.2 A | 22.1 A | 8.9 A | 35.6 A
AFFX-TrpnX-M_at | 9 A | 8.1 A | 9.1 A | 8.7 A | 8.1 A | 12 A
AFFX-TrpnX-3_at | 19.8 A | 12.8 A | 11.8 A | 43.2 M | 17.4 A | 10 A
AFFX-HUMISGF3A/M97935_5_at | 82.7 P | 120.7 P | 92.7 P | 46.4 P | 55.9 P | 46.5 P
AFFX-HUMISGF3A/M97935_MA_at | 397.6 P | 416.7 P | 244.8 A | 181.4 A | 197.5 A | 192.3 A
AFFX-HUMISGF3A/M97935_MB_at | 206.2 P | 303 P | 300.8 P | 253.5 P | 195.3 P | 216 P
AFFX-HUMISGF3A/M97935_3_at | 663.8 P | 723.9 P | 812.1 P | 666.1 P | 629.4 P | 754.1 P
AFFX-HUMRGE/M10098_5_at | 547.6 P | 405.9 P | 6894.7 P | 3496.1 P | 1958.5 P | 5799.4 P
AFFX-HUMRGE/M10098_M_at | 239.1 P | 175.8 P | 3675 P | 1348.6 P | 695.9 P | 2428.2 P
AFFX-HUMRGE/M10098_3_at | 1236.4 P | 721.4 P | 9076.1 P | 7795.9 P | 4237.1 P | 7890 P
AFFX-HUMGAPDH/M33197_5_at | 19508 P | 19267.1 P | 22892 P | 26584 P | 29666.6 P | 25038.1 P
AFFX-HUMGAPDH/M33197_M_at | 18996.6 P | 20610.4 P | 21573.7 P | 29936 P | 30106.6 P | 22380.2 P
AFFX-HUMGAPDH/M33197_3_at | 18016.4 P | 17463.8 P | 20921.3 P | 26908.3 P | 28382.2 P | 21885 P
AFFX-HSAC07/X00351_5_at | 23294.6 P | 21783.7 P | 18423.3 P | 21858.9 P | 23517.1 P | 19450.3 P
AFFX-HSAC07/X00351_M_at | 25373.1 P | 24922.8 P | 22384.2 P | 25760.2 P | 27718.5 P | 21401.6 P
AFFX-HSAC07/X00351_3_at | 20032.8 P | 20251.1 P | 20961.7 P | 23494.6 P | 23381.2 P | 21173.3 P
Preprocessing
• Background subtraction
– Account for non-specific hybridization
• Transformation (e.g. to log scale)
– Convenience
– Convert data into a certain distribution (e.g. normal)
assumed by many statistical procedures
• Normalization
– Remove systematic biases
– Make data from different samples comparable
• Filtering, averaging, etc.
– Remove random noise
Garbage in => Garbage out
The order of these steps may differ, and steps may be combined.
Background subtraction
• For cDNA array, relatively straightforward
– Raw data contain foreground and background values
– Foreground values obtained from detected spots
– Background values obtained from surrounding area
– It may occur that background > foreground
• For oligo array, probes are densely packed, so the surrounding area cannot be used
directly. Hope: MM captures non-specific hybridization?
• Recent studies suggest that PM and MM
are correlated.
• Better ignore MM entirely or
use with caution
• Available software tools
– MAS 5 (by Affymetrix)
– dChip
– GCRMA
[Figure: scatter plot of MM against PM intensities]
Normalization
• Where could errors come from?
• Random noise
– Repeat the same experiment twice, get different results
– Using multiple replicates reduces the problem
• Systematic errors
– Arrays manufactured at different times
– On the same array, probes printed with different printer tips may
have different biases
– Dye effect: difference between Cy5 and Cy3 labeling
– Experimental factors
• More mRNA applied to array A than to array B
• Sample preparation procedure
• Experiments carried out at different times, by different users, etc.
cDNA microarray data preprocessing
Typical experiments
• Wild-type cells vs mutated cells
• Diseased cells vs normal cells
• Cells under normal growth condition vs cells treated with chemicals
• Typically repeated several times
[Figure: data matrix of ratios, one row per probe (gene)]
Transforming cDNA microarray data
• Data: Cy5/Cy3 ratios as well as raw intensities
• Most common is log2 transformation
• 2 fold increase => log2(2) = 1
• 2 fold decrease => log2(1/2) = -1
[Figure: histograms (frequency) of the Cy5/Cy3 ratio and of log2 (Cy5/Cy3)]
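A minimal Python sketch of the log2 transformation described above. The spot intensities are hypothetical values, chosen so that the ratios are exact powers of two; the function name is an assumption for illustration.

```python
import math

def log2_ratio(cy5, cy3):
    """Return log2(cy5 / cy3) for one spot; both intensities must be positive."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive before taking logs")
    return math.log2(cy5 / cy3)

# Hypothetical (cy5, cy3) intensities: 2-fold up, 2-fold down, unchanged.
spots = [(2000.0, 1000.0), (500.0, 1000.0), (800.0, 800.0)]
log_ratios = [log2_ratio(r, g) for r, g in spots]
print(log_ratios)  # [1.0, -1.0, 0.0]
```

On the log scale, induction and repression of the same magnitude are symmetric around zero, which the raw ratio (2 versus 0.5) is not.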
Dye effect
cDNA microarray experiments using two identical samples.
Observation: Cy5 consistently lower than Cy3 (mean log (cy5/cy3) < 0).
Solution: dye swapping.
Dye swapping
• Chip 1: label test by cy5 and control by cy3
• Chip 2: label test by cy3 and control by cy5
• Ideally, cy5/cy3 on chip 1 = cy3/cy5 on chip 2
• Not so due to dye effect
• Compute average ratio:
½ log2 (cy5/cy3 on chip 1)
+ ½ log2 (cy3/cy5 on chip 2)
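The averaging step can be sketched as follows. This is an illustration under a simplifying assumption that the dye effect is a constant multiplicative bias on cy5; the gene, its intensities, and the 20% bias are hypothetical.

```python
import math

def dye_swap_log_ratio(chip1_cy5, chip1_cy3, chip2_cy5, chip2_cy3):
    """Average log2(test / control) over a dye-swapped pair of chips.

    Chip 1: test labeled with cy5, control with cy3.
    Chip 2: test labeled with cy3, control with cy5.
    """
    m1 = math.log2(chip1_cy5 / chip1_cy3)  # log2(test / control) on chip 1
    m2 = math.log2(chip2_cy3 / chip2_cy5)  # log2(test / control) on chip 2
    return 0.5 * m1 + 0.5 * m2

# Hypothetical gene: true 2-fold induction, cy5 reading 20% low on both chips.
test, control, cy5_bias = 2000.0, 1000.0, 0.8
m_chip1 = math.log2((test * cy5_bias) / control)  # single chip, biased downward
m_avg = dye_swap_log_ratio(test * cy5_bias, control, control * cy5_bias, test)
print(round(m_chip1, 3), round(m_avg, 3))
```

Under this assumption the bias enters the two chips with opposite signs on the log scale, so the average recovers log2(2) = 1 while chip 1 alone underestimates it.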
Total intensity normalization
• Even after dye-swapping, may still
see systematic biases
• Assume the total amount of
mRNAs should not change
between two samples
– Rescale so that the two colors
have same total intensity
– Assumption not necessarily true
– Rescale according to a subset of
genes
• House-keeping genes
• Middle 90% (for example) of genes
• Spike-in genes
[Figure: two histograms (frequency) of log2 (Cy5/Cy3)]
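A small sketch of the rescaling idea above, with an optional gene subset. The function name, the `subset` parameter, and the intensities are assumptions for illustration only.

```python
def scale_factor(cy5, cy3, subset=None):
    """Factor k such that sum(k * cy5) equals sum(cy3) over the chosen spots.

    subset: optional list of spot indices (for example house-keeping or
    spike-in genes); by default all spots are used.
    """
    idx = range(len(cy5)) if subset is None else subset
    return sum(cy3[i] for i in idx) / sum(cy5[i] for i in idx)

# Hypothetical intensities in which cy5 reads 10% low on every spot.
cy3 = [1000.0, 2000.0, 500.0, 3000.0]
cy5 = [900.0, 1800.0, 450.0, 2700.0]

k = scale_factor(cy5, cy3)                       # all genes
k_subset = scale_factor(cy5, cy3, subset=[0, 1])  # a chosen subset only
cy5_scaled = [k * v for v in cy5]
print(round(k, 3), round(k_subset, 3))
```

Because a single factor is applied to every spot, this corrects a global offset but cannot remove a bias that changes with intensity.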
M-A plot
• Also known as ratio-intensity plot
• M: log2(cy5 / cy3) = log2(cy5) – log2(cy3)
• A: ½ log2(cy5 * cy3) = (log2(cy5) + log2(cy3)) / 2
Ideal:
• M centered at zero
• Variance does not depend on A.
However:
• Systematic dependence between
M and A
• High variance of M for smaller A
[Figure: scatter plot of M (vertical axis) against A (horizontal axis)]
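The two coordinates follow directly from the definitions above. A minimal sketch; the function name and the example intensities are hypothetical.

```python
import math

def ma_values(cy5, cy3):
    """Return (M, A) for one spot from positive cy5 and cy3 intensities."""
    log_r, log_g = math.log2(cy5), math.log2(cy3)
    return log_r - log_g, 0.5 * (log_r + log_g)

m, a = ma_values(4096.0, 1024.0)
print(m, a)  # 2.0 11.0
```

M is the log ratio (here a 4-fold difference) and A is the average log intensity of the spot, so plotting M against A shows whether the ratio depends on how bright the spot is.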
Lowess normalization
• Lowess: Locally Weighted Regression
• Fit local polynomial functions
• M adjusted according to fitted line
[Figure: M-A plots before (M against A) and after (M’ against A) lowess normalization]
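To make the adjustment concrete, here is a simplified lowess-style sketch in plain Python: a tricube-weighted local linear fit of M on A, without the robustness iterations of the full lowess procedure. The function name, the `frac` parameter, and the data are assumptions for illustration; in practice an established implementation would normally be used.

```python
def lowess_fit(a, m, frac=0.5):
    """Fitted M at each A from tricube-weighted local linear regression."""
    n = len(a)
    k = max(3, int(frac * n))  # number of neighbours used in each local fit
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[min(k, n) - 1] or 1e-12  # bandwidth: distance to k-th nearest point
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mean_a = sum(wi * x for wi, x in zip(w, a)) / sw
        mean_m = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mean_a) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mean_a) * (y - mean_m) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (x0 - mean_a))
    return fitted

# Hypothetical data: a purely intensity-dependent bias, M = 0.1 * A - 1.
a_vals = [float(x) for x in range(6, 16)]
m_vals = [0.1 * x - 1.0 for x in a_vals]
fit = lowess_fit(a_vals, m_vals, frac=0.5)
m_adjusted = [y - f for y, f in zip(m_vals, fit)]  # M' = M - fitted value
```

In this toy case the bias is exactly linear in A, so the adjusted values M' are zero up to rounding; with real data the fitted curve follows the local trend and M' keeps the gene-specific deviations from it.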
Replicate filtering
• Experiments repeated
• Genes with very high
variability are
questionable
[Figure: scatter plot of Log2(ratio2) against Log2(ratio1) for two replicates (Ratio 1, Ratio 2)]
oligo microarray data preprocessing
(Affymetrix chip)
Typical experiments
• Multiple microarrays
– n samples (from different time, location, condition,
treatment, etc.)
– k replicates for each sample
• For example
– Samples collected from 100 healthy people and 100
cancer patients
– Cells treated with some drugs, take samples every 10
minutes
• Repeat on 3 – 5 microarrays for each sample
– Improve reliability of the results
– Often averaged after some preprocessing
Main characteristics
• For each gene, there are multiple PM and
MM probes (11-16 pairs)
– how to obtain overall intensities from these
probe-level intensities?
• Array outputs are absolute values rather
than ratios
– Cross-array normalization is important for
them to be comparable
Transformation
• Log transformation for one-color array
• When you get a data set from someone, be careful
with the scale
[Figure: histograms (frequency) of raw intensity and of log2 (raw intensity)]
Normalization
• Ideas similar to cDNA microarrays
– For cDNA microarrays, normalize on log ratios.
• May have one or more arrays.
– Here, normalize absolute expression values.
• Usually multiple arrays.
• Total intensity normalization
– Each array has the same mean intensity
– Can be based on all genes or a selected subset of genes
• House-keeping genes
• Middle 90% (for example) of genes
• Spike-in genes
• Lowess: using a common reference, or cyclic
• Many useful tools implemented in R (Bioconductor)
Quantile normalization
• Normalize multiple arrays
• Assume the distribution of the values obtained
from each array is the same or similar
[Figure: histograms (# of genes against expression) of the arrays before and after quantile normalization]
Quantile normalization
[Figure: bar plots of three arrays illustrating the steps: sort each column, take the mean across arrays at each rank, restore the original order]
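The sort, mean, and restore-order steps can be sketched in plain Python as below. The function name and the three small arrays are hypothetical, and ties are broken by position, which is a simplification.

```python
def quantile_normalize(arrays):
    """arrays: equal-length lists, one per array. Returns normalized lists."""
    n = len(arrays[0])
    sorted_cols = [sorted(col) for col in arrays]
    # Mean across arrays of the values that share each rank.
    rank_means = [sum(col[r] for col in sorted_cols) / len(arrays) for r in range(n)]
    result = []
    for col in arrays:
        order = sorted(range(n), key=lambda i: col[i])  # gene indices, lowest value first
        out = [0.0] * n
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]  # put the rank mean back at the gene's position
        result.append(out)
    return result

# Hypothetical intensities: three arrays, three genes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [3.0, 4.0, 8.0]]
normalized = quantile_normalize(arrays)
print(normalized[0])
```

After normalization every array contains exactly the same set of values, so the distributions are identical, while each gene keeps its rank within its own array.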
An example data set
• J DeRisi, V Iyer, and P Brown, “Exploring the Metabolic
and Genetic Control of Gene Expression on a Genomic
Scale”, Science, 278: 680 – 686, 1997
– Yeast cells grow in glucose medium
– When glucose was depleted, cells change their metabolic
pathways
• cDNA microarray
– Test: 2, 4, 6, 8, 10, 12, 14 hours after growth
– Control: 0 hour
– Total data points: ~6000 x 7
– No replicates!
– No normalization!
– Use fold-change to get differentially expressed genes!
Histogram of log ratios
Median = -0.27
Two possibilities:
• Dye effect
• Sample difference
[Figure: histogram (frequency) of log2 (cy5/cy3)]
Total intensity normalization
Median = -0.1
mean(cy3) = 3141
mean(cy5) = 2838
3141 / 2838 = 1.11
Other options:
• use median
• use subset of genes
– Exclude 10% extreme
– House-keeping genes
– Spike-in genes
– Etc.
• Net effect: constant
factor for every gene
[Figure: histogram (frequency) of log2 (1.11*cy5/cy3)]
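As a worked check of the numbers on this slide, the scaling factor and its effect on the log scale can be written out (the symbol $k$ is introduced here for illustration):

```latex
k = \frac{\operatorname{mean}(cy3)}{\operatorname{mean}(cy5)} = \frac{3141}{2838} \approx 1.11,
\qquad
\log_2\!\left(\frac{k \cdot cy5}{cy3}\right)
  = \log_2\!\left(\frac{cy5}{cy3}\right) + \log_2 k
  \approx \log_2\!\left(\frac{cy5}{cy3}\right) + 0.15
```

Every log ratio is therefore shifted by the same amount, about 0.15, which moves the median from -0.27 to about -0.12, in line with the reported median of -0.1.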
Intensity-intensity plot
[Figure: log-log scatter plots of cy3 intensity against cy5 intensity and against 1.1 * cy5 intensity (after total intensity normalization); axes from 10^2 to 10^5]
• Total intensity normalization worked well here
Intensity-intensity plot
[Figure: log-log scatter plots of cy3 intensity and of 1.02 * cy3 intensity (after total intensity normalization) against cy5 intensity; axes from 10^2 to 10^5]
• Did not work well for this experiment
• Dye-swapping can probably help
M-A plot
[Figure: M-A plots of log2 (cy5/cy3) and of log2 (1.11*cy5/cy3) against log2 (cy5*cy3)]
A: log2(cy5 * cy3) = log2(cy5) + log2(cy3) (plotted here without the factor ½ used in the earlier definition)
M: log2(cy5 / cy3) = log2(cy5) - log2(cy3)
M-A plot
[Figure: M-A plots of log2 (cy5/cy3) and of log2 (cy5/(1.02*cy3)) against log2 (cy5*cy3)]
Dependency of M on A
[Figure: M-A plots for several further arrays]
Box plot
[Figure: box plots of expression (vertical axis, -6 to 6) for samples 1 to 7]
Conclusions
• Microarrays provide a way to measure thousands
of genes simultaneously and make the global
monitoring of cellular activities possible.
• The method produces noisy data and
normalization is crucial.
• Real-time RT-PCR is used to validate a small number
of genes.
Limitation
• Measures mRNA instead of proteins. Actual
protein abundance and post-translational
modifications cannot be detected.
• Suitable for global monitoring; it should be
used to generate further hypotheses or be
combined with other carefully designed
experiments.
Mechanisms in microarray
Important mechanisms that make microarrays work:
1. Reverse transcription: mRNA => cDNA. This is
usually also the step at which dyes are attached.
(Protein cannot be reverse translated to mRNA or to
another form, so it is difficult to label with dyes.)
2. Double-strand binding of complementary DNA
sequences.
(Proteins have no such property; there
are 20 amino acids without complementary binding.)
Identify differentially expressed
genes
• Two samples: one normal, one cancer
– Which set of genes have significantly different expression levels
between the two samples?
• Naïve approach: fold change threshold (e.g. two fold)
– Log2 (cy5 / cy3) > 1: up-regulated / induced
– Log2(cy5 / cy3) < -1: down-regulated / repressed
• Still widely used – very simple
• Main problem: genes with low expression levels may
have a large fold change by chance
– From 10 to 100: ten fold
– From 1000 to 3000: three fold
– However: low-intensity => relatively high variance
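A minimal sketch of the fold-change rule above. The function name and the default threshold parameter are assumptions; the first two example calls use the intensities quoted on the slide.

```python
import math

def fold_change_call(test, control, threshold=1.0):
    """Label a gene from log2(test / control) using a symmetric threshold."""
    m = math.log2(test / control)
    if m > threshold:
        return "up"
    if m < -threshold:
        return "down"
    return "unchanged"

print(fold_change_call(100.0, 10.0))     # ten fold, from 10 to 100
print(fold_change_call(3000.0, 1000.0))  # three fold, from 1000 to 3000
print(fold_change_call(1000.0, 1500.0))  # below the two-fold threshold
```

Both of the first two genes pass the threshold, and the low-intensity one looks more extreme, although its measurement is the less reliable of the two. The rule uses no estimate of variance, which is the weakness the bullets above point out.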
Problem with fold change
[Figure: two scatter plots of log2(test / control) against sqrt(test * control)]
• The most “differentially” expressed genes are the ones
with the lowest average expression levels
More robust estimation of
differential expression
• Estimate variance as a function of average expression
• Compute a Z-score depending on location: $Z(x) = (x - \langle x \rangle) / \sigma(x)$
– $x$: log2(R/G) value
– $\langle x \rangle$: local mean
– $\sigma(x)$: local standard deviation
Reference: Quackenbush, Nat Gen, 2002
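One way to sketch an intensity-dependent Z-score is with a sliding window over genes sorted by average intensity. The window-based estimate, the function name, and the data are assumptions for illustration, not necessarily the procedure of the cited reference.

```python
import statistics

def local_z_scores(a, m, window=5):
    """Z-score of each M relative to the `window` genes closest in rank of A."""
    order = sorted(range(len(a)), key=lambda i: a[i])
    z = [0.0] * len(a)
    half = window // 2
    for pos, i in enumerate(order):
        lo = max(0, min(pos - half, len(order) - window))
        neighbours = [m[j] for j in order[lo:lo + window]]
        mu = statistics.mean(neighbours)   # local mean
        sd = statistics.stdev(neighbours)  # local standard deviation
        z[i] = (m[i] - mu) / sd if sd > 0 else 0.0
    return z

# Hypothetical data: noisy log ratios at low intensity, tight ones at high intensity.
a_vals = [float(x) for x in range(1, 11)]
m_vals = [2.0, -2.0, 1.5, -1.5, 2.5, 0.1, -0.1, 0.1, -0.1, 1.0]
z = local_z_scores(a_vals, m_vals)
print(round(z[0], 2), round(z[9], 2))
```

The first gene has the larger log ratio (2.0) but sits among equally noisy neighbours, while the last gene (1.0) stands out from a tight high-intensity neighbourhood and therefore gets the higher Z-score.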
SAM (Significance Analysis of
Microarrays)
• Tusher et al., PNAS 2001, 98:5116-5121
• Excel add-in (free download, technical details)
• Most cited method of microarray data analysis
• Example: Test - 3 reps; Control - 3 reps
Gene  | T1   | T2   | T3   | C1   | C2   | C3  | Ratio
Gene1 | 1000 | 2000 | 1500 | 200  | 300  | 250 | 6
Gene2 | 1000 | 2000 | 3000 | 1000 | 1500 | 500 | 2
Gene3 | 100  | 1000 | 100  | 20   | 80   | 50  | 8
Gene4 | 1800 | 1700 | 1900 | 1000 | 800  | 900 | 2
Which one is more significantly differentially expressed?
[Figure: bar plots of the replicate values for Gene 2 (Ratio = 2000/1000 = 2) and Gene 4 (Ratio = 1800/900 = 2)]
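SAM (Significance Analysis of
Microarrays)
• Basic idea: compute a statistic (e.g. Student’s t-test), with a small constant $s_0$ added to the denominator to avoid the small sample problem:
$t = (\bar{x}_T - \bar{x}_C) / (s + s_0)$, where $s$ is the standard error of the difference between the test and control means
• Larger t => higher significance
• P-value can be directly computed for t-test or estimated
from permutation test
Gene  | T1   | T2   | T3   | C1   | C2   | C3  | Ratio | t
Gene1 | 1000 | 2000 | 1500 | 200  | 300  | 250 | 6     | 4.3
Gene2 | 1000 | 2000 | 3000 | 1000 | 1500 | 500 | 2     | 1.5
Gene3 | 100  | 1000 | 100  | 20   | 80   | 50  | 8     | 1.2
Gene4 | 1800 | 1700 | 1900 | 1000 | 800  | 900 | 2     | 11.0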
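The t column of the table can be reproduced with an equal-variance two-sample t statistic and no added constant. The sketch below assumes that form; the `s0` parameter is included only to show where the SAM constant would enter, and the function name is hypothetical.

```python
import math
import statistics

def t_statistic(test, control, s0=0.0):
    """Two-sample t statistic with pooled variance; s0 is added to the denominator."""
    n1, n2 = len(test), len(control)
    pooled_var = ((n1 - 1) * statistics.variance(test)
                  + (n2 - 1) * statistics.variance(control)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(test) - statistics.mean(control)) / (se + s0)

# Replicate values from the table above.
genes = {
    "Gene1": ([1000, 2000, 1500], [200, 300, 250]),
    "Gene2": ([1000, 2000, 3000], [1000, 1500, 500]),
    "Gene3": ([100, 1000, 100], [20, 80, 50]),
    "Gene4": ([1800, 1700, 1900], [1000, 800, 900]),
}
t_values = {name: round(t_statistic(t, c), 1) for name, (t, c) in genes.items()}
print(t_values)  # {'Gene1': 4.3, 'Gene2': 1.5, 'Gene3': 1.2, 'Gene4': 11.0}
```

Gene3 has the largest ratio (8) but the smallest t, because one replicate drives its mean; Gene4 has a ratio of only 2 but very consistent replicates, so it ranks first.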
Permutation test to determine
significance
Row    | T1   | T2   | T3   | C1   | C2   | C3  | t
Gene1  | 1000 | 2000 | 1500 | 200  | 300  | 250 | 4.3
Perm1  | 1500 | 300  | 1000 | 250  | 2000 | 200 | 0.17
Perm2  | 1000 | 300  | 200  | 1500 | 2000 | 250 | -1.3
…
Perm-n | 2000 | 300  | 1000 | 1500 | 200  | 250 | 0.7
• Number of unique permutations: (6 choose 3) = 20.
• Smallest possible p-value: 1/20 = 0.05
• With 5 samples on each side: (10 choose 5) = 252
• With 10 samples on each side: (20 choose 10) = 184,756 (~185k)
• For small sample size: pool all genes
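For a single gene with three replicates per group, all 20 label assignments can be enumerated directly. The sketch below assumes a one-sided test and the pooled-variance t statistic; the function names are hypothetical.

```python
import math
import statistics
from itertools import combinations

def t_stat(test, control):
    """Two-sample t statistic with pooled variance."""
    n1, n2 = len(test), len(control)
    pooled_var = ((n1 - 1) * statistics.variance(test)
                  + (n2 - 1) * statistics.variance(control)) / (n1 + n2 - 2)
    diff = statistics.mean(test) - statistics.mean(control)
    return diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def permutation_p_value(test, control):
    """One-sided p-value: fraction of label assignments with t >= observed t."""
    pooled = list(test) + list(control)
    observed = t_stat(test, control)
    splits = list(combinations(range(len(pooled)), len(test)))
    hits = 0
    for chosen in splits:
        perm_test = [pooled[i] for i in chosen]
        perm_control = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if t_stat(perm_test, perm_control) >= observed - 1e-9:
            hits += 1
    return hits / len(splits), len(splits)

# Gene1 from the table above.
p, n_splits = permutation_p_value([1000, 2000, 1500], [200, 300, 250])
print(n_splits, p)
```

For Gene1 only the original labeling reaches the observed t, so the p-value equals the smallest attainable value, 1/20. This is why such small designs pool permuted statistics across all genes.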
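Permutation test
[Figure: SAM table with columns for the sorted real t values, the sorted t values from each permutation (t1, t2, …, tn), their average tavg, and the difference Treal - tavg, which is compared with the thresholds Δ and -Δ]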
False Discovery Rate (FDR)
• Multiple testing problem
– P-value cutoff = 0.05
– We tested 10000 genes
– Would expect 500 genes by chance at this significance level
– Found 600 genes with p < 0.05. Many might be due to noise.
• Bonferroni correction
– Use p-value cutoff 0.05 / 10000
– Among all genes selected, P(at least one false positive) <= 0.05
– Too conservative. Very few genes can be selected.
• False Discovery Rate (FDR)
– FDR = 0.1, meaning among all genes selected, (say 100), we
would expect 10 to be false positive
– FDR as high as 0.5 may be acceptable to biologists
– Several different approaches to estimate (Most popular:
Benjamini & Hochberg)
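The two corrections can be compared on a small list of p-values. This is a sketch of the Bonferroni cutoff and the Benjamini-Hochberg step-up rule; the function names and the ten p-values are hypothetical.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Indices of tests with p <= alpha / (number of tests)."""
    cutoff = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= cutoff]

def benjamini_hochberg_select(p_values, fdr=0.1):
    """Indices selected by the Benjamini-Hochberg step-up rule at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * fdr / m:
            largest_rank = rank  # keep the largest rank that passes its threshold
    return sorted(order[:largest_rank])

# Hypothetical p-values for ten genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]
print(bonferroni_select(p_values))          # [0]
print(benjamini_hochberg_select(p_values))  # [0, 1, 2, 3, 4]
```

Bonferroni keeps a single gene here, while controlling the FDR at 0.1 keeps five. The third and fourth genes fail their own rank thresholds but are still selected, because a gene with a larger p-value (rank 5) passes; this is the step-up behaviour.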
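FDR in SAM
[Figure: SAM table with columns for the sorted real t values, the sorted permutation t values (t1, t2, …, tn), their average tavg, and Treal - tavg compared with the thresholds Δ and -Δ]
FDR = (median number of “significant” ones in permuted columns) / (number of significant ones in real)
Small Δ: more genes selected; higher FDR.
Large Δ: fewer genes selected; lower FDR.
FDR in SAM
FDR = 1855/5065 ≈ 36.6%
FDR = 1.5/209 < 1%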
Source: “Practical Microarray Analysis”, Presentation by
Benedikt Brors, German Cancer Research Center
Classification (Supervised learning)
• (Clustering: unsupervised learning)
• Classification: separate items into groups based on
features of the items and based on a training set of
previously labeled items
• Many classification algorithms:
– Decision tree, SVM, naïve bayes, nearest neighbors, neural
networks, etc.
– Some tell you how the classification is made, which might help
biologists to understand the molecular mechanisms
– Some are black boxes
– In most cases, performance by different algorithms is similar. Having
the right features (predictor variables) is the key.
AML: acute myeloid leukemia
ALL: acute lymphoblastic leukemia
Classification is critical
for successful treatment.
Clinical distinction involves an experienced
hematopathologist’s interpretation of tumor
morphology, histochemistry, immunophenotyping,
and cytogenetic analysis, each performed in a
separate, highly specialized laboratory.
Still imperfect and errors do occur.
• Golub et al., Molecular Classification of Cancer:
Class Discovery and Class Prediction by Gene
Expression Monitoring, Science 286: 531 – 537,
1999
• Method: weighted vote (similar to centroid
classifier)
Centroid-based classifier
• Model Training: Based on
the training data calculate the
centroid for each class.
• Classification:
1. Given a data point, calculate the
distance between the point and
each of the class centroids.
2. Assign the point to the closest
class
[Figure: samples plotted by expression of genes G1 and G2; two classes (* and o) with the ALL centroid and the AML centroid (c1, c2); an unlabeled point x (?) lies at distances d1 and d2 from the two centroids]
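The two steps above can be sketched as follows, assuming Euclidean distance. The function names and the two-gene expression values are hypothetical, and this is the plain centroid rule rather than the weighted vote of the cited paper.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return [sum(col) / len(points) for col in zip(*points)]

def train_centroids(samples, labels):
    """Map each class label to the centroid of its training samples."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {y: centroid(pts) for y, pts in groups.items()}

def classify(x, centroids):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

# Hypothetical expression of two genes (G1, G2) in six labeled samples.
samples = [(8.0, 2.0), (9.0, 3.0), (7.0, 2.5), (2.0, 8.0), (3.0, 7.0), (2.5, 9.0)]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
centroids = train_centroids(samples, labels)
print(classify((6.0, 4.0), centroids))  # ALL
```

Training reduces each class to a single point, so prediction only needs one distance per class, which also makes the decision easy to inspect.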
K-Nearest-Neighbour classifier
• Model Training: none
• Classification:
– Given a data point x, locate the K nearest
points.
– Return the most common class label
among the K points nearest to x
• We usually set K > 1 to avoid outliers
• Variations:
– Can also use a radius threshold rather than K.
– We can also set a weight for each neighbour that
takes into account how far it is from the query
point
[Figure: query point x surrounded by training points labeled + and _]
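A minimal sketch of the unweighted K-nearest-neighbour vote with Euclidean distance. The function name and the points, including one deliberately mislabeled outlier, are hypothetical.

```python
import math
from collections import Counter

def knn_classify(x, samples, labels, k=3):
    """Majority label among the k training samples closest to x."""
    nearest = sorted(range(len(samples)), key=lambda i: math.dist(x, samples[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: a "+" group, a "-" group, and one "-" outlier
# sitting inside the "+" group.
samples = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (1.1, 1.1)]
labels = ["+", "+", "+", "-", "-", "-"]
query = (1.15, 1.1)

print(knn_classify(query, samples, labels, k=1))  # -
print(knn_classify(query, samples, labels, k=3))  # +
```

With K = 1 the query takes the label of the nearby outlier; with K = 3 the two neighbouring "+" points outvote it, which is the reason for choosing K > 1.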
Cancer classification
• Tons of papers have been published. Many claimed high
accuracy.
– Be careful when evaluating those papers.
– Very easy to overfit: far more genes than
samples
• Simple methods often outperform fancy ones
– SVM and KNN among best
• Simple methods are usually also more robust and easier
to interpret
• In most cases, performance by different algorithms is similar.
Having the right features (predictor variables) is the key.
Clustering microarray data
• Unsupervised learning
• Group genes into co-expressed sets
– Genes with similar expression patterns across
multiple experiments may be co-regulated
• Group experiments into clusters
– Experiments within the same group may have similar
“gene expression” signature
– For example, disease sub-types that can be classified
from gene expression data
Clustering microarray data
• How to tell if two expression vectors are
similar?
– Define the (dis)-similarity measure between
two vectors
• How to group multiple profiles into
meaningful subsets ?
– Describe the clustering procedure
• Are the results meaningful ?
– Evaluate biological meaning of a clustering
(Dis)-similarity measures
• Two genes, X=(x1,…, xm) and Y=(y1,…,ym).
• Euclidean distance: $D(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$
• Pearson correlation coefficient
• Cosine similarity
• Mutual information
• Etc.
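The first two measures can be compared on a pair of profiles that have the same shape but different magnitude. A short sketch; the function names and the profiles are hypothetical.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences between two profiles."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Hypothetical profiles: gene Y follows gene X at ten times the magnitude.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(round(euclidean(x, y), 2), round(pearson(x, y), 2))
```

The two genes are far apart in Euclidean distance yet perfectly correlated, so the choice of measure decides whether they end up in the same cluster.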
Clustering algorithms
• Hierarchical clustering
• K-means clustering
• Self Organizing Maps (SOMs)
• Spectral clustering
• Model-based
• Graph-based
• Etc.
• Jiang and Zhang, Cluster Analysis for Gene Expression
Data: A Survey, IEEE Transactions on Knowledge and
Data Engineering, Vol. 16, No. 11. (2004), pp. 1370-1386
Hierarchical clustering
• Agglomerative or divisive (less popular)
• Agglomerative basic idea:
– Given n genes
– Initially every gene is its own cluster
– For each iteration
• find the two most similar genes (or gene groups),
combine into one cluster
• Terminate when only one cluster is left
• (how to define similarity between two groups?)
Hierarchical clustering
[Figure: dendrogram over items a, b, c, d, e, f]
• Exact behavior depends on how to compute the distance between
two clusters
• No need to specify number of clusters
• A distance cutoff is often chosen to break the tree into clusters
Distance between clusters
• Single-linkage
– Not recommended
– Can be reduced to MST
• Complete-linkage
• Average-linkage
– (very similar to UPGMA)
• Centroid method
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
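The four cluster-to-cluster distances listed above differ only in how they summarize the pairwise distances. The sketch below assumes Euclidean distance between profiles; the function names and the two small clusters are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise(cluster_a, cluster_b):
    """All distances between a member of cluster_a and a member of cluster_b."""
    return [euclidean(x, y) for x in cluster_a for y in cluster_b]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))   # closest pair

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))   # farthest pair

def average_linkage(cluster_a, cluster_b):
    d = pairwise(cluster_a, cluster_b)
    return sum(d) / len(d)                       # mean over all pairs

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def centroid_distance(cluster_a, cluster_b):
    return euclidean(centroid(cluster_a), centroid(cluster_b))

# Two hypothetical clusters of two profiles each.
cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[3.0, 0.0], [3.0, 2.0]]

single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
centroid_d = centroid_distance(cluster_a, cluster_b)
print(single, complete, average, centroid_d)
```

Single linkage only looks at the closest pair, so one stray gene can chain two otherwise distinct groups together; this is one common reason it is not recommended.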
An example
[Figure: heat map of the example data set; rows: genes (about 550), columns: experiments (50), color scale from -3 to 3]
Hierarchical clustering
Average linkage. Cluster genes only.
Average linkage. Cluster both genes and experiments.
K-means
• Basic idea:
– Given n genes
– Guess number of clusters: k
– (Randomly) choose k genes as
cluster centers
– Assign each gene to the closest
center
– Re-compute the center of each cluster
– Repeat the last two steps until the assignment is stable
Similar to EM. Objective function: minimize the total (squared) distance to cluster centers.
May be trapped in local optima. Multiple runs with different random starting
points are generally needed.
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
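The K-means steps above, including the restarts from different random starting points, can be sketched in plain Python. This is a minimal illustration, assuming squared Euclidean distance as the objective; `kmeans`, `best_of_runs`, the seeds and the example profiles are all hypothetical names and values chosen for the sketch.

```python
import random

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(profiles, k, seed=0, max_iter=100):
    """One K-means run; returns (assignment, centers, total squared distance)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(profiles, k)]  # k genes as initial centers
    assignment = None
    for _ in range(max_iter):
        # Assign each gene to the closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        if new_assignment == assignment:  # assignment is stable: stop
            break
        assignment = new_assignment
        # Re-compute the center of each cluster as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(squared_distance(p, centers[a]) for p, a in zip(profiles, assignment))
    return assignment, centers, cost

def best_of_runs(profiles, k, n_runs=10):
    """Repeat K-means from different random starts and keep the lowest-cost run."""
    runs = (kmeans(profiles, k, seed=s) for s in range(n_runs))
    return min(runs, key=lambda run: run[2])

# Hypothetical profiles forming two well-separated groups.
profiles = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
assignment, centers, cost = best_of_runs(profiles, k=2)
print(assignment, cost)
```

Keeping the run with the lowest total squared distance is one simple way to reduce the effect of local optima; it does not guarantee the global optimum.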
K-means
[Figure: heat map of the K-means result on the example data, K = 15]
Another view of clusters
[Figure: expression profiles of the genes in Cluster 1 and in Cluster 4; x-axis: experiments (0 to 50), y-axis: log ratio (-4 to 4)]
How to determine number of
clusters?
• An open problem
• Larger K:
– More homogeneity within clusters
– Less separation between clusters
• Smaller K:
– The opposite: less homogeneity within clusters, more separation between clusters
• Many heuristic methods have been
proposed; none is uniformly good
Heuristics to determine number of
clusters
• Tibshirani, Walther and Hastie, Estimating the number of
clusters in a dataset via the gap statistic (2000)
• Define some statistic with respect to the number of
clusters
– Gap statistic: compare the log of the (weighted) average distance to cluster centers
with its expected value under a reference (null) distribution
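For reference, the gap statistic can be written out as follows. The symbols are not on the slide and follow the usual notation of the cited paper: $C_r$ is cluster $r$ with $n_r$ members, $d_{ii'}$ is the distance between items $i$ and $i'$, and $E_n^{*}$ is the expectation under a reference distribution with no cluster structure.

```latex
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} \sum_{i, i' \in C_r} d_{ii'},
\qquad
\mathrm{Gap}_n(k) = E_n^{*}\left[\log W_k\right] - \log W_k
```

A large gap means the observed clusters are much tighter than would be expected by chance for that k. In practice the expectation is typically estimated by clustering simulated reference data sets.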
Evaluating clustering
• Do genes in the same cluster share similar
functions?
– Functional enrichment analysis
• Do genes in the same cluster share similar
cis-regulatory motifs?
– Motif finding
Gene Ontology (GO)
• Gene functions were often defined using
free text
• Hard to extract, transfer, revise, predict,
annotate, comprehend, manage …
• The vocabulary should be predefined and commonly agreed upon
• Gene Ontology provides a controlled
vocabulary to describe gene and gene
product attributes
Gene ontology
• Two parts
– Ontology: the vocabulary (list of terms) to use
– Annotations: characterizing genes using
ontology terms
• Three ontology categories
– Biological process
– Molecular function
– Cellular component
Part of a GO graph
Each GO category is a
directed acyclic graph
A term can have multiple
parents, and multiple
children.
A gene can be annotated by
multiple terms.
If a gene is annotated by a child term,
it is automatically annotated by
all ancestor terms.
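The propagation rule in the last point can be sketched as a walk up the graph. The tiny ontology below is a hypothetical toy, not the real GO graph, and the term and gene names are made up for illustration.

```python
# Hypothetical toy ontology: each term maps to its parent terms (a DAG).
parents = {
    "process": [],
    "metabolism": ["process"],
    "biosynthesis": ["metabolism"],
    "coenzyme metabolism": ["metabolism"],
    # A term with two parents, which a tree could not represent.
    "coenzyme biosynthesis": ["biosynthesis", "coenzyme metabolism"],
}

def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def propagate(direct_annotations, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    full = {}
    for gene, terms in direct_annotations.items():
        closure = set(terms)
        for term in terms:
            closure |= ancestors(term, parents)
        full[gene] = closure
    return full

# Hypothetical gene annotated directly with the most specific term only.
direct = {"geneA": {"coenzyme biosynthesis"}}
full = propagate(direct, parents)
print(sorted(full["geneA"]))
```

Propagating annotations before an enrichment analysis means that a gene annotated only with a specific term is also counted for every broader term above it.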
Example functional enrichment
analysis
• Total number of genes in yeast: 7268
– 65 genes have function in co-enzyme biosynthesis
• Cluster A: 100 genes
– 20 of them have function in co-enzyme biosynthesis
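Significance can be computed using the
cumulative hypergeometric test:
if we randomly draw 100 genes from the
genome, what's the chance that we'll
see at least 20 co-enzyme biosynthesis
genes?
[Figure: Venn diagram; 7268 genes in total, 65 annotated, 100 in the cluster, 20 in the overlap]
$$\mathrm{cHypegeom}(n; M, N, m) = \sum_{i=n}^{\min(m, N)} \frac{\binom{N}{i} \binom{M-N}{m-i}}{\binom{M}{m}}$$
Here M is the number of genes in the genome, N the number of genes with the function,
m the cluster size, and n the number of genes in the cluster with the function.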
Example functional enrichment
analysis
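[Figure: Venn diagram; 7268 genes in total, 65 annotated, 100 in the cluster, 20 in the overlap]
If we randomly draw 100 genes from the
genome, the probability that we'll see exactly
20 co-enzyme biosynthesis genes is:
$$\mathrm{hypegeom}(20; 7268, 65, 100) = \frac{\binom{65}{20} \binom{7203}{80}}{\binom{7268}{100}} = 1.36 \times 10^{-22}$$
The p-value of enrichment is the probability of seeing at least 20:
$$\mathrm{cHypegeom}(20; 7268, 65, 100) = \sum_{i=20}^{65} \mathrm{hypegeom}(i; 7268, 65, 100) = 1.39 \times 10^{-22}$$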
Correction for multiple testing is usually recommended, as many GO
terms are tested.
Besides GO, other information can also be used to test for enrichment,
e.g. protein complexes, pathways, motifs, etc.
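The yeast example can be recomputed with exact integer arithmetic from the Python standard library. This is a sketch, not the course's code: the function names are hypothetical, the argument order follows the slide's (n; M, N, m) convention, and the Bonferroni step with 1000 tested terms is only one possible correction, which the slides do not name.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(n, M, N, m):
    """P(exactly n annotated genes) when drawing m genes from M, of which N are annotated."""
    return Fraction(comb(N, n) * comb(M - N, m - n), comb(M, m))

def hypergeom_tail(n, M, N, m):
    """P(at least n annotated genes): the enrichment p-value."""
    return sum(hypergeom_pmf(i, M, N, m) for i in range(n, min(m, N) + 1))

# Numbers from the example: 7268 genes, 65 annotated, cluster of 100, overlap of 20.
p_exact = float(hypergeom_pmf(20, 7268, 65, 100))
p_value = float(hypergeom_tail(20, 7268, 65, 100))
print(f"exactly 20: {p_exact:.3g}, at least 20: {p_value:.3g}")

# Assumption for illustration: 1000 GO terms were tested, corrected with Bonferroni.
n_terms_tested = 1000
p_bonferroni = min(1.0, p_value * n_terms_tested)
print(f"Bonferroni-corrected: {p_bonferroni:.3g}")
```

The two printed probabilities can be compared with the 1.36 × 10^-22 and 1.39 × 10^-22 quoted in the example. Using `Fraction` avoids any floating-point overflow or rounding in the large binomial coefficients.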
Gene Ontology Tools
• geneontology.org
– Download ontology files, species-specific annotation
files
– Links to many useful analysis tools
• Tools for enrichment analysis
– GO:TermFinder. Downloadable. (Web interface
available at SGD for yeast only)
– FuncAssociate: Web tool. About a dozen model organisms
(human, mouse, fruit fly, C. elegans, yeast, Arabidopsis,
etc.).
– DAVID Bioinformatics Resources: Web tool.
(Downloadable). Mammalian genes.
RNA-Seq
Transcriptome Analysis
Figure 5 | Overview of RNA-Seq. An RNA fraction of
interest is selected, fragmented and reverse
transcribed. The resulting cDNA can then be
sequenced using any of the current ultra-high-throughput
technologies to obtain ten to a hundred
million reads, which are then mapped back onto the
genome. The reads are then analyzed to calculate
expression levels.
Shirley Pepke, Barbara Wold & Ali Mortazavi
Nature Methods 6, S22 - S32 (2009) Published
online: 15 October 2009
doi:10.1038/nmeth.1371
RNA-Seq: Strategies
Figure 1 from Haas & Zody, 2010
RNA-Seq: Strategies
• Alignment Strategy
– Align to transcriptome
• no new transcript discovery
– Align to genome and exon-exon junction sequences
• extremely large search
space due to all possible
exon combinations
– De novo assembly
• Cufflinks
• Scripture
Shirley Pepke, Barbara Wold & Ali Mortazavi
Nature Methods 6, S22 - S32 (2009)
Published online: 15 October 2009
doi:10.1038/nmeth.1371
RNA-seq
Microarray | RNA-seq
Hybridization-based | Sequencing-based
Can only detect transcripts with known genomic sequences | For both known and new transcripts
Cannot be easily updated when new genome sequence info becomes available | May be updated when new genome sequence info becomes available
Low signal-to-noise ratio due to cross-hybridization etc. | No cross-hybridization issue => higher signal-to-noise ratio
Relatively narrow dynamic range | Ability to quantify a large dynamic range of expression levels
Insignificant computational challenge | Substantial computational challenge
Substantial data interpretation challenge | Intermediate data interpretation challenge