Gene Expression Arrays

SPH 247
Statistical Analysis of
Laboratory Data
April 16, 2013
Basic Design of Expression Arrays
 For each gene that is a target for the array, we have a
known DNA sequence.
 mRNA is reverse transcribed to DNA, and if a
complementary sequence is on the chip, the DNA
will be more likely to stick
 The DNA is labeled with a dye that will fluoresce and
generate a signal that is monotonic in the amount in
the sample
[Figure: double-stranded DNA sequence with regions labeled Intron, Exon, and Probe Sequence]
TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACGTTAATATGACTACCTGCGCAACCCTAACGTCCATGTATCTAATACG
ATTTAGCTATGCGTAATCAAGCTGGATAGCTTCTGGGTTGTGCCTAAGCTATGCAATTATACTGATGGACGCGTTGGGATTGCAGGTACATAGATTATGC
• cDNA arrays use variable length probes derived from
expressed sequence tags
– Spotted and almost always used with two color methods
– Can be used in species with an unsequenced genome
• Long oligoarrays use 60-70mers
– Agilent two-color arrays
– Illumina Bead Arrays
– Usually use computationally derived probes but can use probes
from sequenced ESTs
 Affymetrix GeneChips use multiple 25-mers
 For each gene, one or more sets of 8-20 distinct probes
 May overlap
 May cover more than one exon
 Affymetrix chips also use mismatch (MM) probes
that have the same sequence as perfect match probes
except for the middle base, which is changed to inhibit
binding.
 This is supposed to act as a control, but the MM probe
often binds to another mRNA species instead, so many
analysts do not use them
Illumina Bead Arrays
 Beads are coated with many copies of a 50-mer
gene-specific probe and a 29-mer address sequence
 Multiple beads per probe; the number is random, but around 20
 Each Ref-8 chip contains 8 arrays with ~25,000
targets, plus controls
 Each WG-6 chip contains 6 arrays with ~50,000
targets, plus controls
 Each HT-12 chip contains 12 arrays with ~50,000
targets, plus controls
Probe Design
 A good probe sequence should match the chosen gene
or exon from a gene and should not match any other
gene in the genome.
 Melting temperature depends on the GC content and
should be similar on all probes on an array since the
hybridization must be conducted at a single
temperature.
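As a rough illustration of the GC-content point, the base-R sketch below computes the GC fraction of candidate probes so that they can be compared. The function name gc_content is hypothetical, and the two 25-mers are simply stretches of the example sequence shown earlier, not real array probes. Actual probe design relies on more detailed melting-temperature models and on checking for matches elsewhere in the genome.

```r
# Fraction of G and C bases in a sequence (hypothetical helper, base R only)
gc_content <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  mean(bases %in% c("G", "C"))
}

# Two illustrative 25-mers taken from the example sequence (assumption:
# these are not actual probes from any array)
probes <- c(p1 = "TAAATCGATACGCATTAGTTCGACC",
            p2 = "GCGCAACCCTAACGTCCATGTATCT")

# Probes with very different GC fractions would be expected to hybridize
# best at different temperatures
gc <- sapply(probes, gc_content)
print(gc)
```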
 The affinity of a given piece of DNA for the probe
sequence can depend on many things, including
secondary and tertiary structure as well as GC content.
 This means that the relationship between the
concentration of the RNA species in the original
sample and the brightness of the spot on the array can
be very different for different probes for the same gene.
 Thus only comparisons of intensity within the same
probe across arrays make sense.
 A higher signal for one gene than another on the same
array does not mean that the copy number is higher.
Affymetrix GeneChips
 For each probe set, there are 8-20 perfect match (PM)
probes which may overlap or not and which target the
same gene
 There are also mismatch (MM) probes which are
supposed to serve as a control, but do so rather badly
 Most of us ignore the MM probes
Expression Indices
 A key issue with Affymetrix chips is how to summarize
the multiple data values on a chip for each probe set
(aka gene).
 There have been a large number of suggested
methods.
 Generally, the worst ones are those from Affy, by a long
way; worse means less able to detect real differences
 Summary of Illumina beads is simpler, but there are
still issues.
Usable Methods
 Li and Wong’s dCHIP and follow-on work is
demonstrably better than MAS 4.0 and MAS 5.0, but
not as good as RMA and GLA
 The RMA method of Irizarry et al. is available in
Bioconductor.
 The GLA method (Durbin, Rocke, Zhou) is also
available in Bioconductor/CRAN as part of the
LMGene R package
Bioconductor Documentation
> library(affy)
Loading required package: Biobase
Loading required package: tools
Welcome to Bioconductor
Vignettes contain introductory material. To view, type
'openVignette()'. To cite Bioconductor, see
'citation("Biobase")' and for packages 'citation(pkgname)'.
Loading required package: affyio
Loading required package: preprocessCore
Bioconductor Documentation
> openVignette()
Please select a vignette:

 1: affy - 1. Primer
 2: affy - 2. Built-in Processing Methods
 3: affy - 3. Custom Processing Methods
 4: affy - 4. Import Methods
 5: affy - 5. Automatic downloading of CDF packages
 6: Biobase - An introduction to Biobase and ExpressionSets
 7: Biobase - Bioconductor Overview
 8: Biobase - esApply Introduction
 9: Biobase - Notes for eSet developers
10: Biobase - Notes for writing introductory 'how to' documents
11: Biobase - quick views of eSet instances

Selection:
Reading Affy Data into R
 The CEL files contain the data from an array. We will
look at data from an older type of array, the U95A
which contains 12,625 probe sets and 409,600 probes.
 The CDF file contains information relating probe pair
sets to locations on the array. For standard array types,
the affy package obtains this automatically (as a CDF
package such as hgu95av2cdf).
Example Data Set
 Data from Robert Rice’s lab on twelve keratinocyte cell
lines, at six different stages.
 Affymetrix HG U95A GeneChips.
 For each “gene”, we will run a one-way ANOVA on
stage, with two observations per stage.
 For this illustration, we will use RMA.
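A minimal sketch of the per-gene one-way ANOVA, for a single probe set, might look like the following. The twelve values are the RMA expression values for 100_g_at copied from the exprs(eset) printout later in these notes. It is an assumption, inferred only from the file names, that LN0 to LN5 denote the six stages and that A and B are the two replicates. The variable names are illustrative.

```r
# RMA values for probe set 100_g_at, in the order
# LN0A, LN0B, LN1A, LN1B, ..., LN5A, LN5B
y <- c(9.195937, 9.388350, 9.443115, 9.012228, 9.311773, 9.386037,
       9.386089, 9.394606, 9.602404, 9.711533, 9.826789, 9.645565)

# Assumption: file names LN0..LN5 encode stage, A/B encode replicate
stage <- factor(rep(0:5, each = 2))

# One-way ANOVA: 5 df for stage, 6 df for error
fit <- aov(y ~ stage)
tab <- summary(fit)[[1]]
print(tab)

# F statistic and p-value for the stage effect
f_stat <- tab[["F value"]][1]
p_val  <- tab[["Pr(>F)"]][1]
```

In a full analysis the same model would be fit to every row of the expression matrix, and the resulting 12,625 p-values would need an adjustment for multiple testing.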
Files for the Analysis
 .CDF file has U95A chip definition (which probe is
where on the chip). Obtained automatically by the affy package.
 .CEL files contain the raw data after pixel level analysis,
one number for each spot. Files are called LN0A.CEL,
LN0B.CEL…LN5B.CEL and are on the web site.
 409,600 probe values in 12,625 probe sets.
The ReadAffy function
 The ReadAffy() function reads all of the CEL files in
the current working directory into an object of
class AffyBatch, which extends the eSet class that
also underlies ExpressionSet
 ReadAffy(widget=TRUE) does so in a GUI that
allows entry of other characteristics of the dataset
 You can also specify filenames, phenotype or
experimental data, and MIAME information
> rrdata <- ReadAffy()
> class(rrdata)
[1] "AffyBatch"
attr(,"package")
[1] "affy"
> dim(exprs(rrdata))
[1] 409600     12
> colnames(exprs(rrdata))
 [1] "LN0A.CEL" "LN0B.CEL" "LN1A.CEL" "LN1B.CEL" "LN2A.CEL" "LN2B.CEL"
 [7] "LN3A.CEL" "LN3B.CEL" "LN4A.CEL" "LN4B.CEL" "LN5A.CEL" "LN5B.CEL"
> length(probeNames(rrdata))
[1] 201800
> length(unique(probeNames(rrdata)))
[1] 12625
> length((featureNames(rrdata)))
[1] 12625
> featureNames(rrdata)[1:5]
[1] "100_g_at"  "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"
The ExpressionSet class
 An object of class ExpressionSet has several slots
the most important of which is an assayData object,
containing one or more matrices. The best way to
extract parts of this is using appropriate methods.
 exprs() extracts an expression matrix
 featureNames() extracts the names of the probe sets.
Expression Indices
 The 409,600 rows of the expression matrix in the
AffyBatch object rrdata each correspond to a probe (25-mer)
 Ordinarily, to use this we need to combine the probe
level data for each probe set into a single expression
number
 Conceptually, this involves several steps
Steps in Expression Index Construction
 Background correction is the process of adjusting the
signals so that the zero point is similar on all parts of
all arrays.
 We like to manage this so that zero signal after
background correction corresponds approximately to
zero amount of the mRNA species that is the target of
the probe set.
 Data transformation is the process of changing the
scale of the data so that it is more comparable from
high to low.
 Common transformations are the logarithm and
generalized logarithm
 Normalization is the process of adjusting for
systematic differences from one array to another.
 Normalization may be done before or after
transformation, and before or after probe set
summarization.
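The transformation and normalization steps can be sketched in base R as follows. This is an illustration on simulated intensities, not the code used by any Bioconductor package. The function names glog and quantile_normalize, the value of lambda, and the simulated data are all assumptions made for the example. The generalized logarithm is written here in one common form, log(y + sqrt(y^2 + lambda)), which behaves like an ordinary logarithm for large intensities but stays finite near zero.

```r
# Generalized logarithm; lambda is a tuning constant (value assumed here)
glog <- function(y, lambda = 100) log(y + sqrt(y^2 + lambda))

# Quantile normalization: give every array (column) the same empirical
# distribution, namely the mean of the sorted columns
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  ref    <- rowMeans(sorted)               # reference distribution
  out    <- apply(ranks, 2, function(r) ref[r])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(1)
# Simulated intensities: 1000 probes on 3 arrays that differ in overall scale
x <- sapply(c(1, 1.5, 0.7), function(s) s * rexp(1000, rate = 1 / 200))
colnames(x) <- c("array1", "array2", "array3")

x_norm <- quantile_normalize(x)   # normalize across arrays
x_log  <- log2(x_norm)            # ordinary log transform
x_glog <- glog(x_norm)            # generalized log transform

# After normalization the column quantiles agree
print(apply(x_norm, 2, quantile))
```

Here normalization is applied before transformation, which is one of the orderings mentioned above; the other orderings are equally easy to write down.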
 One may use only the perfect match (PM) probes, or
may subtract or otherwise use the mismatch (MM)
probes
 There are many ways to summarize 20 PM probes and
20 MM probes on 10 arrays (200 PM and 200 MM values)
into 10 expression index numbers
Probe intensities for LASP1 in a radiation
dose-response experiment

                           Dose
Probe                0       1      10     100    Mean
200618_at1         360     216     158     198   233.0
200618_at2         313     402     106     103   231.0
200618_at3         130     182      79      91   120.5
200618_at4         351     370     195     136   263.0
200618_at5         164     130      98     107   124.8
200618_at6         223     219     164     196   200.5
200618_at7         437     529     195     158   329.8
200618_at8         509     554     274     128   366.3
200618_at9         522     720     285     198   431.3
200618_at10        668     715     247     260   472.5
200618_at11        306     286     144     159   223.8
Expression Index 362.1   393.0   176.8   157.6
Log probe intensities for LASP1 in a radiation
dose-response experiment

                           Dose
Probe                0       1      10     100    Mean
200618_at1        2.56    2.33    2.20    2.30    2.35
200618_at2        2.50    2.60    2.03    2.01    2.28
200618_at3        2.11    2.26    1.90    1.96    2.06
200618_at4        2.55    2.57    2.29    2.13    2.38
200618_at5        2.21    2.11    1.99    2.03    2.09
200618_at6        2.35    2.34    2.21    2.29    2.30
200618_at7        2.64    2.72    2.29    2.20    2.46
200618_at8        2.71    2.74    2.44    2.11    2.50
200618_at9        2.72    2.86    2.45    2.30    2.58
200618_at10       2.82    2.85    2.39    2.41    2.62
200618_at11       2.49    2.46    2.16    2.20    2.33
Expression Index  2.51    2.53    2.21    2.18
The RMA Method
 Background correction that does not make 0 signal
correspond to 0 amount
 Quantile normalization
 Log2 transform
 Median polish summary of PM probes
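The summarization step alone can be sketched with the medpolish function from base R (package stats), using the eleven LASP1 probe intensities from the dose-response table earlier in these notes. This is only an illustration of the last step: the values below have not been background corrected or quantile normalized, so the result is not what rma() would report. The logged table earlier appears to use base-10 logs, while RMA works on the log2 scale, which is what is used here. Variable names are illustrative.

```r
# LASP1 probe intensities (rows: probes, columns: doses 0, 1, 10, 100)
lasp1 <- matrix(c(
  360, 216, 158, 198,
  313, 402, 106, 103,
  130, 182,  79,  91,
  351, 370, 195, 136,
  164, 130,  98, 107,
  223, 219, 164, 196,
  437, 529, 195, 158,
  509, 554, 274, 128,
  522, 720, 285, 198,
  668, 715, 247, 260,
  306, 286, 144, 159),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("200618_at", 1:11), c("0", "1", "10", "100")))

log_pm <- log2(lasp1)

# Median polish fits log_pm = overall + probe effect + array effect + residual
fit <- medpolish(log_pm, trace.iter = FALSE)

# Expression index per array: overall level plus that array's column effect
index_medpolish <- fit$overall + fit$col

# For comparison, the simple column mean of the log intensities
index_mean <- colMeans(log_pm)

print(rbind(medpolish = index_medpolish, mean = index_mean))
```

Because median polish uses medians rather than means, a single aberrant probe has less influence on the index than it would on the column mean.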
> eset <- rma(rrdata)
trying URL 'http://bioconductor.org/packages/2.1/…
Content type 'application/zip' length 1352776 bytes (1.3 Mb)
opened URL
downloaded 1.3 Mb
package 'hgu95av2cdf' successfully unpacked and MD5 sums checked
The downloaded packages are in
C:\Documents and Settings\dmrocke\Local Settings…
updating HTML package descriptions
Background correcting
Normalizing
Calculating Expression
> class(eset)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
> dim(exprs(eset))
[1] 12625    12
> featureNames(eset)[1:5]
[1] "100_g_at"  "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"
> exprs(eset)[1:5,]
          LN0A.CEL LN0B.CEL LN1A.CEL LN1B.CEL LN2A.CEL LN2B.CEL LN3A.CEL
100_g_at  9.195937 9.388350 9.443115 9.012228 9.311773 9.386037 9.386089
1000_at   8.229724 7.790238 7.733320 7.864438 7.620704 7.930373 7.502759
1001_at   5.066185 5.057729 4.940588 4.839563 4.808808 5.195664 4.952883
1002_f_at 5.409422 5.472210 5.419907 5.343012 5.266068 5.442173 5.190440
1003_s_at 7.262739 7.323087 7.355976 7.221642 7.023408 7.165052 7.011527
          LN3B.CEL LN4A.CEL LN4B.CEL LN5A.CEL LN5B.CEL
100_g_at  9.394606 9.602404 9.711533 9.826789 9.645565
1000_at   7.463158 7.644588 7.497006 7.618449 7.710110
1001_at   4.871329 4.875907 4.853802 4.752610 4.834317
1002_f_at 5.200380 5.436028 5.310046 5.300938 5.427841
1003_s_at 7.185894 7.235551 7.292139 7.218818 7.253799
> summary(exprs(eset))
    LN0A.CEL         LN0B.CEL         LN1A.CEL         LN1B.CEL
 Min.   : 2.713   Min.   : 2.585   Min.   : 2.611   Min.   : 2.636
 1st Qu.: 4.478   1st Qu.: 4.449   1st Qu.: 4.458   1st Qu.: 4.477
 Median : 6.080   Median : 6.072   Median : 6.070   Median : 6.078
 Mean   : 6.120   Mean   : 6.124   Mean   : 6.120   Mean   : 6.128
 3rd Qu.: 7.443   3rd Qu.: 7.473   3rd Qu.: 7.467   3rd Qu.: 7.467
 Max.   :12.042   Max.   :12.146   Max.   :12.122   Max.   :11.889
    LN2A.CEL         LN2B.CEL         LN3A.CEL         LN3B.CEL
 Min.   : 2.598   Min.   : 2.717   Min.   : 2.633   Min.   : 2.622
 1st Qu.: 4.444   1st Qu.: 4.469   1st Qu.: 4.425   1st Qu.: 4.428
 Median : 6.008   Median : 6.058   Median : 6.017   Median : 6.028
 Mean   : 6.109   Mean   : 6.125   Mean   : 6.116   Mean   : 6.117
 3rd Qu.: 7.426   3rd Qu.: 7.422   3rd Qu.: 7.444   3rd Qu.: 7.459
 Max.   :13.135   Max.   :13.110   Max.   :13.106   Max.   :13.138
    LN4A.CEL         LN4B.CEL         LN5A.CEL         LN5B.CEL
 Min.   : 2.742   Min.   : 2.634   Min.   : 2.615   Min.   : 2.590
 1st Qu.: 4.468   1st Qu.: 4.433   1st Qu.: 4.448   1st Qu.: 4.487
 Median : 6.074   Median : 6.050   Median : 6.053   Median : 6.068
 Mean   : 6.122   Mean   : 6.120   Mean   : 6.121   Mean   : 6.123
 3rd Qu.: 7.460   3rd Qu.: 7.478   3rd Qu.: 7.477   3rd Qu.: 7.457
 Max.   :12.033   Max.   :12.162   Max.   :11.925   Max.   :11.952
Probe Sets not Genes
 It is unavoidable to refer to a probe set as
measuring a “gene”, but nevertheless it can be
deceptive
 The annotation of a probe set may be based on
homology with a gene of possibly known function
in a different organism
 Only relatively few probe sets correspond to
genes with known function and known structure
in the organism being studied