GeneExpression - Wellcome Trust Centre for Human Genetics

A Gene Expression Project
Richard Mott
Wellcome Trust Centre for Human
Genetics
Project Summary
• Gene Expression Analysis
– Comparison with phenotypes
– Gene Ontology Analysis
– Comparison between tissues
– Gene Coexpression Networks
• NB: There are many R packages available from CRAN
for gene expression and network analysis, which are
not covered in this lecture.
– You should explore them!
Gene expression datasets
• Hippocampus (460 mice),
• Liver and Lung (260 mice)
• 100 Phenotypes
• Mice are from a Heterogeneous Stock, from
164 families
Gene Expression data
• Gene expression measured on Illumina Mouse
arrays
– 47000 50-mer probes
– Approx 2 probes per gene
– Covariates (eg Sex, Family) also available
• > load("liver.exp.RData")
• > load("liver.cov.RData")
• > source("expression.tutorial.R")
Exploring Expression Data
> liver.median <- apply(liver.exp, 2, median )
> hist(liver.median, breaks=50)
> liver.subset <- liver.exp[,liver.median>6]
Sex Effects
• Which transcripts have different expression levels for the two
sexes?
– Use a T-test on each transcript
– The R apply() function speeds up the analysis
– First define a function tfunc that performs the T test and reports the
P-value
– tfunc <- function( X, GENDER ) {
tt <- t.test( X ~ GENDER );
return(tt$p.value) }
– Then compute the test for each transcript
– > sex.pvalue <- apply(liver.subset, 2,tfunc, liver.cov$GENDER )
– Then plot the distribution of p-values
– > hist( sex.pvalue, breaks=50)
– > sum(sex.pvalue<1.0e-5)
– [1] 78
Sex Effects
312/2796 (11%) of transcripts with median level > 6 have sex effects with P < 0.01
78/2796 (2%) of transcripts with median level > 6 have sex effects with P < 0.00001
Family effects (Heritability)
•
•
•
•
•
•
•
•
•
Which transcripts are affected by genetic background?
Use one-way ANOVA wrapped inside apply()
First define a function to return the p-value of the ANOVA:
anova.pvalue <- function( X, factor ) {
a <- anova(lm( X ~ factor))
return(a[1,5])
}
Then find the transcripts with high heritability
family.pvalue <- apply( liver.subset, 2,
anova.pvalue, liver.cov$Family )
Family Effects
18% of transcripts with median level >6 have heritability p-value < 0.01
0.2% of transcripts with median level >6 have heritability p-value < 0.00001
Body Weight
• We can find transcripts associated with body
weight in a similar fashion to family effects,
except that linear regression is used.
– Note that the direction of causality is no longer
certain, ie it is not clear whether variation in a
transcript is causative for variation in weight or
vice versa
> weight.pvalue <- apply( liver.subset, 2,
anova.pvalue, liver.cov$EndNormalBW )
> hist(weight.pvalue,breaks=50)
Body Weight
11% of transcripts with median levels > 6 are significant at P < 0.01
1.5% of transcripts with median levels > 6 are significant at P < 0.00001
What do the genes do?
• So far we have identified sets of genes which are associated
with sex, family and weight
• How can we characterise these genes ?
• One popular method is to test if the annotations of these
genes have unusual features.
• Annotations include:
– genome location
– protein domain architecture (eg from INTERPRO)
– gene function, where known (eg from GO)
• From a statistical perspective, is is importation that a
controlled vocabulary (ontology) is used to describe gene
functions.
– The analysis then does not have to understand any biology!!
The Gene Ontology (GO)
http://www.geneontology.org/
• GO associates a set of GO-terms with every gene,
describing aspects of the gene’s known function.
• It has become a very popular tools for automated
investigation of large sets of genes.
• But note that:
– GO is not complete, covering only biological
processes, cellular components and molecular
functions. Other ontologies are also important.
– many genes have no known function
GO annotation Examples
GO:0000001 mitochondrion inheritance
GO:0000002 mitochondrial genome maintenance
GO:0000003 reproduction
GO:0000005 ribosomal chaperone activity
GO:0000006 high affinity zinc uptake transporter activity
GO:0000007 low-affinity zinc ion transporter activity
GO:0000008 thioredoxin
GO:0000009 alpha-1,6-mannosyltransferase activity
GO:0000010 trans-hexaprenyltranstransferase activity
GO:0000011 vacuole inheritance
GO:0000012 single strand break repair
ENSMUSG00000061404 Olfr936 GO:0001584 GO:0016020 GO:0007600 GO:0007166 GO:0004930 GO:0031224 GO:0003674 GO:0005623 GO:0050896
GO:0016021 GO:0004888 GO:0004871 GO:0050877 GO:0007582 GO:0005575 GO:0007186 GO:0007608 GO:0007165 GO:0004872 GO:0007154
GO:0044464 GO:0044425 GO:0004984 GO:0007606 GO:0050874 GO:0009987 GO:0051869 GO:0008150
ENSMUSG00000030105 Arl8b GO:0016043 GO:0007046 GO:0051233 GO:0005737 GO:0016817 GO:0044424 GO:0048487 GO:0005515 GO:0016787
GO:0043014 GO:0005623 GO:0007028 GO:0044422 GO:0044237 GO:0007242 GO:0043228 GO:0005856 GO:0007582 GO:0008152 GO:0007165
GO:0015630 GO:0008092 GO:0019003 GO:0016462 GO:0005622 GO:0044464 GO:0007154 GO:0003824 GO:0006139 GO:0006364 GO:0005488
GO:0003924 GO:0043170 GO:0016818 GO:0019001 GO:0009987 GO:0005525 GO:0008150 GO:0017076 GO:0043229 GO:0006396 GO:0016072
GO:0007059 GO:0043232 GO:0050875 GO:0044430 GO:0043283 GO:0044446 GO:0030496 GO:0015631 GO:0003674 GO:0042254 GO:0007264
GO:0000166 GO:0005819 GO:0017111 GO:0044238 GO:0043226 GO:0016070 GO:0005575 GO:0006996
ENSMUSG00000042428 Mgat3 GO:0016020 GO:0008375 GO:0043413 GO:0005615 GO:0044421 GO:0005737 GO:0031224 GO:0044267 GO:0044424
GO:0008194 GO:0009058 GO:0005623 GO:0044422 GO:0044237 GO:0007582 GO:0008152 GO:0044425 GO:0005622 GO:0044464 GO:0003824
GO:0044431 GO:0003830 GO:0043227 GO:0043170 GO:0019538 GO:0006487 GO:0009059 GO:0006412 GO:0009987 GO:0016740 GO:0008150
GO:0043229 GO:0006486 GO:0009101 GO:0050875 GO:0016758 GO:0043283 GO:0044446 GO:0005795 GO:0003674 GO:0009100 GO:0016021
GO:0005576 GO:0044238 GO:0044249 GO:0043226 GO:0043231 GO:0005575 GO:0044260 GO:0016757 GO:0006464 GO:0005794 GO:0043412
GO:0044444
Testing for GO association
• Set of genes G is classified into two groups eg by sex
• A given GO annotation term classifies the genes into
two groups (present, absent)
• The data are a 2x2 contingency table classified by sex
and GO, and the test of GO/sex association can be
done either by a chi-squared test or by Fisher’s Exact
Test FET, or a generalised linear model with Poisson link
function.
• The most popular methods use the FET, which can be
calculated quickly using the hypergeometric
distribution, and is exact even when the counts of data
are small
Testing for GO association
• Read in a file of GO terms associated with each
Ensembl Mouse gene (this set has been reduced to
include only those GO terms present in more than 1%
of genes)
go1 <- read.delim("GO.Ensembl.01.txt")
> dim(go1)
[1] 19988
387
• Find the common transcripts between liver.subset and the annotations,
and those transcripts with sex p-values < 0.01
> intersect <colnames(liver.subset)[match(go1$transcript,
colnames(liver.subset), nomatch = 0)]
> intersect <- unique(sort(intersect))
Testing for GO association
> liver.subset.intersect <- liver.subset[, match(intersect,
colnames(liver.subset))]
> dim( liver.subset.intersect)
[1] 275 1650
> go.intersect <- go1[match(intersect,go1$transcript),]
> dim(go.intersect)
[1] 1650 388
> sex.ids <- colnames(liver.subset)[sex.pvalue<0.01]
> sex.intersect <- sex.ids[match(sex.ids,intersect,nomatch=0)]
> length(sex.intersect)
[1] 174
> sex.idx <- go.intersect$transcript %in% sex.ids
Testing for GO Association
using apply() and fisher.test()
•
fisher.func <- function( X, sex.idx) { X <- as.factor(X) ; if ( nlevels(X) == 2 )
{f <- fisher.test(X, sex.idx); return (f$p.value)} else return(1) }
•
> fish <- apply( go.intersect[,4:ncol(go.intersect)], 2, fisher.func, sex.idx )
•
•
> length(fish)
[1] 385
•
•
•
•
•
•
•
> fish[fish < 0.01]
GO.0000267
GO.0002376
GO.0003735
GO.0005624
GO.0005783
GO.0005840
5.255498e-03 6.142841e-04 9.096193e-03 4.108839e-04 1.153113e-03 9.125263e-03
GO.0006412
GO.0006955
GO.0009058
GO.0009059
GO.0016740
GO.0016788
9.852476e-05 4.726468e-05 7.243732e-03 4.532035e-05 3.915276e-03 4.224464e-03
GO.0030529
GO.0043170
GO.0043234
GO.0044249
GO.0044422
GO.0044446
2.250219e-03 2.157644e-03 5.039780e-04 2.347306e-04 2.360906e-03 2.360906e-03
•
•
<length(fish[fish < 0.01])
[1] 18
Significant GO terms
> data.frame( pvalue=fish[fish<0.01], desc=as.character(go2name$desc[go2name$go
%in% names(fish[fish<0.01])]))
pvalue
desc
GO.0000267 5.255498e-03
cell fraction
GO.0002376 6.142841e-04
immune system process
GO.0003735 9.096193e-03
structural constituent of ribosome
GO.0005624 4.108839e-04
membrane fraction
GO.0005783 1.153113e-03
endoplasmic reticulum
GO.0005840 9.125263e-03
ribosome
GO.0006412 9.852476e-05
protein biosynthesis
GO.0006955 4.726468e-05
immune response
GO.0009058 7.243732e-03
biosynthesis
GO.0009059 4.532035e-05
macromolecule biosynthesis
GO.0016740 3.915276e-03
transferase activity
GO.0016788 4.224464e-03 hydrolase activity, acting on ester bonds
GO.0030529 2.250219e-03
ribonucleoprotein complex
GO.0043170 2.157644e-03
macromolecule metabolism
GO.0043234 5.039780e-04
protein complex
GO.0044249 2.347306e-04
cellular biosynthesis
GO.0044422 2.360906e-03
organelle part
GO.0044446 2.360906e-03
intracellular organelle part
Lecture V
Comparing gene expression across
experments
• We sometimes want to compare expression levels
between experiments
– In different tissues in the same individuals
– At different time points or under different experimental
conditions with genetically identical individuals
• Inbred lines
• Cell lines
• Comparisons may be done
– At the level of the individual
– Across the population
Comparing gene expression across
tissues
• A common set of individuals are used for two tissues (eg liver, lung)
• Look for transcripts whose expression levels across samples between the
tissues are correlated
samples
Lung transcripts
Liver transcripts
An efficient way to do the computation in R
Liver transcripts
Process each column in the combined
matrix using apply()
samples
Lung transcripts
Comparing gene expression across
tissues
> load(“lung.liver.exp”) # CONTAINS lung and liver stacked vertically
> lung.liver <- c(rep( TRUE, 262), rep(FALSE, 262)
> lung.liver.cor.pvalue <- apply( lung.liver.exp, 2,
function( X, by ) { f <- cor.test(X[by], X[!by]); return f$pvalue },
by=lung.liver )
> hist(-log10(lung.liver.cor.pvalue),breaks=50)
> lung.liver.cor.logP <- -log10(lung.liver.cor.pvalue)
> sum( lung.liver.cor.logP>
[1] 0.0837884
> sum( lung.liver.cor.logP>
[1] 0.04374960
> sum( lung.liver.cor.logP>
[1] 0.02791541
> sum( lung.liver.cor.logP>
[1] 0.01906007
>
2)/47429
3)/47429
4)/47429
5)/47429
Distribution of P-values
GO associations of transcripts with correlated
expression between liver and lung
• General-purpose function for testing GO associations
– genes is a vector of the transcript names under consideration
– classification is a boolean vector the same length as genes, indicating which
transcripts are in the classification
– go is a matrix where go[transcript,Goterm] = TRUE if Goterm is attached
to transcript
– go2name is a data frame with the descriptions of the Go terms
> GO.analysis <- function( genes, classification, go, go2name ) {
intersect <- unique(sort(genes[match(go$transcript,genes,nomatch=0)]))
go.intersect <- go[match(intersect,go$transcript),]
classification.intersect <- classification[match(classification,intersect,nomatch=0)]
class.idx <- go.intersect$transcript %in% genes[classification]
fish <- apply( go.intersect[,4:ncol(go.intersect)], 2, fisher.func, class.idx)
fish <- fish[names(fish) %in% go2name$go]
r <- order(fish)
return( data.frame( p.value=fish, desc=go2name$desc[match(go2name$go,
names(fish2[r]), nomatch=0)]))
}
Many very significant GO terms
> fish <- GO.analysis( colnames(lung.liver.exp), lung.liver.cor.logP> 4, go1, go2name )
> head(fish$pvalue,n=20)
GO.0001584
2.577931e-18
GO.0043227
2.572202e-12
GO.0005737
2.648800e-08
GO.0005634
4.469984e-07
GO.0004930
GO.0004888
GO.0004984
GO.0007186
GO.0043231
3.011990e-17 3.477225e-14 2.396701e-13 1.138140e-12 1.838885e-12
GO.0044424
GO.0005622
GO.0007166
GO.0043229
GO.0043226
1.000009e-11 2.866695e-11 6.100367e-10 6.792908e-10 6.806115e-10
GO.0004872
GO.0004871
GO.0044444
GO.0007582
GO.0043170
3.289317e-08 5.339634e-08 7.805342e-08 9.355738e-08 1.049470e-07
GO.0050875
4.844296e-07
Biological Processes coordinated
between liver and lung
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
GO.0001584
GO.0004930
GO.0004888
GO.0004984
GO.0007186
GO.0043231
GO.0043227
GO.0044424
GO.0005622
GO.0007166
GO.0043229
GO.0043226
GO.0005737
GO.0004872
GO.0004871
2.577931e-18
3.011990e-17
3.477225e-14
2.396701e-13
1.138140e-12
1.838885e-12
2.572202e-12
1.000009e-11
2.866695e-11
6.100367e-10
6.792908e-10
6.806115e-10
2.648800e-08
3.289317e-08
5.339634e-08
GO.0001584
bud site selection
GO.0004930 activation of MAPKK activity during cell wall biogenesis
GO.0004888
C-8 sterol isomerase activity
GO.0004984
protein import into nucleus, translocation
GO.0007186
shmoo orientation
GO.0043231
nuclear interphase chromosome
GO.0043227
isoleucine/valine:sodium symporter activity
GO.0044424
cytoplasmic interphase chromosome
GO.0005622
spliceosomal catalysis
GO.0007166
inactivation of MAPK (pseudohyphal growth)
GO.0043229
second spliceosomal transesterification activity
GO.0043226
mitochondrion inheritance
GO.0005737
re-entry into mitotic cell cycle
GO.0004872
sulfite transport
GO.0004871
organellar small ribosomal subunit
>
Mapping expression Quantitative
Trait Loci (eQTL)
• eQTL are Genetic loci at which genotype variation is
associated with variation in a transcript
–
–
–
–
cis eQTL co-localise with the correaponding gene’s position
trans eQTL map to other locations in the genome
Many transcripts have cis eQTL as well as trans eQTL
hubs are loci that contain many trans eQTL, indicating sets
of genes under the same control
– Basic principle of mapping any quantitative trait is to
determine the association between the phenotype and
each genotyped SNP
– anova(lm( transcript ~ genotype))
cis eQTLs on mouse Chr 19
without SNPs
Tissue
Hippocampus
Liver
Lung
cis
trans
1996
408
3442
2774
3411
911
with SNPs
Hippocampus
Liver
Lung
733
209
424
368
426
450
A simple eQTL mapping
programme
Data: transcript is a matrix of transcript levels (columns), genotypes is a matrix of genotypes (columns)
The rows are the subjects, which must be identical across the two matrices
>load(“g19.RData”)
>load(“liver.19.exp.RData”)
> map.eqtl <- function( transcript, genotypes ) {
logP = -log10(apply( genotypes, 2, function( X, transcript ) { X <- as.factor(X) ; if ( nlevels(X) > 1
) { a <- anova(lm( transcript ~ X )); return(a[1,5])} else return(1)}, transcript))
return(logP)
}
> map.eqtls <- function( genotypes, transcripts ) {
genos <- data.frame(genotypes[,2:ncol(genotypes)])
logP <- apply( transcripts, 2, map.eqtl, genos )
rownames(logP) <- colnames(genos)
colnames(logP) <- colnames(transcripts)
return(logP)
}
logP <- map.eqtls( g19, liver.19.exp[,c(20346,29095)])
# map transcripts
29095 = "LIVER.express.scl074170.1_145.S"
20346 = "LIVER.express.scl0023972.1_63.S"
Example chromosome scans
LIVER.express.scl074170.1_145.S
LIVER.express.scl0023972.1_63.S
Different Types of Association
Analysis
•
•
SNP genotypes G take three possible values AA, AB, BB
Under an additive genetic model, the phenotypic effect of a genotype is
proportional to the number of A alleles (or equivalently B alleles)
– we recode the genotype as a number X <- additive(G)
• BB = 0,
• AB = 1
• AA = 2
•
Under a dominant genetic model, the phenotypic effect is zero unless an A allele is
present (or vice-versa)
– we recode the genotype as a number X <- dominance(G)
• BB = 0,
• AB = 1
• AA = 1
•
•
Under a full genetic mode, the phenotypic effect of each genotype is arbitrary
Note that a full model is the sum of the additive and dominance models
–
–
phenotype ~ G
phenotype ~ additive(G) + dominance(G)
R functions for converting full
genotypes to submodels
additive <- function( G ) {
G <- as.factor(G);
G <- reorder( G, sort(levels(G))
return(as.numeric(G)
}
dominance <- function(G) {
a <- additive(G)
return(ifelse(a>0,1,0))
}