Pathway‐Based Approach for Genetic Analysis of Gene Expression
Alex C. Lam
GeneSys inaugural meeting
1 Oct ‐ 3 Oct 2008
Edinburgh, UK
eQTL mapping
Rockman et al. Nat. Rev. Genet. 2006
Signals from a gene set
Type II error
• eQTL studies have great potential in dissecting complex traits, but…
– Genome scan + large number of traits
– Massive multiple testing
– Stringent threshold required
• Some real signals fail to reach the statistical significance threshold
• Can we use existing biological knowledge as a prior to guide our detection?
Outline of this presentation
• Idea of gene set testing
• The BXH/HXB rat inbred line dataset
• KEGG pathway
• Wilcoxon Test
• Fisher’s Exact Test
• Gene Ontology (GO)
• Results and summary
Gene set testing
• Consider differential expression (DE)
• Genes are classified as DE or non‐DE
• Suppose that genes can be categorized according to their gene function
• Null hypothesis: The extent of DE is identical across all functional gene sets
• If more DE genes came from a gene set than expected, the relationship could be interesting
Hypertensive rat eQTL study
• Hubner et al. (Nat. Genet. 2005)
• 30 recombinant inbred lines from a Spontaneously Hypertensive Rat × Brown Norway cross, at F60
• Rat Affymetrix GeneChip
• 2 tissues: kidney and fat
• 1,011 autosomal microsatellite markers
• Linkage analysis for each transcript
KEGG pathways
• Kyoto Encyclopedia of Genes and Genomes
Gene filtering
15,923 probesets
Remove controls and non‐expressed probesets
~13,000 probesets
Remove probesets with no EntrezGene entry. Also remove duplicates
~9,000 probesets
Remove probesets with no KEGG entry. Also remove KEGG sets with < 5 genes
Fat: 2185 genes, 152 KEGG pathways
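The last filtering step can be sketched in Python as follows. This is an illustration only, not the pipeline used in the study: the function name `filter_gene_sets` and the toy identifiers are hypothetical, and defining the gene universe as the union of the retained sets is an assumption, since the slide does not say how the universe was fixed.

```python
def filter_gene_sets(gene_sets, annotated_genes, min_size=5):
    """Drop gene sets that are too small after restricting to usable genes.

    gene_sets       : dict mapping a pathway ID to an iterable of gene IDs
    annotated_genes : genes that survived the earlier probeset filters
    min_size        : smallest set size to keep (5 on the slide)
    """
    usable = set(annotated_genes)
    kept = {}
    for pathway, genes in gene_sets.items():
        members = set(genes) & usable  # ignore genes that were filtered out
        if len(members) >= min_size:
            kept[pathway] = members
    # Assumption: the gene universe is every gene left in at least one set.
    universe = set().union(*kept.values())
    return kept, universe
```

An upper bound on set size, as used later for GO terms, could be added as a second parameter in the same way.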
Gene set testing along the genome
• Conventional eQTL analysis:
– For each gene expression phenotype, consider the linkage evidence (e.g. likelihood ratio test (LRT) statistics) along the genome at regular intervals (e.g. every 1 cM)
• Gene set testing:
– At regular intervals, consider the linkage evidence of multiple gene expression traits. The LRT statistics are the input to the test.
Method (1)
• Two‐sample Wilcoxon test
– Non‐parametric counterpart of the two‐sample “t‐test”
– Rank the LRT test statistics
– Test if the genes in the gene set rank higher than those not in the gene set
[Figure: genes in set A and genes not in set A, ordered by rank]
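The ranking idea of Method (1) can be sketched in Python for a single genome position. This is a minimal illustration, not the code used in the study: the names `midranks` and `wilcoxon_set_test` are hypothetical, and the p-value below uses a normal approximation without tie correction, whereas the talk assesses significance by re‐sampling.

```python
import math

def midranks(values):
    """Rank values from 1 (smallest) upwards; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def wilcoxon_set_test(lrt, in_set):
    """One-sided rank-sum test: do genes in the set rank higher than the rest?

    lrt    : LRT statistic of every gene at one genome position
    in_set : booleans, True where the gene belongs to the gene set
    Returns (W, z, p), where W is the rank sum of the set members.
    """
    ranks = midranks(lrt)
    n_in = sum(in_set)
    n_out = len(lrt) - n_in
    w = sum(r for r, member in zip(ranks, in_set) if member)
    mean_w = n_in * (n_in + n_out + 1) / 2
    var_w = n_in * n_out * (n_in + n_out + 1) / 12  # no tie correction
    z = (w - mean_w) / math.sqrt(var_w)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return w, z, p

# Toy example: the three set members hold the three largest statistics.
w, z, p = wilcoxon_set_test([1, 2, 3, 4, 5, 6, 7, 8],
                            [False] * 5 + [True] * 3)
print(w)  # 21.0, the rank sum 6 + 7 + 8
```

Repeating the call at every position along the genome gives one W profile per gene set.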
Permutation (1)
[Figure: observed statistic W for genes in set A; re‐sampling gives W1, …, W1000]
Ribosome pathway
Permutation (2)
[Figure: observed statistic W for genes in set A; re‐sampling gives W1, …, W1000]
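The slides do not spell out the two re‐sampling schemes, so the following Python sketch rests on stated assumptions. It follows the recommendation in the Summary and permutes at the subject level: genotype labels are shuffled across lines, which leaves the correlation between genes intact, and the set statistic W is recomputed each time. The function `linkage_stats` is a hypothetical stand‐in for the per‐gene LRT (a squared difference in genotype class means), and all names are illustrative.

```python
import random

def rank_sum(stats, in_set):
    """Rank sum W of the set members (rank 1 = smallest; ties broken by gene order)."""
    order = sorted(range(len(stats)), key=lambda i: stats[i])
    rank = {gene: r + 1 for r, gene in enumerate(order)}
    return sum(rank[i] for i, member in enumerate(in_set) if member)

def linkage_stats(expr, genotype):
    """Stand-in for the per-gene LRT: squared difference in mean expression
    between the two genotype classes (0 and 1) at one marker."""
    stats = []
    for row in expr:
        a = [x for x, g in zip(row, genotype) if g == 0]
        b = [x for x, g in zip(row, genotype) if g == 1]
        stats.append((sum(a) / len(a) - sum(b) / len(b)) ** 2)
    return stats

def permutation_test(expr, genotype, in_set, n_perm=1000, seed=1):
    """Empirical p-value for W, permuting subjects (lines) rather than genes.

    expr     : list of rows, one per gene, each with one value per line
    genotype : 0/1 genotype of every line at the marker being tested
    in_set   : booleans, True where the gene belongs to the gene set
    """
    rng = random.Random(seed)
    w_obs = rank_sum(linkage_stats(expr, genotype), in_set)
    shuffled = list(genotype)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign genotypes to lines
        if rank_sum(linkage_stats(expr, shuffled), in_set) >= w_obs:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return w_obs, (exceed + 1) / (n_perm + 1)
```

For a genome‐wise threshold, one common choice is to record the maximum statistic over all positions within each permutation and compare the observed value against that distribution; whether the study did exactly this is not stated on the slides.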
Results ‐ Wilcoxon Test
• 10 signals reached genome‐wise significance at the 5% level
• 2 examples below: most members of the gene set ranked highly at the locus
Ranks might not be meaningful
Point‐wise 5%
Method (2)
• Set a linkage threshold for individual genes
• Test for over‐representation of gene set
[Figure: overlap between gene set A and the genes with linkage detected, within the gene universe, shown for the expected case and for enrichment]
2 by 2 table representation
• One‐tailed Fisher’s Exact test
• P‐value 0.001 used as linkage threshold
                   Genes linked to eQTL   Genes not linked to eQTL
In gene set                 A                        B
Not in gene set             C                        D
A + B + C + D = all genes in gene universe
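The one‐tailed test on the 2 by 2 table can be written directly from the hypergeometric distribution. The sketch below is illustrative only: the function name `fisher_one_tailed` is hypothetical, and it assumes A counts genes in the set that are linked to the eQTL, B genes in the set that are not linked, C linked genes outside the set, and D the remainder. The result is unchanged if B and C are swapped.

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """P(at least `a` linked genes fall in the set) with all margins fixed.

    a : in gene set, linked         b : in gene set, not linked
    c : not in gene set, linked     d : not in gene set, not linked
    """
    n = a + b + c + d          # size of the gene universe
    set_size = a + b           # genes in the gene set
    linked = a + c             # genes passing the linkage threshold
    largest = min(set_size, linked)
    favourable = sum(
        comb(set_size, k) * comb(n - set_size, linked - k)
        for k in range(a, largest + 1)
    )
    return favourable / comb(n, linked)

# Toy table: 3 of the 4 linked genes fall in a set of 4, universe of 8 genes.
print(fisher_one_tailed(3, 1, 1, 3))  # 17/70, about 0.243
```

A small p‐value means the gene set contains more linked genes at that position than its share of the universe would predict.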
Results ‐ Fisher’s Exact Test
KEGG ID | Chr | cM | Minimum genomewise P-value | Gene set size | No. of genes with significant statistic * | KEGG pathway name
05020 | 1 | 143 | 0.013 | 12 | 3 (23) | Parkinson's disease
04360 | 3 | 210-211 | 0.015 | 70 | 5 (8) | Axon guidance
00630 | 5 | 71-72 | 0.030 | 7 | 3 (8) | Glyoxylate and dicarboxylate metabolism
03022 | 12 | 18 | 0.043 | 15 | 2 (3) | Basal transcription factors
00260 | 19 | 35-36 | 0.041 | 24 | 3 (10) | Glycine, serine and threonine metabolism
04514 | 20 | 1-6 | 0.007 | 81 | 9 (14) | Cell adhesion molecules (CAMs)
04612 | 20 | 1-5 | 0.003 | 46 | 10 (14) | Antigen processing and presentation
04940 | 20 | 2-5 | 0.003 | 34 | 9 (14) | Type I diabetes mellitus
local
Results ‐ Fisher’s Exact Test
• Chromosome 20 signals came from MHC genes; likely to be artefacts
• Other signals tend to come from a small number of genes; robustness?
• Overall, none of the signals are very convincing
• Too many eQTL discarded; KEGG coverage is quite low
Gene Ontology (GO)
• An alternative to KEGG for creating gene sets
• Describes the functions of gene products
• More genes are annotated in GO than in KEGG
– After removing GO terms with > 100 genes or < 10 genes, 5893 genes are retained for analysis (from 1676 GO terms)
Results ‐ Fisher’s Exact Test
• 13 signals
• LRT statistic threshold P < 0.001
• Examples:
– Chr 17 potassium channel activity (19 / 61)
– Chr 17 oxidoreductase activity (13 / 28)
– Chr 5 immunological synapse (6 / 14)
• Should these genes be looked at more closely?
Potentials
• Highlight putative effects that are moderate in size
• Start more in‐depth in‐silico analyses
– Correlation
– Interaction
• Generate new hypotheses for future study
Caveats
• Analysis is only as good as the annotation
– Incomplete?
– Incorrect?
• The number of genes included has a strong influence
– Repeatable in a larger study?
• Arbitrary threshold for Fisher’s Exact Test
• Multiple testing of gene sets
Summary
• Gene set testing can identify interesting “multi‐trait eQTL”
• Permutation should be carried out at the subject level
• Wilcoxon Test picks up a lot of noise
• The size of the gene universe is important
• Currently, GO has a higher coverage than KEGG
Acknowledgements
– Edinburgh
• DJ de Koning
• Chris Haley
– MRC Clinical Sciences Centre / Imperial
• Tim Aitman
• Enrico Petretto
– Aarhus
• Peter Sørensen
• Funding
– BBSRC
– Genesis Faraday
– Genus / PIC