Pathway-Based Approach for Genetic Analysis of Gene Expression
Alex C. Lam
GeneSys inaugural meeting
1 Oct - 3 Oct 2008
Edinburgh, UK

eQTL mapping
[Figure: Rockman et al. Nat. Rev. Genet. 2006]

Signals from a gene set

Type II error
• eQTL studies have great potential in dissecting complex traits, but...
  – Genome scan + large number of traits
  – Massive multiple testing
  – Stringent threshold required
• Some real signals fail to reach the statistical significance threshold
• Can we use existing biological knowledge as a prior to guide detection?

Outline of this presentation
• Idea of gene set testing
• The BXH/HXB rat inbred line dataset
• KEGG pathway
• Wilcoxon Test
• Fisher's Exact Test
• Gene Ontology (GO)
• Results and summary

Gene set testing
• Consider differential expression (DE)
• Genes are classified as DE or non-DE
• Suppose that genes can be categorized according to their function
• Null hypothesis: the extent of DE is identical across all functional gene sets
• If more DE genes come from a gene set than expected, the relationship could be interesting

Hypertensive rat eQTL study
• Hubner et al. (Nat. Genet. 2005)
• 30 Recombinant Inbred Lines of Spontaneously Hypertensive Rat / Brown Norway at F60
• Rat Affymetrix GeneChip
• 2 tissues: kidney and fat
• 1,011 autosomal microsatellite markers
• Linkage analysis for each transcript

KEGG pathways
• Kyoto Encyclopedia of Genes and Genomes

Gene filtering
• 15,923 probesets
  – Remove controls and non-expressed probesets
• ~13,000 probesets
  – Remove probesets with no EntrezGene entry. Also remove duplicates
• ~9,000 probesets
  – Remove probesets with no KEGG entry. Also remove KEGG sets with < 5 genes
• Fat: 2185 genes, 152 KEGG pathways

Gene set testing along the genome
• Conventional eQTL analysis:
  – For each gene expression phenotype, consider the linkage evidence (e.g. Likelihood Ratio Test statistics) along the genome at regular intervals (e.g. every 1 cM)
• Gene set testing:
  – At regular intervals, consider the linkage evidence of multiple gene expression traits. The LRT statistics are the input of the test.

Method (1)
• Two-sample Wilcoxon test
  – Non-parametric version of the "t-test"
  – Rank the LRT test statistics
  – Test if the genes in the gene set rank higher than those not in the gene set
[Figure: genes in set A and genes not in set A placed along a RANK axis]

Permutation (1)
[Figure: observed W for genes in set A; re-sampling gives W1 ... W1000]

Ribosome pathway

Permutation (2)
[Figure: observed W for genes in set A; re-sampling gives W1 ... W1000]

Results - Wilcoxon Test
• 10 signals at < 5% genome-wise significance
• 2 examples below: most members of the gene set ranked highly at the locus

Ranks might not be meaningful
[Figure; label: Point-wise 5%]

Method (2)
• Set a linkage threshold for individual genes
• Test for over-representation of gene set
[Figure: Expected vs. Enrichment; gene set A and genes with linkage detected, within the gene universe]

2 by 2 table representation
• One-tailed Fisher's Exact test
• P-value 0.001 used as linkage threshold

                          | In gene set | Not in gene set
Genes linked to eQTL      | A           | B
Genes not linked to eQTL  | C           | D

A + B + C + D = all genes in gene universe

Results - Fisher's Exact Test

KEGG ID | Chr | cM      | Minimum genome-wise P-value | Gene set size | No. of genes with significant statistic * | KEGG pathway name
05020   | 1   | 143     | 0.013                       | 12            | 3 (23)                                    | Parkinson's disease
04360   | 3   | 210-211 | 0.015                       | 70            | 5 (8)                                     | Axon guidance
00630   | 5   | 71-72   | 0.030                       | 7             | 3 (8)                                     | Glyoxylate and dicarboxylate metabolism
03022   | 12  | 18      | 0.043                       | 15            | 2 (3)                                     | Basal transcription factors
00260   | 19  | 35-36   | 0.041                       | 24            | 3 (10)                                    | Glycine, serine and threonine metabolism
04514   | 20  | 1-6     | 0.007                       | 81            | 9 (14)                                    | Cell adhesion molecules (CAMs)
04612   | 20  | 1-5     | 0.003                       | 46            | 10 (14)                                   | Antigen processing and presentation
04940   | 20  | 2-5     | 0.003                       | 34            | 9 (14)                                    | Type I diabetes mellitus

* local

Results - Fisher's Exact Test
• Chromosome 20 signals came from MHC genes; likely to be artefacts
• Other signals tend to come from a small number of genes; robustness?
• Overall, none of the signals are very convincing
• Too many eQTL discarded; KEGG coverage is quite low

Gene Ontology (GO)
• An alternative to KEGG for creating gene sets
• Describes the functions of gene products
• More genes are annotated in GO than in KEGG
  – After removing GO terms with > 100 genes or < 10 genes, 5893 genes are retained for analysis (from 1676 GO terms)

Results - Fisher's Exact Test
• 13 signals
• LRT statistic threshold P < 0.001
• Examples:
  – Chr 17 potassium channel activity (19 / 61)
  – Chr 17 oxidoreductase activity (13 / 28)
  – Chr 5 immunological synapse (6 / 14)
• Should these genes be looked at more closely?

Potentials
• Highlight putative effects that are moderate in size
• Start more in-depth in-silico analyses
  – Correlation
  – Interaction
• Generate new hypotheses for future study

Caveats
• Analysis is only as good as the annotation
  – Incomplete?
  – Incorrect?
• The number of genes included has a strong influence
  – Repeatable in a larger study?
• Arbitrary threshold for Fisher's Exact Test
• Multiple testing of gene sets

Summary
• Gene set testing can identify interesting "multi-trait eQTL"
• Permutation should be carried out at the subject level
• Wilcoxon Test picks up a lot of noise
• The size of the gene universe is important
• Currently, GO has a higher coverage than KEGG

Acknowledgements
– Edinburgh
  • DJ de Koning
  • Chris Haley
– MRC Clinical Sciences Centre / Imperial
  • Tim Aitman
  • Enrico Petretto
– Aarhus
  • Peter Sørensen
• Funding
  – BBSRC
  – Genesis Faraday
  – Genus / PIC
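
Appendix sketch: one-tailed Fisher's Exact Test on the 2 by 2 table

The over-representation test of Method (2) can be written out in a few lines. The sketch below is an illustration only, not the code used for the analysis in these slides. The function name `fisher_one_tailed` and the counts in the usage example are hypothetical. It assumes `a` counts genes that are in the gene set and linked to the eQTL, `d` counts genes that are neither, and `b` and `c` are the two off-diagonal cells; swapping `b` and `c` does not change the result.

```python
from math import comb


def fisher_one_tailed(a: int, b: int, c: int, d: int) -> float:
    """One-tailed (enrichment) Fisher's exact P-value for a 2 by 2 table.

    a: genes in the gene set AND linked to the eQTL
    b: genes outside the gene set but linked
    c: genes in the gene set but not linked
    d: genes outside the gene set and not linked
    Returns P(X >= a) under the hypergeometric null, i.e. the chance of
    seeing at least `a` linked genes in the set if linkage were unrelated
    to set membership.
    """
    n_set = a + c            # size of the gene set
    n_linked = a + b         # genes passing the linkage threshold
    n_total = a + b + c + d  # gene universe
    upper = min(n_set, n_linked)
    numerator = sum(
        comb(n_set, k) * comb(n_total - n_set, n_linked - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(n_total, n_linked)


# Hypothetical counts (not taken from the results table):
# a gene set of 20 genes, 6 of them linked, in a universe of 2000 genes
# of which 100 are linked in total.
p_value = fisher_one_tailed(6, 94, 14, 1886)
print(f"one-tailed P = {p_value:.3g}")
```

Because the table must sum to the gene universe, shrinking the universe (for example through low annotation coverage) changes every cell except `a`, which is one way to see why the size of the gene universe matters.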
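
Appendix sketch: rank-sum statistic W with re-sampling

Method (1) ranks the LRT statistics at a locus and asks whether genes in the set rank higher than the rest. The sketch below computes the rank-sum statistic W and an empirical P-value from 1000 re-samples. It is a simplified illustration under stated assumptions: the function names and the toy LRT values are hypothetical, and the re-sampling shown shuffles gene-set labels only. The summary recommends permuting at the subject level, which would require re-running the linkage analysis on permuted subjects and is not shown here.

```python
import random


def rank_sum(values, in_set):
    """Sum of ascending ranks of the values flagged in `in_set`.

    Tied values share the average of the ranks they occupy.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        i = j + 1
    return sum(r for r, flag in zip(ranks, in_set) if flag)


def resampling_p_value(values, in_set, n_resamples=1000, seed=0):
    """Empirical one-sided P-value for W by shuffling gene-set labels.

    NOTE: label shuffling is an assumption made for this sketch; it is not
    the subject-level permutation recommended in the summary.
    """
    rng = random.Random(seed)
    observed = rank_sum(values, in_set)
    labels = list(in_set)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(labels)
        if rank_sum(values, labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_resamples + 1)


# Toy LRT statistics at one locus (hypothetical values).
lrt = [1.2, 8.5, 0.4, 6.1, 2.3, 7.7, 0.9, 3.0]
member = [False, True, False, True, False, True, False, False]
w, p = resampling_p_value(lrt, member)
print(f"W = {w}, empirical P = {p:.3f}")
```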