Functional genomics approaches to disease genomics
• Biological information and organisation
• Genomics approaches to identifying disease-relevant enrichment
• Candidate gene approaches

Biological information increases rapidly
• Every day hundreds of articles are published
– We can’t read them all
– We can’t remember them all
– Our memories are subjective anyway
• To make use of this incredible research output, we need ways to bring this information together and summarise it
• If we can make it readable by a computer, our power to use it increases hugely

OMIM
Home page: http://www.ncbi.nlm.nih.gov/omim/
• Online Mendelian Inheritance in Man (OMIM) is a catalog of human genes and genetic disorders, with links to literature references, sequence records, maps, and related databases
• Annotates 325 genes associated with human disease
• 2,710 disorders with a known molecular basis
• 1,634 genetic disorders with an unknown basis
• The OMIM entries are made by experienced annotators
– Even the best annotators are not wholly consistent

What is Ontology?
[Timeline figure: 1606, 1700s]
• Dictionary: A branch of metaphysics concerned with the nature and relations of being.
• Barry Smith: The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality.
(Slide from the GO website, www.geneontology.org)

Ontologies
• Formalising our knowledge into a structured and defined vocabulary is essential for genomics approaches
• An agreed language enables rapid progress (e.g. species classification)
• Recently, biological research communities have been defining a common language for describing everything from protein function through to phenotype

From a practical view, an ontology is the representation of something we know about. “Ontologies” consist of a representation of things that are detectable or directly observable, and the relationships between those things.
(Slide taken from GO, www.geneontology.org)

Gene Ontology (GO)
• The Gene Ontology project was set up to provide a controlled vocabulary that describes genes and their products (principally the products)
• GO describes genes in 3 separate ontologies
– Molecular function, biological process and cellular component
– Genes can be annotated with many terms in each category

GO Molecular Function
GO term: Malate dehydrogenase
GO id: GO:0030060
Reaction: (S)-malate + NAD(+) = oxaloacetate + NADH
[Figure: structural diagram of the reaction, with NAD+ converted to NADH + H+]

GO Biological Process
GO term: tricarboxylic acid cycle
Synonym: Krebs cycle
Synonym: citric acid cycle
GO id: GO:0006099

GO Cellular Component
GO term: mitochondrion
GO id: GO:0005739
Definition: A semiautonomous, self-replicating organelle that occurs in varying numbers, shapes, and sizes in the cytoplasm of virtually all eukaryotic cells. It is notably the site of tissue respiration.

GO Biological Process
• Directed Acyclic Graph (DAG)
• Allows a child node to have more than one parent
[Figure: DAG of is_a relationships between the terms Physiological Process, Metabolism, Primary Metabolism, Biosynthesis, Protein Metabolism and Protein Biosynthesis]

Mammalian Phenotype Ontology (MPO)
• Really the mouse phenotype ontology
• Annotators take each published mouse gene knock-out experiment and annotate the phenotype with the MPO

Human Medical Ontologies
• Human Phenotype Ontology (HPO), www.human-phenotype-ontology.org
– The HPO provides a standardized vocabulary of phenotypic abnormalities encountered in human genetic syndromes
– [Figure: HPO hierarchy: Organ abnormality, Cardiovascular abnormality, Cardiac abnormality, Cardiac malformation, Abnormality of the cardiac atria, Abnormality of the cardiac septa]
• London Dysmorphology Database
– [Figure: example terms: Cranium, general abnormalities (Brachycephaly, Microcephaly); Neurology, Mental cognitive function (Intellectual disability)]

Model Organisms
• Excellent functional genomics resources
– The comparison between a human phenotype and a mouse phenotype is often very readily interpretable
– Other useful organisms include the fly, the worm and even yeast
• Useful as they have well-curated data for many genes

Kyoto Encyclopaedia of Genes and Genomes (KEGG)
• Pathway database
• Manually curated information from the literature

High-throughput functional resources
• Tissue expression
– Where and when genes are expressed may be relevant to the disease
• Interactions
– Genes that interact may be involved in the same biological process
– E.g. protein-protein interactions or genetic interactions (coordinated regulation)
• Sequence patterns (coding or regulatory)
– Similar sequence can imply common functionality

Different data sources have different types of error
• Literature sources (GO, model organism data, etc.) have poor coverage and a lack of true negatives
– We publish “A is an X” more than “A is not a Y”
– Not all genes have been subject to the same studies
• High-throughput sources often have high error rates
– False positives are particularly a problem for gene/protein interactions when you’re considering all pairs

The value of mouse phenotypic data
[Figure: ability to predict Human Phenotype Ontology terms]

Forming interesting gene sets
• If you can’t identify a single gene/locus, maybe you can form a subset of genes likely to contain the gene(s) of interest
– Genes in large intervals identified by linkage studies
– Genes near SNPs with low, but not genome-wide significant, p-values from GWAS
– Genes in de novo or rare CNVs seen in cases
• Power is important
– Bringing together many similar cases enriches for genes associated with that disease

Testing for enrichments
• Compare to the genome
– Pulling balls (genes) from a bag (genome) is sampling without replacement, which follows the hypergeometric distribution
• Compare to controls
– If chosen well, controls may account for biases
– Contingency tables, chi-squared tests
– If controls are unavailable, you can randomise to help address potential biases like gene length and function

Rare de novo copy number variant (CNV) associated with learning disability
[Figure: a 2.8 Mb CNV]
• How does this CNV relate to the etiology of the disease?
• Which gene(s) underlie the phenotype?

Rare de novo CNVs are frequent in learning disability (LD)
• Rare de novo CNVs > 100 kb are present in ~10% of LD cases
• Occur all over the genome
• 80% unique, non-recurrent
Collect a list of 148 rare de novo CNVs

CNVs are common in all people
• Apparently benign, mostly inherited CNVs occur all over the genome
Collect a list of 26,472 benign CNVs
(Redon et al., Nature 2006)

Mutations at different loci can give a similar phenotype
[Figure: several loci leading to the same symptom/phenotype]

Method
[Figure: workflow: interesting intervals in patients; human genes; orthology; mouse genes; available mouse KO phenotypes; significantly overrepresented phenotype; mouse models relevant to the human disorder; disease phenotype]

Significant enrichments of genes associated with particular mouse phenotypes within de novo CNVs identified in patients with intellectual disability
[Figure: % change over expected for the Nervous System category, Abnormal axon morphology and Abnormal dopaminergic neuron morphology; series: benign CNVs, all LD CNVs, loss LD CNVs, and LD CNVs minus benign CNVs; * marks FDR < 5%]

Human brain-specific genes corroborate mouse findings
[Figure: % change over expected for brain-specific genes; series: benign CNVs, all LD CNVs, all LD CNVs minus benign CNVs, loss LD CNVs, loss LD CNVs minus benign CNVs; * marks significant enrichments]
• “Brain-specific” genes are defined as those whose expression in human whole brain is > 4 x the median expression across all other tissues
• This definition classifies ~3.75% of human genes as “brain-specific”

Autism Spectrum Disorders: the ‘triad’ of symptoms
• Impaired social interaction
• Restrictive, repetitive behaviours and interests
• Impaired communication
(Source: Autism.org.uk)

Behavioural model phenotypes associated with Autism Spectrum Disorder (ASD) de novo CNVs
[Figures: enriched mouse behavioural phenotypes, labelled with the related ASD traits]
• “Difficulty processing and retaining verbal information”
• “Difficulty understanding social language”
• “Difficulty coping with changes in routine”
• “Difficulty with empathy and friendships”
• “Restricted and Repetitive Behaviours and Interests”
• 60-80% of individuals with ASD exhibit poor motor planning and coordination

Candidate genes
• The genes that constitute significant enrichments become candidate disease genes
• While the enrichment is significantly associated with the intervals, the individual genes are not, and each requires further proof individually
• Experimental follow-up is costly, so the genes taken forward need to be considered carefully

Annotations vary in coverage and specificity
[Figure: three panels (% change over expected; number of candidate genes; % of CNVs with a candidate gene) for the annotation sets GO Transcription, Brain-specific, KEGG Neuro, KEGG Parkinson’s and Mouse phenotypes Abnormal Axon/Neuron]

The better the patients are classified, the more power we have to identify enrichments
• 6 of 148 LD patients have a cleft palate
[Figure: % change over expected in three panels: Tremor phenotype (patients +/- seizures); Enrichment for KO phenotype cleft palate (benign CNVs, LD CNVs in the 6 patients with cleft palate, the 142 without cleft palate); Abnormal myelination phenotype (patients +/- brain abnormality); * marks significant enrichments]
• Some associations found for the main cohort may be more relevant to associated, or co-occurring, symptoms
– ASD

Mutation databases are a rich source of discovery: DECIPHER
• DECIPHER is a database that holds genetic information about patients who present with congenital abnormalities
[Figure: Proband 1, Proband 2 and Proband 3 share a very similar phenotype, pointing to a single gene]

DECIPHER patients are annotated with London Medical Database terms
[Figure: three term levels (Level 1, Level 2, Level 3); examples: Cranium, general abnormalities (Brachycephaly, Microcephaly); Neurology, Mental cognitive function (Intellectual disability)]

Formed groups of CNVs associated with each human phenotype
[Figure: workflow for the groups Cranium, General abnormalities, Brachycephaly and Microcephaly: CNVs are grouped by phenotype, ENSEMBL genes are assigned to the CNVs, and copy number variable genes observed in healthy individuals are removed. CNV counts shown: 7, 11, 114, 121, 18, 132. Gene counts shown: 692, 3320, 3036, 633, 3030, 2767]

Many enrichments are readily interpretable
[Figure: % enrichment for All, Gain and Loss CNVs; * marks statistical significance, FDR < 0.05]
• Human symptoms shown: Short Stature, Prenatal Onset; Cupid bow shape of mouth; Malocclusion; Syndactyly of toes
• Mouse phenotypes shown: Decreased Fetal Size; Abnormal Palate Development; Malocclusion; Syndactyly

Others identify less obvious relationships
[Figure: % enrichment for All, Gain and Loss CNVs; * marks statistical significance, FDR < 0.05]
• Human symptoms shown: Psychotic Behaviour; Complex Partial Seizures
• Mouse phenotypes shown: Abnormal pre-pulse inhibition; Abnormal circadian rhythm

Mutations can be dissected to identify the contributions of individual genes
• Patient id: 248772
– ATG7, OXTR, ATP2B2: intellectual disability/developmental delay candidate genes
– FANCD2: short stature, prenatal onset candidate gene
• Patient id: 785
– SNX2: mental retardation/developmental delay candidate gene
– FBN2: camptodactyly candidate gene

Gene set enrichment analysis (Aravind Subramanian et al., 2005)
• Start with some ranked list of genes
– Genes ranked by expression in cases vs controls (microarrays)
– Genes ranked by nearby SNP p-values
• Score genes + or – according to some property
• Ask: are genes with this property more concentrated towards the top of this list than I would expect by chance?

Gene Prioritisation for disease
• Given a list of genes, which are most likely to be involved in this disease?
• We just want a ranking, not a significant association
• Commonly employed approaches involve supervised learning methodologies
– Collect data points from one or more sources
– Take a “Gold Standard” set of genes for this disease
– Train a method using known true positives (and true negatives if known)
– Given a list of genes, which ones “look” most similar to the known disease genes?

Linkage networks can infer missing values: “guilt by association”
(From PMID 19728866)

Linkage network for human disorders using the Human Phenotype Ontology (PMID 18950739)

Conserved co-expression of disease genes (Ala et al., PLoS Genetics 2008)
• 850 OMIM entries where a phenotype was mapped to a locus but the specific genes were unknown
• Used conserved human-mouse co-expression data, as other interaction or pathway data can bias towards well-studied genes
• Generated single-species gene co-expression networks
– Calculated Pearson’s correlation coefficient between all pairs of gene expression profiles
– Formed a network edge if the expression correlation of 2 genes was in the top 1% for either gene
• Clustered OMIM phenotypes using MimMiner
– A text-mining tool
• Using this methodology, they were able to predict 321 candidates across 81 disease-associated loci at an FDR of < 10%

Human phenome-interactome network for predicting disease candidate genes (Lage et al., Nature Biotech. 2007)
• 2 data networks
– Phenotypic similarity, based on detecting words that are common to two phenotype descriptions and do not occur frequently among all phenotype descriptions
– Human interactome, consisting of several large human sets and sets transferred from model organisms, weighted according to observation frequency
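The phenotypic-similarity idea above (shared words count, but only if they are rare across all descriptions) might be sketched as follows. This is a minimal illustration, not the scoring used by Lage et al.: the toy descriptions, the function names and the log-rarity weighting are all assumptions chosen for clarity.

```python
import math
import re
from collections import Counter

# Hypothetical toy phenotype descriptions (not taken from OMIM or any database).
DESCRIPTIONS = {
    "d1": "intellectual disability with microcephaly and seizures",
    "d2": "microcephaly with seizures and short stature",
    "d3": "cleft palate with short stature",
    "d4": "intellectual disability with cleft palate",
}


def tokenize(text):
    """Return the set of lowercase words in a description."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rarity_weights(token_sets):
    """Weight each word by log(N / number of descriptions containing it).

    A word present in every description gets weight 0, so it cannot
    contribute to similarity. This weighting is an assumption of the sketch.
    """
    n = len(token_sets)
    doc_freq = Counter(word for words in token_sets for word in words)
    return {word: math.log(n / df) for word, df in doc_freq.items()}


def weighted_overlap(a, b, weights):
    """Rarity-weighted share of words common to both descriptions."""
    total = sum(weights[w] for w in a | b)
    if total == 0:
        return 0.0
    return sum(weights[w] for w in a & b) / total


TOKENS = {name: tokenize(text) for name, text in DESCRIPTIONS.items()}
WEIGHTS = rarity_weights(TOKENS.values())

if __name__ == "__main__":
    for other in ("d2", "d3", "d4"):
        score = weighted_overlap(TOKENS["d1"], TOKENS[other], WEIGHTS)
        print(f"d1 vs {other}: {score:.3f}")
```

In this toy set the word "with" occurs in every description, so it carries no weight: d1 and d3 share only that word and score zero, while d1 and d2 share several rarer words and score higher.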
(1) A given positional candidate is queried for high-scoring interaction partners (“virtual pull-down”). These interaction partners make up the candidate complex.
(2) Proteins known to be involved in disease are identified in the candidate complex, and pairwise scores are assigned for the phenotypic overlap between the diseases of these proteins and the candidate phenotype.
(3) Based on the phenotypes represented in the candidate complex, a Bayesian predictor awards a probability to the candidate in the complex. The score is used to form the ranking.
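The three steps can be illustrated with a deliberately simplified sketch. It is not the Bayesian predictor of Lage et al.: as a stand-in, each candidate's score is assumed to be the sum, over interaction partners with a known disease, of interaction score times phenotypic similarity to the query phenotype. All identifiers and numbers are hypothetical.

```python
# Hypothetical inputs: every name and number here is illustrative only.
# Step 1: interaction partners ("virtual pull-down") with interaction scores.
INTERACTIONS = {
    "CAND1": {"P1": 0.9, "P2": 0.4},
    "CAND2": {"P3": 0.8},
    "CAND3": {},
}
# Step 2: partners already known to be involved in a disease.
KNOWN_DISEASE = {"P1": "disease_x", "P3": "disease_y"}
# Phenotypic overlap between each known disease and the query phenotype.
SIMILARITY_TO_QUERY = {"disease_x": 0.7, "disease_y": 0.2}


def complex_score(partners, known_disease, similarity_to_query):
    """Sum interaction score x phenotypic similarity over disease partners.

    A simple additive stand-in for the Bayesian predictor in step 3.
    """
    score = 0.0
    for protein, interaction_score in partners.items():
        disease = known_disease.get(protein)
        if disease is not None:
            score += interaction_score * similarity_to_query.get(disease, 0.0)
    return score


def rank_candidates(interactions, known_disease, similarity_to_query):
    """Return (candidate, score) pairs sorted from highest to lowest score."""
    scores = {
        candidate: complex_score(partners, known_disease, similarity_to_query)
        for candidate, partners in interactions.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


RANKING = rank_candidates(INTERACTIONS, KNOWN_DISEASE, SIMILARITY_TO_QUERY)

if __name__ == "__main__":
    for candidate, score in RANKING:
        print(f"{candidate}: {score:.2f}")
```

A candidate with no disease-associated partners (CAND3 here) scores zero and falls to the bottom of the list, which reflects the coverage limitation of literature-derived data noted earlier in the slides.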