WP4: Variation (haplotypes and SNPs)

Introduction

Genotype data from our mouse Quantitative Trait Locus (QTL) mapping experiment became available in March 2005, augmenting the phenotype data we had collected previously. We have spent much of the past 9 months analysing the data from the population of 2000 Heterogeneous Stock (HS) mice. We have mapped over 100 phenotypes (behavioural, physiological, asthma-related and diabetes-related; see www.well.ox.ac.uk/mouse/HS for a listing) across build 34 of the mouse genome using 13000 SNPs spaced 200 kb apart. We identified over 700 QTL at a genome-wide 5% significance threshold. QTL mapping was performed using the R HAPPY package (www.well.ox.ac.uk/happy, Mott et al 2000), which uses a multipoint hidden Markov formulation to map the QTL. The mean 50% confidence interval for each QTL is about 2 Mb, and the mean 95% confidence interval about 4 Mb. These intervals are small enough that it is possible to hunt for the causative genes.

Along with the phenotype data, we collected a large number of covariates: auxiliary observations such as the ID of the experimenter, the time and date of the experiment, the body weight of the animal, etc. One of our most exciting findings has been the very large number of interactions between genotype and environment, which we have been able to quantify systematically for the first time (Valdar et al 2006a). We have also been able to map QTL responsible for the gene by environment interactions. Strikingly, we find that these QTL generally do not overlap with the QTL identified in the genome scans that ignore gene by environment interaction. Some of the results are already in press (Solberg et al, 2006), submitted (Valdar et al) or in preparation (Valdar et al 2006b).

GSCANDB

We have written gscandb, a relational database (implemented in MySQL) and a web interface (in Perl CGI) for viewing the results of genome scans. This corresponds to BioSapiens milestone 4.5.
The database stores information about the population under study, the phenotypes measured on the subjects and the results of the genome scans. Multiple analyses of the same scan (e.g. under different genetic models such as additive or dominance effects) can be stored, as can results against multiple genome builds. Using the viewer it is possible to browse the data, viewing as many phenotypes as desired, either on a genome-wide scale or within a chromosomal region. The user can select which phenotypes and which types of scan to display. The user can also specify a chromosomal region by its base pair coordinates, or jump to a particular gene or SNP. The browser automatically links to Ensembl and UCSC to draw in gene annotation information for display. We have found this viewer essential for our exploration of this challenging and important data set. At present the database contains mouse QTL data but it is extendable to other species. The browser interface is at http://zeon.well.ox.ac.uk/rmott-bin/wwwqtl.cgi

Examples

To log in, select the project “HS_mouse” and enter the password “socksbeware”. Once logged in, the interface appears as follows:

[Figure: the gscandb selection interface]

Using this interface, select the genome build required (the default, “Mus Musculus 34”, is correct), the phenotypes of interest from the scrolling list titled “Plottable Scans” (more than one scan can be selected), the type of scan from “Scan Types”, and the type of view. The most important views are “genome” and “region”. A region view shows part of one chromosome, which must be selected from the “chromosome” scrolling list; optionally, a range (in bp) can be entered in the “from” and “to” text boxes. Then click on “region” to view the scan focussed on that region.
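The region view described above amounts to a range query over stored scan results. The following is only a minimal sketch of that idea, assuming a much simplified layout: the table names, column names and the functions `build_demo_db` and `region_view` are hypothetical, the example uses SQLite (via Python) rather than MySQL so that it is self-contained, and it is not the actual gscandb schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical, simplified layout.

    One row in `scan` per (phenotype, scan type, genome build); one row in
    `scan_point` per tested genome position. These names are assumptions
    made for illustration, not the real gscandb tables.
    """
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE scan (
            scan_id   INTEGER PRIMARY KEY,
            phenotype TEXT NOT NULL,
            scan_type TEXT NOT NULL,   -- e.g. 'additive' or 'full'
            build     TEXT NOT NULL    -- e.g. 'Mus Musculus 34'
        );
        CREATE TABLE scan_point (
            scan_id    INTEGER NOT NULL REFERENCES scan(scan_id),
            chromosome TEXT NOT NULL,
            bp         INTEGER NOT NULL,
            logp       REAL NOT NULL
        );
        """
    )
    return con


def region_view(con, phenotype, scan_type, build, chromosome,
                start=None, end=None):
    """Return (bp, logp) pairs for one scan on one chromosome.

    `start` and `end` are optional base-pair bounds, mirroring the optional
    "from" and "to" boxes of the region view.
    """
    sql = (
        "SELECT p.bp, p.logp FROM scan_point p "
        "JOIN scan s ON s.scan_id = p.scan_id "
        "WHERE s.phenotype = ? AND s.scan_type = ? "
        "AND s.build = ? AND p.chromosome = ?"
    )
    params = [phenotype, scan_type, build, chromosome]
    if start is not None:
        sql += " AND p.bp >= ?"
        params.append(start)
    if end is not None:
        sql += " AND p.bp <= ?"
        params.append(end)
    sql += " ORDER BY p.bp"
    return con.execute(sql, params).fetchall()


if __name__ == "__main__":
    con = build_demo_db()
    con.execute(
        "INSERT INTO scan VALUES (1, 'Cue.Mean.Freeze.Corrected.During', "
        "'additive', 'Mus Musculus 34')"
    )
    # The logP values below are invented purely to exercise the query.
    con.executemany(
        "INSERT INTO scan_point VALUES (?, ?, ?, ?)",
        [(1, "15", 86_000_000, 1.2), (1, "15", 90_000_000, 7.5)],
    )
    print(region_view(con, "Cue.Mean.Freeze.Corrected.During", "additive",
                      "Mus Musculus 34", "15", 87_000_000, 95_000_000))
```

Keeping the scan type and genome build as attributes of the scan, rather than of each point, is one way to let several analyses of the same phenotype coexist, as the text describes.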
For example, if the scan “Cue.Mean.Freeze.Corrected.During” and the chromosome “15” are selected, and “region” is clicked, then the following image appears:

[Figure: region view of Cue.Mean.Freeze.Corrected.During on chromosome 15]

The display has three panels by default:

(a) The top panel displays a graphic of the statistical association between the phenotype and genome position along chromosome 15. The y-axis indicates the degree of statistical significance, measured in logP units (logP is defined as the negative base 10 logarithm of the P-value of the test of no association; for example, a logP of 3 means a P-value of 0.001). Two tracks are plotted by default (determined by the selected items in the “Scan Types” scrolling list): “additive” and “full”. Additive scans measure the additive genetic component (i.e. where the contributions from the two alleles at a locus act additively). Full scans measure the additive plus dominance components. The dotted green line indicates the threshold for 1% genome-wide significance, based on permutation. There is a highly significant peak at around 90 Mb. Below the image is the statistical model that was fitted in the analysis. In this instance it is

Cue.Mean.Freeze.Corrected.During ~ GENDER + EndNormalBW + CoatColour + FPS.TrainingChamber + THE.LOCUS

which means that the phenotype (Cue.Mean.Freeze.Corrected.During, a behavioural trait measuring the degree to which an animal freezes in response to a threatening sound) was modelled in terms of the sex of the animal (GENDER), the body weight (EndNormalBW), the coat colour (CoatColour), the training chamber ID (FPS.TrainingChamber) and THE.LOCUS, which is a matrix of probabilities of descent from each of the eight founder strains of the HS (Mott et al, 2000).

(b) The second panel shows the haplotype map of the eight inbred founder strains used in the HS mapping population. It shows the “strain distribution pattern” of each genotyped SNP in the eight strains C57BL/6J, A/J, AKR/J, BALB/cJ, C3H/HeJ, CBA/J, DBA/2J and LP/J. We will not discuss this further.
(c) The bottom panel shows the positions of known genes along the chromosome. Clicking anywhere on this panel will jump to the Ensembl gene view of that region.

The user can zoom in and out and navigate left or right by clicking on the appropriate parts of the figure. Clicking on the position of the peak in the top panel will zoom in, displaying the interval between 87 and 95 Mb:

[Figure: region view zoomed to 87-95 Mb on chromosome 15]

Zooming in once again gives the detailed view:

[Figure: detailed region view]

The names of the SNPs used in the genetic mapping now appear, and a list of genes in the region is added at the foot of the page. This list contains links to more information about each gene in the genome annotation databases Ensembl, UCSC and NCBI Gene. In this example the gene Cntn1 (Contactin) appears, which is a very strong candidate for the phenotype in question (Cue.Mean.Freeze.Corrected.During is a behavioural phenotype which measures how much a mouse “freezes” in response to a threatening noise).

Viewing All Trait Loci in a Region

Optionally, the user can check the box labelled “Display known trait loci”. This causes an additional panel to be displayed below the haplotype map, showing the 95% confidence intervals of all the QTLs mapped to the region. By clicking on a QTL the user is taken to an interface (described in Deliverable 4.1) which links to the Ensembl MartView of the region:

[Figure: known trait loci panel and its link to Ensembl MartView]

Different Types of Scan

Selecting different scan types reveals insights about the data. Each scan type corresponds to an analysis of the same data, but under a different genetic model. For the HS mouse project we have defined five scan types at present but will add to them. As well as the additive and full models, computed using a multipoint hidden Markov model, the “dominant” model is the logP of the difference between the two models, i.e. the dominance component. We also have scan types “additive.sma” and “full.sma”.
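All scan types are reported on the logP scale defined earlier, the negative base 10 logarithm of the P-value. A small worked sketch of the conversion in both directions is given here; the function names and the `threshold_logp` parameter are hypothetical and introduced only for illustration.

```python
import math


def logp_from_p(p_value):
    """Return logP = -log10(P) for a P-value in (0, 1]."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    return -math.log10(p_value)


def p_from_logp(logp):
    """Invert the logP transformation: P = 10 ** (-logP)."""
    if logp < 0:
        raise ValueError("logp must be non-negative")
    return 10.0 ** (-logp)


def exceeds_threshold(p_value, threshold_logp):
    """True if a test result reaches a given logP threshold.

    `threshold_logp` stands in for a genome-wide threshold such as the
    dotted line in the viewer; any value passed here is an assumption.
    """
    return logp_from_p(p_value) >= threshold_logp


if __name__ == "__main__":
    # A P-value of 0.001 corresponds to a logP of about 3.
    print(logp_from_p(0.001))
    print(p_from_logp(3.0))
```

Working on the logP scale makes highly significant peaks easy to read off a plot, since each additional unit corresponds to a tenfold smaller P-value.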
The “additive.sma” and “full.sma” scan types are additive and full models computed using single-marker association (sma) analysis, and are the analogues of the multipoint “additive” and “full” models. The figure shows the results for a different phenotype, Biochem.ALP (alkaline phosphatase levels), with both types of scan:

[Figure: multipoint and single-marker scans for Biochem.ALP]

Generally the sma performs less well at detecting genetic associations, for reasons described in Mott et al 2000. There is, however, one region in the figure, around 118 Mb, where it performs better.

Viewing complete genome scans

The genome view, which is obtained by clicking on the “genome” button, summarizes genome scans for the selected phenotypes:

[Figure: genome view of the selected phenotypes]

Each chromosome is represented by a separate condensed plot. Scrolling right will show the remainder of the chromosomes. Clicking on a chromosome plot will jump to the region view of that chromosome.

References

Mott R, Talbot CJ, Turri M, Collins AC, Flint J (2000) A method for fine-mapping quantitative trait loci in outbred animal stocks. Proc Natl Acad Sci USA 97(23):12649-12654.

Solberg LC, Valdar W, Gauguier D, Nunez G, Taylor A, Hernandez P, Davidson S, Burns P, Cookson W, Deacon R, Rawlins JNP, Mott R, Flint J (2006) A protocol for high throughput phenotyping, suitable for quantitative trait analysis in mice. Mammalian Genome (in press).

Yalcin B, Flint J, Mott R (2005) Using progenitor strain information to identify quantitative trait nucleotides in outbred mice. Genetics 171(2):673-81.

Valdar W, Solberg L, Gauguier D, Cookson W, Rawlins N, Mott R, Flint J (2006) Genetic and environmental effects on complex traits in mice (submitted).

Future Plans

This is an informal description of the future directions for this body of work: see the future WP4 deliverables for a formal specification. Several groups have expressed interest in obtaining gscandb to install and use locally with their own data, both human and mouse. We therefore intend to make a free downloadable version and write it up for publication.
We will also investigate adding further displays:

(i) A general viewer for displaying additional annotation data encoded in GFF files.

(ii) A local linkage disequilibrium plot for human and mouse data to augment the haplotype viewer.

Our ultimate aim is to identify which of the genes under each QTL is causative for the trait variation (there may be more than one, or in some cases it is possible that no gene under the QTL peak is causative and that a DNA variant there instead acts in trans on some distant gene). We are investigating three sources of data:

(i) Use of mouse inbred strain sequence data to identify potentially causative variants. In Yalcin et al (2005) we published a statistical method called merge analysis for this purpose. We are now applying it to the QTL data.

(ii) Integration of functional information such as gene expression. We have already performed gene expression microarray experiments on 300 HS livers and 200 HS lungs. We are now using this information to filter the gene lists underlying the QTLs associated with liver or lung activity.

(iii) Integration of textual data mining tools. We have a collaboration with Kate Elliot to extend her GeneSniffer software for this purpose. We have found the one-line descriptions attached to many genes to be uninformative. Often more information is available when orthology with the human genome is considered and when the PubMed abstracts are searched in an intelligent manner using keywords related to the phenotype of interest.

We departed slightly from our previous description of deliverable 4.x in that we decided not to implement a variable thresholding tool. Instead we have engineered the database to store genome-wide significance thresholds at the 0.5, 0.05 and 0.01% significance levels, and the viewer to display the 0.01% thresholds as dotted lines. These are computed by permutation (a time-consuming process which takes several weeks) and are much more useful than the variable threshold concept.
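The permutation approach to genome-wide thresholds mentioned above can be illustrated with a toy sketch. The phenotype values are shuffled relative to the genotypes, the scan is recomputed, the largest statistic across all loci is recorded for each shuffle, and the threshold is taken as an upper quantile of those maxima. The sketch makes several assumptions: it uses the absolute Pearson correlation between phenotype and a single-marker genotype score as a stand-in statistic (not the multipoint HAPPY test and not logP), and all function and parameter names are hypothetical.

```python
import random
from statistics import mean


def abs_correlation(x, y):
    """Absolute Pearson correlation; a stand-in association statistic."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no association signal
    return abs(sxy) / (sxx * syy) ** 0.5


def genomewide_max(phenotype, genotypes):
    """Largest statistic over all loci (one list of scores per locus)."""
    return max(abs_correlation(phenotype, locus) for locus in genotypes)


def permutation_threshold(phenotype, genotypes, alpha=0.05,
                          n_perm=1000, seed=0):
    """Approximate genome-wide threshold at level `alpha` by permutation.

    Each permutation breaks the link between phenotype and genotype, so the
    recorded maxima sample the null distribution of the genome-wide maximum.
    """
    rng = random.Random(seed)
    shuffled = list(phenotype)  # copy, so the caller's data is untouched
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        maxima.append(genomewide_max(shuffled, genotypes))
    maxima.sort()
    # Index of the (1 - alpha) empirical quantile, clamped to the last entry.
    k = min(n_perm - 1, int((1 - alpha) * n_perm))
    return maxima[k]


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulated data for illustration only: 50 animals, 40 loci.
    phenotype = [rng.gauss(0.0, 1.0) for _ in range(50)]
    genotypes = [[rng.choice([0, 1, 2]) for _ in range(50)]
                 for _ in range(40)]
    print(permutation_threshold(phenotype, genotypes, alpha=0.05, n_perm=200))
```

Because every permutation requires a full genome scan, the cost grows with the number of permutations, loci and phenotypes, which is consistent with the computation being time-consuming on the real data.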
Finally, we will improve the integration of the data into the Ensembl browser via DAS.

2. Sanger Institute (Cambridge UK)

The main aims for Sanger in this work package were to align human and mouse resequencing reads to their respective genomes, make these alignments available, and call SNPs from them. We have aligned 17,983,363 HapMap resequencing reads to the human genome, and SNPs from these have been submitted to dbSNP. During 2005 we aligned 25,454,096 mouse reads to the mouse genome; these come from multiple strains, including NOD, MSM and C3H, and include all the resequencing reads released by Celera during 2005. From the mouse reads we identified 17,312,642 SNPs, many not previously identified, and a paper about these has been submitted to Nature Genetics (Cunningham et al.).

During 2005 we restructured the way we plan to store and represent these alignments in the long term. All the human and mouse alignments that we have obtained have now been deposited in the trace archive at http://trace.ensembl.org . The human alignments are currently visible via an Ensembl code-based interface at http://www.glovar.org, as illustrated by the example below showing traces aligned to a region of human chromosome 20.

[Figure: traces aligned to a region of human chromosome 20]

Adaptor code for access to the mapping in the trace archive is currently under development, and will be released in the near future. This will enable more flexible and more open access, and extend the display to mouse, while maintaining the current visualisation.

Pablo Marin-Garcia was recruited to work on the BioSapiens variation work from June 13 2005, working with Mark Griffiths, who had previously participated in the project.

Other relevant research

With relevance to WP1 (Gene annotation), postdoc David Carter in Richard Durbin’s group completed his work on a novel comparative gene finding method, DOGFISH, during 2005 and applied it to gene annotation across the whole human genome.
He participated in the EGASP’05 workshop at Hinxton in May 2005, which compared gene annotation methods across a subset of the ENCODE regions of the human genome. EGASP’05 was coordinated by Roderic Guigo (WP1 lead investigator) and involved other BioSapiens WP1 groups. A paper on DOGFISH and its performance in EGASP’05 has been submitted to Genome Biology.

Another relevant project in Richard Durbin’s group is TreeFam (www.treefam.org), which presents gene trees for all animal gene families, allowing inference of orthologs and paralogs between the human and other animal genomes.

3. University of Helsinki, Finland (Esko Ukkonen)

We have developed a new algorithm for reconstructing haplotypes from the genotypes of a population sample. The method is based on a hidden Markov model of haplotypes, and it performs well against other state-of-the-art tools in both accuracy and speed. The software will be made publicly available. The work has been published in:

Pasi Rastas, Mikko Koivisto, Heikki Mannila and Esko Ukkonen: A Hidden Markov Technique for Haplotype Reconstruction. In Proc. 5th Workshop on Algorithms in Bioinformatics - WABI 2005, LNCS 3692, Springer, 2005, pp. 140-151.

The paper is available at http://www.cs.helsinki.fi/u/mkhkoivi/publications/wabi2005.pdf