Second Annual Report

WP4: Variation (haplotypes and SNPs)
Introduction
Genotype data from our mouse Quantitative Trait Locus (QTL) mapping experiment
became available in March 2005, to augment the phenotype data we had collected
previously. We have spent much of the past 9 months analysing the data from the
population of 2000 Heterogeneous Stock (HS) mice. We have mapped over 100
phenotypes (behavioural, physiological, asthma-related and diabetes related, see
www.well.ox.ac.uk/mouse/HS for a listing) across build 34 of the mouse genome
using 13000 SNPs spaced 200kb apart. We identified over 700 QTL at a genome-wide
5% significance threshold. QTL mapping was performed using the R HAPPY
package (www.well.ox.ac.uk/happy , Mott et al 2000) which uses a multipoint
hidden-Markov formulation to map the QTL. The mean 50% confidence interval for
each QTL is about 2 Mb, and the mean 95% confidence interval about 4Mb. These
intervals are small enough that it is possible to hunt for the causative genes.
Along with the phenotype data, we collected a large number of covariates – these are
auxiliary observations such as the ID of the experimenter, the time and date of the
experiment, the body weight of the animal etc. One of our most exciting findings has
been the very large number of interactions between the genotype and the
environment, which we have been able to quantify systematically for the first time
(Valdar et al 2006a). We have also been able to map QTL responsible for the gene by
environment interactions. Strikingly, we find that these QTL generally do not overlap
with the QTL identified during the genome scans that ignore gene by environment
interaction.
Some of the results are already in press (Solberg et al, 2006), submitted (Valdar et
al) or in preparation (Valdar et al 2006b).
GSCANDB
We have written gscandb, a relational database (implemented in MySQL) and a web
interface (in Perl CGI) for viewing the results of genome scans. This corresponds
to the BIOSAPIENS milestone 4.5. The database stores information about the
population under study, the phenotypes measured on the subjects and the results of the
genome scans. Multiple analyses of the same scan (e.g. different genetic models such
as additive or dominance effects) can be stored, and scans can be stored against
multiple genome builds.
Using the viewer it is possible to browse the data, viewing as many phenotypes as
desired, either on a genome-wide scale or within a chromosomal region. The user can
select which phenotypes and which types of scan to display. The user can also specify
a chromosomal region by its base pair coordinates, or jump to a particular gene or
SNP. The browser automatically links to Ensembl and UCSC to draw in gene
annotation information for display. We have found this viewer essential for our
exploration of this challenging and important data set.
At present the database contains mouse QTL data but is extendable to other species.
The browser interface is at http://zeon.well.ox.ac.uk/rmott-bin/wwwqtl.cgi
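The storage model described above can be illustrated with a small sketch. It is not the gscandb schema, which is not reproduced in this report: the table names, column names, the build label and the logP values are all hypothetical, and Python's built-in sqlite3 module stands in for MySQL so that the example runs without a server. It shows one way to hold several scan types per phenotype and genome build, and to retrieve the points of one scan inside a region given in base pair coordinates.

```python
import sqlite3

# Hypothetical, much-simplified schema (NOT the real gscandb tables).
SCHEMA = """
CREATE TABLE phenotype (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE scan (
    id INTEGER PRIMARY KEY,
    phenotype_id INTEGER REFERENCES phenotype(id),
    scan_type TEXT,   -- e.g. 'additive' or 'full'
    build TEXT        -- genome build label (made-up format)
);
CREATE TABLE scan_point (
    scan_id INTEGER REFERENCES scan(id),
    chromosome TEXT,
    bp INTEGER,       -- base pair coordinate on the build
    logp REAL
);
"""


def region_points(conn, phenotype, scan_type, build, chromosome, start, end):
    """Return (bp, logP) pairs for one scan inside a chromosomal region."""
    rows = conn.execute(
        """SELECT p.bp, p.logp
           FROM scan_point p
           JOIN scan s ON s.id = p.scan_id
           JOIN phenotype f ON f.id = s.phenotype_id
           WHERE f.name = ? AND s.scan_type = ? AND s.build = ?
             AND p.chromosome = ? AND p.bp BETWEEN ? AND ?
           ORDER BY p.bp""",
        (phenotype, scan_type, build, chromosome, start, end),
    )
    return rows.fetchall()


def demo():
    """Load a few invented scan points and query one region."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO phenotype (id, name) VALUES (1, 'Biochem.ALP')")
    conn.execute("INSERT INTO scan VALUES (1, 1, 'additive', 'build34')")
    conn.execute("INSERT INTO scan VALUES (2, 1, 'full', 'build34')")
    conn.executemany(
        "INSERT INTO scan_point VALUES (?, ?, ?, ?)",
        [
            (1, "15", 86_000_000, 1.2),  # toy values, not real results
            (1, "15", 90_000_000, 7.5),
            (1, "15", 99_000_000, 0.8),
            (2, "15", 90_000_000, 8.1),
        ],
    )
    return region_points(
        conn, "Biochem.ALP", "additive", "build34", "15", 87_000_000, 95_000_000
    )
```

Calling demo() returns only the additive scan point that falls between 87 and 95 Mb on chromosome 15. Keeping the scan type and build on the scan row, rather than on each point, is what lets several analyses of the same phenotype coexist.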
Examples
To log in, select the project “HS_mouse” with password “socksbeware”.
Once logged in the interface will appear thus:
Using this interface, select the genome build required (default is “Mus Musculus 34”,
which is correct), the phenotypes of interest from the scrolling list titled “Plottable
Scans” (more than one scan can be selected), the type of scan from “Scan Types”, and
the type of view. The most important views are “genome” and “region”.
A region view will show a part of one chromosome which must be selected from the
“chromosome” scrolling list, and optionally a range (in bp) can be entered in the
“from” and “to” text boxes. Then click on “region” to view the scan focussed on that
region. For example, if the scan “Cue.Mean.Freeze.Corrected.During” and the
chromosome “15” are selected, and “region” clicked then the following image
appears:
The display has three panels by default:
(a) The top panel displays a graphic of the statistical association between the
phenotype and genome position along chromosome 15. The y-axis indicates the
degree of statistical significance, measured in logP units [logP is defined to be the
negative base 10 logarithm of the P-value of the test of no association; for example a
logP of 3 means a p-value of 0.001]. Two tracks are plotted (determined by the
selected items in the “Scan Types” scrolling list) by default: “additive” and “full”.
Additive scans measure the additive genetic component (i.e. where the contributions from
the two alleles at a locus act additively). Full scans measure the additive plus
dominance components. The dotted green line indicates the threshold for 1%
genome-wide significance, based on permutation. There is a highly significant peak at
around 90 Mb.
Below the image is the statistical model that was fitted in the analysis. In this instance
it is
Cue.Mean.Freeze.Corrected.During ~ GENDER + EndNormalBW + CoatColour + FPS.TrainingChamber + THE.LOCUS
which means that the phenotype (Cue.Mean.Freeze.Corrected.During, a behavioural trait
measuring the degree to which an animal freezes in response to a threatening sound)
was modelled in terms of the sex of the animals (GENDER), the body weight
(EndNormalBW), the coat colour (CoatColour), the training chamber ID
(FPS.TrainingChamber) and by THE.LOCUS, which is a matrix giving the probability of
descent from each of the eight founder strains in the HS (Mott et al, 2000).
(b) The second panel shows the haplotype map of the eight inbred founder strains
used in the HS mapping population. It shows the “strain distribution pattern” of each
genotyped SNP in the eight strains C57BL/6J, A/J, AKR/J, BALB/cJ, C3H/HeJ,
CBA/J, DBA/2J, LP/J. We will not discuss this further.
(c) The bottom panel shows the positions of known genes along the chromosome.
Clicking anywhere on this panel will jump to the Ensembl gene view of that region.
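The logP scale used on the y-axis of the top panel follows directly from the definition given in (a). The short sketch below is illustrative only; the function names are ours and are not part of gscandb or HAPPY.

```python
import math


def logp(p_value):
    """Negative base 10 logarithm of a P-value."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("P-value must lie in (0, 1]")
    return -math.log10(p_value)


def p_from_logp(value):
    """Inverse transform: recover the P-value from a logP score."""
    return 10.0 ** (-value)
```

For example, logp(0.001) is 3, matching the example in the text, and a larger logP always corresponds to a smaller P-value.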
The user can zoom in and out and navigate left or right by clicking on appropriate
parts of the figure. Clicking on the position of the peak in the top panel will zoom in,
displaying the interval between 87 and 95 Mb:
Zooming in once again gives the detailed view:
The names of the SNPs used in the genetic mapping now appear, and a list of genes
in the region is added to the foot of the page. This list contains links to more
information about each gene in the genome annotation databases Ensembl, UCSC and
NCBI gene.
In this example the gene Cntn1 (Contactin) appears, which is a very strong candidate
for the phenotype in question (Cue.Mean.Freeze.Corrected.During is a behavioural
phenotype which measures how much a mouse “freezes” in response to a threatening
noise).
Viewing All Trait Loci in a Region
Optionally the user can check the box labelled “Display known trait loci”. This causes
an additional panel to be displayed below the haplotype map, showing the 95%
confidence intervals of all the QTLs mapped to the region. By clicking on a QTL the
user is taken to an interface (described in Deliverable 4.1) which links to the Ensembl
Martview of the region:
Different Types of Scan
Selecting different scan types reveals insights about the data. Each scan type
corresponds to an analysis of the same data, but under a different genetic model. For
the HS mouse project we have defined five scan types at present but will add to them.
As well as the additive and full models, computed using a multipoint hidden Markov
model, the “dominant” model is the logP of the difference between the two models, i.e.
the dominance component. We also have scan types “additive.sma” and “full.sma”.
These are additive and full models computed using single-marker association analysis
and are the analogues of the multipoint “additive” and “full” models. In the figure
below we show the results for a different phenotype, Biochem.ALP (alkaline
phosphatase levels), showing both scan types. Generally the sma performs less well
at detecting genetic associations, for reasons described in Mott et al 2000. There is,
however, one region in the figure, around 118 Mb, where it performs better.
Viewing complete genome scans
The genome view, which is obtained by clicking on the “genome” button,
summarizes genome scans for the selected phenotypes:
Each chromosome is represented by a separate condensed plot. Scrolling right will
show the remainder of the chromosomes. Clicking on a chromosome plot will jump to
the region view of that chromosome.
References
Mott R, Talbot CJ, Turri M, Collins AC, Flint J (2000) A method for fine-mapping
quantitative trait loci in outbred animal stocks. Proc Natl Acad Sci USA,
97(23):12649-12654.
Solberg LC, Valdar W, Gauguier D, Nunez G, Taylor A, Hernandez P, Davidson S,
Burns P, Cookson W, Deacon R, Rawlins JNP, Mott R, Flint J (2006) A protocol for
high throughput phenotyping, suitable for quantitative trait analysis in mice.
Mammalian Genome (in press)
Yalcin B, Flint J, Mott R (2005) Using progenitor strain information to identify
quantitative trait nucleotides in outbred mice. Genetics, 171(2):673-681.
Valdar W, Solberg L, Gauguier D, Cookson W, Rawlins N, Mott R, Flint J (2006)
Genetic and environmental effects on complex traits in mice (submitted).
Future Plans.
This is an informal description of the future directions for this body of work: see the
future WP4 deliverables for a formal specification.
Several groups have expressed interest in obtaining gscandb to install and use locally
with their own data, both human and mouse. We therefore intend to make a free
downloadable version and write it up for publication. We will also investigate adding
additional displays:
(i) A general viewer for displaying additional annotation data encoded in GFF files.
(ii) A local linkage disequilibrium plot for human and mouse data to augment the
haplotype viewer.
Our ultimate aim is to identify which of the genes under each QTL is causative for the
trait variation (there may be more than one, or in some cases it is possible that no
gene under the QTL peak is causative, instead there is a DNA variant which acts in
trans on some distant gene). We are investigating three sources of data:
(i) Use of mouse inbred strain sequence data to identify potentially causative variants.
In (Yalcin et al 2005) we published a statistical method called merge analysis for this
purpose. We are now applying it to the QTL data.
(ii) Integration of functional information such as gene expression. We have already
performed gene expression microarray experiments on 300 HS livers and 200 HS
lungs. We are now using this information to filter the gene lists underlying the QTLs
associated with liver or lung activity.
(iii) Integration of textual data mining tools. We have a collaboration with Kate Elliot
to extend her GeneSniffer software for this purpose. We have found the one-line
descriptions attached to many genes to be uninformative. Often more information is
available when orthology with the human genome is considered and if the Pubmed
abstracts are searched in an intelligent manner using keywords related to the
phenotype of interest.
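The idea behind using founder strain sequence to test candidate variants, mentioned in the first item above, can be sketched as follows. A variant's strain distribution pattern partitions the eight founders into allele groups, and the per-founder descent information at a locus can be merged within each group. This is a minimal illustration of the merging step only, under our own assumptions: the function name, the dosage representation and the example values are hypothetical, and the statistical comparison of the merged model against the eight-strain model is not shown. It is not the published implementation.

```python
# Founder strains of the HS, in the order used for the example vectors below.
FOUNDERS = ["C57BL/6J", "A/J", "AKR/J", "BALB/cJ",
            "C3H/HeJ", "CBA/J", "DBA/2J", "LP/J"]


def merge_by_sdp(founder_dosage, sdp):
    """Collapse per-founder dosages into one dosage per allele group.

    founder_dosage: expected number of chromosomes (0 to 2) descended from
        each founder at the locus, in FOUNDERS order (assumed representation).
    sdp: strain distribution pattern of a variant, one allele label per founder.
    """
    if len(founder_dosage) != len(FOUNDERS) or len(sdp) != len(FOUNDERS):
        raise ValueError("expected one value per founder strain")
    merged = {}
    for dose, allele in zip(founder_dosage, sdp):
        merged[allele] = merged.get(allele, 0.0) + dose
    return merged


# Invented example: an animal descended mostly from C57BL/6J at this locus,
# and a variant where C57BL/6J and A/J carry 'A' and the other strains 'G'.
example = merge_by_sdp([1.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0], "AAGGGGGG")
```

Here the eight founder dosages reduce to an expected 1.5 copies of allele A and 0.5 copies of allele G. A variant whose merged groups explain the phenotype about as well as the full eight-strain model remains a candidate, while one that fits much worse can be set aside.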
We departed slightly from our previous description of deliverable 4.x in that we
decided not to implement a variable thresholding tool. Instead we have engineered the
database to store genome-wide significance thresholds at the 0.5, 0.05 and 0.01
significance levels, and the viewer to display the 0.01 (1%) thresholds as dotted
lines. These are computed by permutation (a time-consuming process which takes
several weeks) and are much more useful than the variable threshold concept.
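The permutation approach to genome-wide thresholds can be sketched in a few lines. The sketch makes strong simplifying assumptions: it uses the squared correlation between the phenotype and a single-marker genotype as the test statistic, whereas the real scans fit the HAPPY models with covariates, and all function and parameter names are hypothetical. In each permutation the phenotype values are shuffled across animals, the largest statistic over all markers is recorded, and the threshold is an upper quantile of those maxima.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def permutation_threshold(phenotype, genotypes, alpha=0.05, n_perm=200, seed=0):
    """Estimate a genome-wide threshold for the maximum statistic.

    genotypes is a list of markers, each a list of per-animal genotype codes.
    Returns the approximate (1 - alpha) quantile of the genome-wide maximum
    statistic over n_perm shufflings of the phenotype.
    """
    rng = random.Random(seed)
    shuffled = list(phenotype)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any phenotype-genotype association
        maxima.append(max(r_squared(marker, shuffled) for marker in genotypes))
    maxima.sort()
    index = min(n_perm - 1, int((1.0 - alpha) * n_perm))
    return maxima[index]
```

Taking the maximum over all markers within each permutation is what makes the threshold genome-wide rather than pointwise. The cost grows with the number of permutations multiplied by the cost of a full genome scan, which is consistent with the long running times noted above.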
Finally we will improve the integration of the data into the Ensembl browser via
DAS.
2. Sanger Institute (Cambridge UK)
The main aims for Sanger in this work package were to align human and mouse
resequencing reads to their respective genomes, make these alignments available, and
call SNPs from them.
We have aligned 17,983,363 HapMap resequencing reads to the human genome, and
SNPs from these have been submitted to dbSNP. During 2005 we aligned
25,454,096 mouse reads to the mouse genome; these come from multiple strains,
including NOD, MSM and C3H, and all the resequencing reads released by Celera
during 2005. From the mouse reads we identified 17,312,642 SNPs, many not
previously identified, and a paper about these has been submitted to Nature Genetics
(Cunningham et al.).
During 2005 we restructured the way we plan to store and represent these alignments
in the long term. All the human and mouse alignments that we have obtained have
now been deposited in the trace archive at http://trace.ensembl.org . The human
alignments are currently visible via an Ensembl code-based interface at
http://www.glovar.org, as illustrated by the example below showing traces aligned to
a region of human chromosome 20. Adaptor code for access to the mapping in the
trace archive is currently under development, and will be released in the near future.
This will enable more flexible and more open access, and extend the display to
mouse, while maintaining the current visualisation.
Pablo Marin-Garcia was recruited to work on the BioSapiens variation work from
June 13 2005, working with Mark Griffiths who had previously participated in the
project.
Other relevant research
With relevance to WP1 (Gene annotation), postdoc David Carter in Richard Durbin’s
group completed his work on a novel comparative gene finding method, DOGFISH,
during 2005 and applied it to gene annotation across the whole human genome. He
participated in the EGASP’05 workshop at Hinxton in May 2005, which compared
gene annotation methods across a subset of the ENCODE regions of the human
genome. EGASP’05 was coordinated by Roderic Guigo (WP1 lead investigator) and
involved other BioSapiens WP1 groups. A paper on DOGFISH and its performance
in EGASP’05 has been submitted to Genome Biology.
Another relevant project in Richard Durbin’s group is TreeFam (www.treefam.org)
which presents gene trees for all animal gene families, allowing inference of orthologs
and paralogs between the human and other animal genomes.
3. University of Helsinki, Finland (Esko Ukkonen)
We have developed a new algorithm for reconstructing haplotypes from genotypes of
a population sample. The method is based on a hidden Markov model of haplotypes,
and it performs well against other state-of-the-art tools both on accuracy and on
speed. The software will be made publicly available. The work has been published
in:
Pasi Rastas, Mikko Koivisto, Heikki Mannila and Esko Ukkonen:
A Hidden Markov Technique for Haplotype Reconstruction.
In Proc. 5th Workshop on Algorithms in Bioinformatics - WABI 2005,
LNCS 3692, Springer, 2005, pp. 140-151.
The paper is available at
http://www.cs.helsinki.fi/u/mkhkoivi/publications/wabi2005.pdf