Haplotype-Based Computational Genetic Analysis In Mice

among strains within short haplotype blocks (less than four SNPs) may result from a
random co-occurrence of the same genetic alteration. These blocks may not result from
true linkage disequilibrium among the strains, and other sequence variation within
these small blocks may not conform to the genotype of the flanking markers. For these
small blocks, the trait values should be compared with individual allelic markers. In
contrast, a haplotypic block constructed with four or more SNPs generated by this
algorithm is highly likely to reflect the presence of true linkage disequilibrium within
the block. Therefore, our computational mapping studies use haplotype blocks that
contain four or more SNPs.
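
The four-SNP rule above can be illustrated with a minimal Python sketch. The block identifiers, SNP names, and the helper `split_blocks_by_size` are hypothetical and are not part of the published method; the sketch only shows how blocks might be separated into those used as haplotype markers and those whose SNPs are compared with the trait individually.

```python
# Hypothetical illustration: names and data below are invented for this sketch.
MIN_SNPS_PER_BLOCK = 4  # threshold stated in the text


def split_blocks_by_size(blocks, min_snps=MIN_SNPS_PER_BLOCK):
    """Split blocks into those with at least min_snps SNPs and shorter ones."""
    mapped, short = {}, {}
    for block_id, snps in blocks.items():
        if len(snps) >= min_snps:
            mapped[block_id] = snps   # used as a haplotype marker
        else:
            short[block_id] = snps    # SNPs compared with the trait one by one
    return mapped, short


blocks = {
    "block_A": ["snp1", "snp2", "snp3", "snp4", "snp5"],
    "block_B": ["snp6", "snp7"],
    "block_C": ["snp8", "snp9", "snp10", "snp11"],
}
mapped, short = split_blocks_by_size(blocks)
print(sorted(mapped))
print(sorted(short))
```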
3. A METHOD FOR HAPLOTYPE-BASED COMPUTATIONAL GENETIC MAPPING
In traditional murine QTL mapping, genetic analysis usually relies on genotypic data
from 50 to 200 markers typed across the genome of the intercross progeny. The
chromosomal regions containing the genetic loci are identified
using interval mapping, which models the recombination between the marker and the
loci. On the other hand, computational mapping using inbred strains takes advantage of
a dense set of markers characterizing all sequence variation in functional regions across
the entire genome. Even though most individual SNPs are binary, many regions of
strong linkage disequilibrium have more than two distinct alleles among the inbred
strains. Some multiallelic sequence variation within a locus can result in more than
two distinct phenotypes (see major histocompatibility complex [MHC] example in ref.
4). The haplotypic blocks defined in Heading 2 represent a natural grouping of SNPs into
such multiallelic regions. When haplotypic blocks are used as markers instead of
individual SNPs, the number of comparisons in an association study is reduced by roughly
100-fold. This computational analysis is performed under the assumption that the
causative genetic locus has been analyzed and haplotypes for this locus that distinguish
among the inbred strains have been identified. Additionally, the contribution of a single
locus to the quantitative trait must be relatively large in order for the genetic effect to be
detectable. The minimum effect size depends on the available number of strains.
Although the assumptions and requirements may seem stringent, many traits, even
complex traits, can be investigated. Mapping quantitative traits onto nonbinary markers
requires new analytical methodology. We will describe how phenotypic traits can be
computationally analyzed and specific candidate genetic loci can be identified.
We will also provide quantitative statistical measures used to assess the results of a
computational mapping experiment.
A Linear Model for Haplotype-Based Computational Mapping of Genetic Traits
Genetic researchers have traditionally applied a linear model to analyze a quantitative
trait using the observed variance among a defined population. For a model in which an
observed difference is caused by a single genetic locus, the total phenotypic variance is
first partitioned into genetic and environmental variances. Following this, the genetic
variance is further divided into a variance resulting from additive and dominance
effects. The additive effect is half the measured trait value difference between the two
strains with homozygous alleles. The dominance effect is quantified as the difference in
measured trait values between strains with heterozygous alleles and the average of those
with homozygous alleles. Experimental intercross progeny can be heterozygous at many
genetic loci, but the parental inbred mouse strains are homozygous at all genetic loci.
Therefore, the dominance effect does not contribute to genetic variance among the
parental inbred strains. This greatly simplifies the analysis of genetically controlled trait
differences among inbred strains, which provides a key advantage to our haplotype-based
mapping method. Assuming that the genotypic differences within the gene
controlling the selected trait of interest have been characterized, the linear model for the
trait becomes

$$y_j = f(G_j) + \varepsilon_j \qquad (1)$$

where $y_j$ is the trait value for the $j$th inbred strain, $f(G_j)$ is the
component of the phenotypic trait that is determined by the genotype controlling the
trait, and $\varepsilon_j$ is the residual term for the $j$th strain, which is independent of the genetic
effect at the given locus. Assume that genetic heterogeneity within the gene is fully
captured by known haplotypes, with the haplotypic block constructed using allelic
markers within the gene. The genotype $G_j$ then takes a value in $\{H_1, H_2, \ldots, H_k\}$,
corresponding to the $k$ distinct haplotypes within the gene found among the
inbred strains analyzed; $k = 2$ or $3$ for most of the haplotype blocks. The trait value
determined by the genotype component is now

$$f(G_j) = \mu_l \quad \text{if } G_j = H_l, \qquad l = 1, 2, \ldots, k,$$

where $\mu_l$ is the mean trait value associated with haplotype $H_l$.
For a trait whose value varies among the inbred strains, mapping the genetic locus
becomes a process of finding the haplotype block whose genetic variance explains the
largest amount of the total trait variance. In other words, the residual variance $\mathrm{var}(\varepsilon)$ is
minimized, where

$$\mathrm{var}(\varepsilon) = \frac{1}{n-1} \sum_{l=1}^{k} \sum_{j:\,G_j = H_l} (y_j - \hat{\mu}_l)^2 .$$

If the number of haplotypes within each haplotype block is fixed, the problem is further
reduced to simple linear regression. Note that, in general, the number of haplotypes $k$
varies among blocks, and

$$\hat{\mu}_l = \frac{1}{n_l} \sum_{j:\,G_j = H_l} y_j$$

is the estimated trait value determined by the genotype. Here $n$ is the total number of
strains and $n_l$ is the number of strains with haplotype $H_l$. $\mathrm{var}(\varepsilon)$ is the “within-group sum of
squares” divided by $n - 1$. The within-group sum of squares is used as the criterion
function for the k-means clustering algorithm (11). It is the most commonly used measure
of the clustering quality for a data set $Y$ partitioned into a fixed number of clusters.
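
As a minimal sketch of the quantity described above, the following Python code groups strains by their haplotype in a block, estimates each haplotype's trait value as the group mean, and divides the within-group sum of squares by n - 1. The strain names, trait values, and block assignments are invented for illustration, and the function names are assumptions of this sketch rather than part of the published software.

```python
from collections import defaultdict


def group_by_haplotype(trait, haplotype):
    """Collect trait values of strains that share the same haplotype."""
    groups = defaultdict(list)
    for strain, value in trait.items():
        groups[haplotype[strain]].append(value)
    return dict(groups)


def within_group_ss(groups):
    """Sum of squared deviations of each trait value from its haplotype mean."""
    ssw = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)  # estimated trait value for this haplotype
        ssw += sum((y - mean) ** 2 for y in values)
    return ssw


def residual_variance(groups):
    """Within-group sum of squares divided by n - 1."""
    n = sum(len(values) for values in groups.values())
    return within_group_ss(groups) / (n - 1)


# Invented trait values for six hypothetical strains.
trait = {"s1": 10.0, "s2": 12.0, "s3": 11.0, "s4": 20.0, "s5": 22.0, "s6": 21.0}
# Two hypothetical blocks, each assigning every strain to one of two haplotypes.
block_x = {"s1": "H1", "s2": "H1", "s3": "H1", "s4": "H2", "s5": "H2", "s6": "H2"}
block_y = {"s1": "H1", "s2": "H2", "s3": "H1", "s4": "H2", "s5": "H1", "s6": "H2"}

var_x = residual_variance(group_by_haplotype(trait, block_x))
var_y = residual_variance(group_by_haplotype(trait, block_y))
print(var_x, var_y)  # the block with the smaller value explains more of the trait
```

In this toy data set, the strain grouping of `block_x` follows the trait values closely, so its residual variance is much smaller than that of `block_y`.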
Let SST be the total sum of squares for the measured trait values. It is easy to see that
SST for data set $Y$ is

$$\mathrm{SST} = \mathrm{SSW} + \mathrm{SSB},$$

where SSW is the within-group sum of squares and SSB is the between-group sum of
squares. Similarly, the total
variance var(Y) consists of the genetic variance and the residual variance. The
normalized within-group sum of squares, SSW/SST, can be interpreted as the proportion of
the total variance
that is not explained by the genotype of the gene in question. The normalized within-group
sum of squares provides an objective measure to compare the genetic effect only
for blocks with a fixed number of haplotypes. It is not fair to compare the residual
variance for different $k$, because different numbers of parameters (the haplotype means
$\mu_1, \mu_2, \ldots, \mu_k$) were fit.
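
The decomposition of the total sum of squares can be checked numerically. The sketch below, which uses invented trait values and a hypothetical helper `sums_of_squares`, computes the total, within-group, and between-group sums of squares for one block and reports the normalized within-group sum of squares, that is, the fraction of trait variation not explained by the haplotype grouping.

```python
def sums_of_squares(groups):
    """Return (SST, SSW, SSB) for trait values grouped by haplotype."""
    all_values = [y for values in groups.values() for y in values]
    grand_mean = sum(all_values) / len(all_values)
    sst = sum((y - grand_mean) ** 2 for y in all_values)
    ssw = 0.0
    ssb = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        ssw += sum((y - mean) ** 2 for y in values)       # spread within a haplotype
        ssb += len(values) * (mean - grand_mean) ** 2     # spread between haplotype means
    return sst, ssw, ssb


# Invented trait values for two haplotypes of a hypothetical block.
groups = {"H1": [10.0, 12.0, 11.0], "H2": [20.0, 22.0, 21.0]}
sst, ssw, ssb = sums_of_squares(groups)
unexplained = ssw / sst  # normalized within-group sum of squares
print(sst, ssw, ssb, unexplained)
```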
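In order to appropriately compare the normalized within-group sum of squares for
different $k$, it is necessary to use parametric statistics. We apply the analysis of
variance (ANOVA) design to analyze genetic effect. In (1), assume that the residual terms
$\varepsilon_j$ are independent and normally distributed with mean zero and constant variance $\sigma^2$.
For each haplotype block, the F statistic is calculated as

$$F = \frac{\mathrm{SSB}/(k-1)}{\mathrm{SSW}/(n-k)},$$

where SSB and SSW are the between-group and within-group sums of squares, $n$ is the
number of strains, and $k$ is the number of haplotypes in the block.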
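
A minimal Python sketch of this one-way ANOVA computation follows. It computes the F statistic from the between-group and within-group sums of squares and obtains the upper-tail p value of the F distribution with k - 1 and n - k degrees of freedom through the regularized incomplete beta function, evaluated with a standard continued-fraction expansion. The trait values and all function names are assumptions made for illustration; this is not the authors' implementation, and a statistics library could be used in place of the hand-written beta function.

```python
import math


def _beta_continued_fraction(a, b, x, max_iter=300, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < tiny:
        d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b), the regularized incomplete beta function."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def f_test(groups):
    """Return (F, p) for trait values grouped by haplotype (one-way ANOVA)."""
    k = len(groups)
    all_values = [y for values in groups.values() for y in values]
    n = len(all_values)
    if k < 2 or n <= k:
        raise ValueError("need at least two haplotypes and more strains than haplotypes")
    grand_mean = sum(all_values) / n
    ssw = 0.0
    ssb = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        ssw += sum((y - mean) ** 2 for y in values)
        ssb += len(values) * (mean - grand_mean) ** 2
    if ssw == 0.0:
        raise ValueError("within-group sum of squares is zero; F is undefined")
    d1, d2 = k - 1, n - k
    f = (ssb / d1) / (ssw / d2)
    # Upper tail of the F distribution: P(F' > f) = I_x(d2/2, d1/2), x = d2 / (d2 + d1 * f).
    p = regularized_incomplete_beta(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))
    return f, p


# Invented trait values for a hypothetical block with three haplotypes.
groups = {"H1": [10.0, 12.0, 11.0], "H2": [20.0, 22.0, 21.0], "H3": [15.0, 16.0, 14.0]}
f, p = f_test(groups)
print(f, p)
```

With three haplotypes the numerator has two degrees of freedom, and the upper-tail probability then has the closed form (1 + 2F/(n - k))^(-(n - k)/2), which offers a convenient check on the numerical result.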
The F statistic can then be used to test the null hypothesis that the haplotype means are
equal, $\mu_1 = \mu_2 = \cdots = \mu_k$. For each block, a p value is
calculated by comparing F with the theoretical F distribution with $k - 1$ and $n - k$
degrees of freedom
(Fig. 1). The correlation between strain groupings within haplotypic blocks and
phenotypic trait values is assessed by this calculated p value. Note that the p value
calculated from F statistic analysis is an approximate p value. It cannot be interpreted
as an exact estimate of the probability of a false-positive result. When the distribution
of the residual terms $\varepsilon_j$ deviates from normality or the sample size is small, the p value is
not accurate. Furthermore, it is not corrected for multiple comparisons. When allelic
information is missing, the correct haplotype may not be available for all strains within
a haplotypic block. For these blocks, the trait values can be compared with genotype for
only a reduced set of strains. When key strains are missing, the p values obtained after
computational mapping using a reduced set of strains may be much smaller than would
be obtained if the allelic data for all the strains were used. This does not indicate that the
block is better correlated with the trait data. In order to rank blocks more
appropriately, an adjustment factor is applied to the p values obtained using blocks with
missing haplotypes. For a block with $k$ haplotypes and some strain haplotypes missing,
let $p_{min}$ be the minimum p value among all possible partitions of the $n$ strains into $k$ haplotypes, and let
$p'_{min}$ be the minimum p value among all possible partitions of the subset of the strains
into $k$ haplotypes. The multiplicative factor $\min\{p_{min}/p'_{min}, 1\}$ is applied to the p value
score. This crudely defined factor ensures that the p value for a block