Haplotype-Based Computational Genetic Analysis In Mice

among strains within short haplotype blocks (fewer than four SNPs) may result from a random co-occurrence of the same genetic alteration. These blocks may not result from true linkage disequilibrium among the strains, and other sequence variation within these small blocks may not conform to the genotype of the flanking markers. For these small blocks, the trait values should be compared with individual allelic markers. In contrast, a haplotypic block constructed with four or more SNPs generated by this algorithm is highly likely to reflect the presence of true linkage disequilibrium within the block. Therefore, our computational mapping studies use haplotype blocks that contain four or more SNPs.

3. A METHOD FOR HAPLOTYPE-BASED COMPUTATIONAL GENETIC MAPPING

In traditional murine QTL mapping, genetic analysis is usually performed on genotypic data obtained from 50 to 200 markers that characterize the genome of the intercross progeny. The chromosomal regions containing the genetic loci are identified using interval mapping, which models the recombination between the marker and the loci. Computational mapping using inbred strains, in contrast, takes advantage of a dense set of markers characterizing all sequence variation in functional regions across the entire genome. Even though most individual SNPs are binary, many regions of strong linkage disequilibrium have more than two distinct alleles among the inbred strains. Some multiallelic sequence variation within a locus can result in more than two distinct phenotypes (see major histocompatibility complex [MHC] example in ref. 4). The haplotypic blocks defined in Heading 2 represent a natural grouping of SNPs into such multiallelic regions. When haplotypic blocks are used as markers instead of individual SNPs, the number of comparisons in an association study is reduced by roughly 100-fold.
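To illustrate how binary SNPs can combine into a multiallelic marker, the following minimal sketch groups strains by their allele pattern across one block. It is not the block-construction algorithm of Heading 2; the function name `haplotypes_in_block`, the strain labels, and the genotypes are all hypothetical.

```python
from collections import defaultdict


def haplotypes_in_block(block_alleles):
    """Group strains by their allele pattern across the SNPs of one block.

    block_alleles maps a strain name to a string with one allele per SNP.
    Returns {allele pattern: [strains carrying that pattern]}.
    """
    groups = defaultdict(list)
    for strain, alleles in block_alleles.items():
        groups[alleles].append(strain)
    return dict(groups)


# Hypothetical block of four SNPs typed in six strains.
# Every SNP position is biallelic, yet three distinct haplotypes occur.
block = {
    "strain_1": "AGTC",
    "strain_2": "AGTC",
    "strain_3": "GATC",
    "strain_4": "GATC",
    "strain_5": "GACT",
    "strain_6": "AGTC",
}

haps = haplotypes_in_block(block)
for pattern, strains in haps.items():
    print(pattern, strains)
```

Treating the block as a single marker replaces four SNP-by-SNP comparisons with one comparison among three strain groups.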
This computational analysis is performed under the assumption that the causative genetic locus has been analyzed and that haplotypes for this locus that distinguish among the inbred strains have been identified. Additionally, the contribution of a single locus to the quantitative trait must be relatively large for the genetic effect to be detectable. The minimum effect size depends on the number of strains available. Although these assumptions and requirements may seem stringent, many traits, even complex traits, can be investigated. Mapping quantitative traits onto nonbinary markers requires new analytical methodology. We will describe how phenotypic traits can be computationally analyzed and how specific candidate genetic loci can be identified. We will also provide quantitative statistical measures used to assess the results of a computational mapping experiment.

A Linear Model for Haplotype-Based Computational Mapping of Genetic Traits

Genetic researchers have traditionally applied a linear model to analyze a quantitative trait using the observed variance among a defined population. For a model in which an observed difference is caused by a single genetic locus, the total phenotypic variance is first partitioned into genetic and environmental variances. The genetic variance is then further divided into variances resulting from additive and dominance effects. The additive effect is half the difference in measured trait value between the two strains with homozygous alleles. The dominance effect is quantified as the difference between the measured trait value of strains with heterozygous alleles and the average of those with homozygous alleles. Experimental intercross progeny can be heterozygous at many genetic loci, but the parental inbred mouse strains are homozygous at all genetic loci. Therefore, the dominance effect does not contribute to genetic variance among the parental inbred strains.
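In symbols, the two effects defined above can be written as follows. The notation is assumed here for illustration only: $\mu_{AA}$ and $\mu_{aa}$ denote the mean trait values of the two homozygous classes, $\mu_{Aa}$ that of the heterozygous class, and $a$ and $d$ the additive and dominance effects.

$$a = \frac{\mu_{AA} - \mu_{aa}}{2}, \qquad d = \mu_{Aa} - \frac{\mu_{AA} + \mu_{aa}}{2}$$

Because a panel of inbred strains contains no heterozygous class, $\mu_{Aa}$ is never observed and no term in $d$ arises; only the additive effect remains.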
This greatly simplifies the analysis of genetically controlled trait differences among inbred strains, which provides a key advantage to our haplotype-based mapping method.

Assuming that the genotypic differences within the gene controlling the selected trait of interest have been characterized, the linear model for the trait becomes

$$y_j = f(G_j) + \varepsilon_j, \qquad (1)$$

where $y_j$ is the trait value for the $j$th inbred strain, $f(G_j)$ is the component of the phenotypic trait that is determined by the genotype controlling the trait, and $\varepsilon_j$ is the residual term in the $j$th strain, which is independent of the genetic effect at the given locus. Assume that genetic heterogeneity within the gene is fully captured by the known haplotypes of the haplotypic block constructed from allelic markers within the gene. The genotype $G_j$ then takes a value in $\{H_1, H_2, \ldots, H_k\}$, corresponding to the $k$ distinct haplotypes within the gene found among the inbred strains analyzed; $k = 2$ or 3 for most of the haplotype blocks. The trait value determined by the genotype component is now

$$f(G_j) = \sum_{l=1}^{k} \beta_l \, I(G_j = H_l),$$

where $\beta_l$ is the trait value associated with haplotype $H_l$ and $I(\cdot)$ is the indicator function.

For a trait whose value varies among the inbred strains, mapping the genetic locus becomes a process of finding the haplotype block whose genetic variance explains the largest amount of the total trait variance. In other words, the residual variance $\mathrm{var}(\varepsilon)$ is minimized, where

$$\mathrm{var}(\varepsilon) = \frac{1}{n-1} \sum_{j=1}^{n} \left( y_j - \hat{y}_j \right)^2 .$$

Wang and Peltz

If the number of haplotypes within each haplotype block is fixed, the problem is further reduced to simple linear regression. Note that, in general, the number of haplotypes $k$ varies among blocks. Here

$$\hat{y}_j = \frac{1}{n_l} \sum_{i:\, G_i = H_l} y_i \quad \text{for } G_j = H_l$$

is the estimated trait value determined by the genotype, $n$ is the number of strains, and $n_l$ is the number of strains with haplotype $H_l$. Thus $\mathrm{var}(\varepsilon)$ is the "within-group sum of squares" divided by $n - 1$. The within-group sum of squares is used as the criterion function for the k-means clustering algorithm (11).
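The residual-variance criterion described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `within_group_ss`, the strain labels, the trait values, and the two blocks are hypothetical.

```python
from collections import defaultdict


def within_group_ss(trait, haplotype):
    """Return (within-group sum of squares, haplotype group means) for one block.

    trait:     {strain: trait value}
    haplotype: {strain: haplotype label of that strain within the block}
    """
    groups = defaultdict(list)
    for strain, y in trait.items():
        groups[haplotype[strain]].append(y)
    # Estimated trait value for each haplotype: the mean over its strains.
    means = {h: sum(ys) / len(ys) for h, ys in groups.items()}
    ssw = sum((y - means[haplotype[s]]) ** 2 for s, y in trait.items())
    return ssw, means


# Hypothetical trait values for six strains.
trait = {"s1": 1.0, "s2": 1.2, "s3": 0.8, "s4": 3.0, "s5": 3.2, "s6": 2.8}
# Two hypothetical blocks, each with k = 2 haplotypes.
block_a = {"s1": "H1", "s2": "H1", "s3": "H1", "s4": "H2", "s5": "H2", "s6": "H2"}
block_b = {"s1": "H1", "s2": "H2", "s3": "H1", "s4": "H2", "s5": "H1", "s6": "H2"}

n = len(trait)
ssw_a, means_a = within_group_ss(trait, block_a)
ssw_b, means_b = within_group_ss(trait, block_b)
print("block_a residual variance:", ssw_a / (n - 1))
print("block_b residual variance:", ssw_b / (n - 1))
```

Both hypothetical blocks have k = 2, so they can be compared directly: the better candidate is the block with the smaller within-group sum of squares.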
It is the most commonly used measure of clustering quality for a partition of the data set $Y$ into a fixed number of clusters. Let SST be the total sum of squares for the measured trait values. It is easy to see that SST for data set $Y$ is

$$SST = \sum_{j=1}^{n} (y_j - \bar{y})^2 = \underbrace{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2}_{SSW} + \underbrace{\sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2}_{SSB},$$

where $\bar{y}$ is the mean of all $n$ trait values, $\hat{y}_j$ is the mean trait value of the strains that share the haplotype of strain $j$, SSW is the within-group sum of squares, and SSB is the between-group sum of squares. Similarly, the total variance $\mathrm{var}(Y)$ consists of the genetic variance and the residual variance. The normalized within-group sum of squares, $SSW/SST$, can be interpreted as the proportion of the total variance that is not explained by the genotype of the gene in question.

The normalized within-group sum of squares provides an objective measure for comparing the genetic effect only among blocks with a fixed number of haplotypes. It is not fair to compare the residual variance for different $k$, because different numbers of parameters ($\beta_1, \beta_2, \ldots, \beta_k$, one trait value per haplotype) are fit. To compare the normalized within-group sum of squares appropriately for different $k$, it is necessary to use parametric statistics. We apply the analysis of variance (ANOVA) design to analyze the genetic effect. In (1), assume that the residual terms $\varepsilon_j$ are independent and normally distributed with mean zero and constant variance $\sigma^2$. For each haplotype block, the F statistic is calculated as

$$F = \frac{SSB / (k - 1)}{SSW / (n - k)} .$$

The F statistic can then be used to test the null hypothesis

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k .$$

For each block, a p value is calculated by comparing $F$ with the theoretical F distribution with $k - 1$ and $n - k$ degrees of freedom (Fig. 1). The correlation between strain groupings within haplotypic blocks and phenotypic trait values is assessed by this calculated p value. Note that the p value calculated from the F statistic is approximate. It cannot be interpreted as an exact estimate of the probability of a false-positive result. When the distribution of the residual terms $\varepsilon_j$ deviates from normality or the sample size is small, the p value is not accurate. Furthermore, it is not corrected for multiple comparisons.
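The ANOVA computation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `anova_f`, the strain labels, and the data are hypothetical, and the optional p value step assumes SciPy is installed (it uses `scipy.stats.f.sf`, the upper tail of the F distribution).

```python
from collections import defaultdict


def anova_f(trait, haplotype):
    """One-way ANOVA of trait values grouped by haplotype within one block.

    Returns (F, k - 1, n - k, SSW / SST).
    Assumes at least two haplotypes, n > k, and SSW > 0.
    """
    groups = defaultdict(list)
    for strain, y in trait.items():
        groups[haplotype[strain]].append(y)
    n, k = len(trait), len(groups)
    grand_mean = sum(trait.values()) / n
    sst = sum((y - grand_mean) ** 2 for y in trait.values())
    ssw = 0.0  # within-group sum of squares
    ssb = 0.0  # between-group sum of squares
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        ssw += sum((y - group_mean) ** 2 for y in ys)
        ssb += len(ys) * (group_mean - grand_mean) ** 2
    f_stat = (ssb / (k - 1)) / (ssw / (n - k))
    return f_stat, k - 1, n - k, ssw / sst


# Hypothetical trait values and two hypothetical blocks with different k.
trait = {"s1": 1.0, "s2": 1.2, "s3": 0.8, "s4": 3.0, "s5": 3.2, "s6": 2.8}
block_a = {"s1": "H1", "s2": "H1", "s3": "H1", "s4": "H2", "s5": "H2", "s6": "H2"}
block_c = {"s1": "H1", "s2": "H1", "s3": "H2", "s4": "H2", "s5": "H3", "s6": "H3"}

f_a, df1_a, df2_a, ratio_a = anova_f(trait, block_a)
f_c, df1_c, df2_c, ratio_c = anova_f(trait, block_c)

try:
    # Optional: approximate p values from the F distribution (needs SciPy).
    from scipy.stats import f as f_dist
    p_a = f_dist.sf(f_a, df1_a, df2_a)
    p_c = f_dist.sf(f_c, df1_c, df2_c)
except ImportError:
    p_a = p_c = None

print("block_a:", f_a, (df1_a, df2_a), ratio_a, p_a)
print("block_c:", f_c, (df1_c, df2_c), ratio_c, p_c)
```

Because the F statistic divides each sum of squares by its degrees of freedom, the two-haplotype and three-haplotype blocks can be ranked on a common scale. The caveats above still apply: the resulting p values are approximate and uncorrected for multiple comparisons.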
When allelic information is missing, the correct haplotype may not be available for all strains within a haplotypic block. For these blocks, the trait values can be compared with genotype for only a reduced set of strains. When key strains are missing, the p values obtained after computational mapping using a reduced set of strains may be much smaller than would be obtained if the allelic data for all the strains were used. This does not indicate that the block is better correlated with the trait data. To rank blocks more appropriately, an adjustment factor is applied to the p values obtained using blocks with missing haplotypes. For a block with $k$ haplotypes in which the haplotype is missing for some strains, let $p_{\min}$ be the minimum p value among all possible partitions of the $n$ strains into $k$ haplotypes, and let $p'_{\min}$ be the minimum p value among all possible partitions of the subset of strains with known haplotypes into $k$ haplotypes. The multiplicative factor $\min\{p_{\min}/p'_{\min},\, 1\}$ is applied to the p value score. This crudely defined factor ensures that the p value for a block