Detecting Allelic Effects
December 10, 2004
PI: Fernando Pardo Manuel, PhD (UNC/Genetics)
Programmer Supervisor: Patrick Sullivan, MD (UNC/Genetics)

Background. The locations of millions of single nucleotide polymorphisms (SNPs) are known to high precision. However, we very rarely know whether a particular SNP is functional: that is, whether it is “silent,” whether it changes the amount of messenger RNA or protein produced, or whether it alters the function of the protein. There is an urgent need for high-throughput methods to catalog SNP functional variation. Dr. Pardo Manuel has developed a method to investigate the functionality of SNPs within virtually any gene in the mouse genome. The pilot studies are promising and he wishes to scale the project upward. Doing so requires more sophisticated and integrated data management than is now available. Without improved data management, error and inefficiency could hinder the project.

Précis.
- Assume there is a mouse gene for which one wishes to catalog genetic variation. As an example, the mouse gene Il9r (interleukin 9 receptor) on chromosome 11 is shown above. The transcript is about 12,000 bases.
- The first step is to conduct DNA sequencing of a considerable portion of the gene, including all exons and many introns. Sequencing is done in tiles of 500-700 bases, strategically chosen across the gene and read in both directions (left-to-right and right-to-left). Existing software is used to design primers and to create the DNA sequence tiles.
- Sequencing is done in a sizeable panel of diverse inbred mouse strains (about 25). So, DNA sequence for Il9r has to be stored from 25 different mice.
- Next, the sequences have to be compared across strains to identify positions that vary in one or more strains and to classify the type of variant (e.g., SNP, insertion/deletion, microsatellite).
- Several standard indices need to be computed (e.g., nucleotide diversity).
- A schema to detect the consequences of a variant is then designed and carried out in multiple mouse tissues, initially at the mRNA level and later at the protein level.
- Every stage includes critical quality control steps.
- All data entries and changes to existing data have to be recorded.
- This process will be scaled up to many hundreds (perhaps thousands) of genes.

Need. An experienced database programmer is required to work under supervision to develop a relational database to record, track, and manipulate the data from this project.

A. Specific Aims

Several large-scale studies indicate that allelic variation in gene expression is common and may account for much of the phenotypic variation within and among species. These observations, together with the repeated finding in human studies that susceptibility alleles at candidate genes often lack changes in the coding sequence, suggest that allelic variation in gene expression may play a central role in the etiology of complex genetic traits, including common human diseases. Regulatory variation may be due to differences in trans-acting factors or cis-acting elements and may lead to differences in the level of gene expression. The challenges facing the identification of cis-regulatory variants include our limited capacity to identify regulatory elements and to evaluate the functional consequences of sequence variants based solely on sequence data. Functional annotation of the predicted 10^7 common variants present in the human genome (and similar numbers in model organisms) is an essential step to increase our understanding of basic biological processes and to identify the causative genetic variants responsible for common human diseases.

Genes that harbor regulatory variants in cis-acting elements can be identified by differential allelic expression in heterozygous carriers of sequence variants within the transcript.
This sensitive approach measures the ratio of the two alleles of a gene in the same cellular environment (thereby controlling for trans-acting factors and environmental variation) and may be used to identify genes harboring regulatory variants. Application of this and other methods should provide a long and interesting list of genes harboring cis-regulatory variants. However, these studies would fail to identify the causative variant in many, if not most, of the genes, because the presence of multiple genetic variants in complete linkage disequilibrium (LD) makes it very difficult to distinguish between causative and nearby neutral variation.

Wild-derived mouse inbred strains provide a unique opportunity to overcome this critical limitation and to increase the likelihood of detecting cis-regulatory variation in a systematic and genome-wide manner. The mouse is an exceptional model for three reasons: it harbors the highest level of genetic diversity described in a mammalian species, multiple sequence variants are found in the transcript of up to 95% of genes, and limited LD is found between nearby variants. These three characteristics stem from the phylogenetic history and short generation time of this rodent, for which available inbred strains have captured a large fraction of the genetic variation. Our analyses, based on >50,000 genotypes obtained by sequencing ≈2000 new genetic variants in a panel of 25 inbred strains, indicate that the level and distribution of genetic variants and LD observed among inbred strains vary widely (by one or two orders of magnitude) depending on the strains considered. Therefore, it is possible to identify a set of strains with very high levels of diversity (>25 variants per kb) that can be used to generate a panel of F1 mice in which the majority of genes can be screened for cis-regulatory variation using the differential allelic expression method.
Then, the association between the allelic ratio and the patterns of alleles observed at different variants among the strains can be used to discriminate between causative and neutral variation. In our preliminary work we reduced the number of candidate variants in the Il9r gene from several hundred to, on average, 2.5 variants located less than 500 bp apart. We hypothesize that mapping resolution may be increased further by selecting the optimal strains.

To establish proof of principle for a scalable method to identify causative allelic variants influencing gene expression, we propose the following Specific Aims:

1. Identify an optimal panel of inbred strains for identification of cis-regulatory variants in a genome-wide manner. To accomplish this aim we propose to:
   1.1. Estimate the mapping resolution in our panel of 25 strains for a previously described, but not yet identified, regulatory variant responsible for differential allelic expression of the Il9r gene.
   1.2. Estimate the level of genetic diversity, the fraction of informative genes, and the level and extent of LD in our initial panel of inbred strains.
   1.3. Define the optimal panel of strains for high-resolution mapping of cis-regulatory variants.
2. Provide proof of principle for this approach by the identification and validation of the causal variants for multiple genes (e.g., Il9r, Comt, Ccnf, Uros). To accomplish this aim we propose to:
   2.1. Generate the mapping panel(s). Establish priority criteria for high-resolution mapping.
   2.2. Statistical methods.
   2.3. Comprehensive analysis of cis-regulatory variation in the Il9r gene, including identification of the genetic variant(s) responsible for the differential allelic expression in the spleen.
   2.4. High-resolution mapping of cis-regulatory variants in high-priority genes.
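
The idea stated above, comparing the allelic ratio seen in each F1 with the pattern of alleles at each variant among the parental strains, can be illustrated with a small sketch. It rests on strong simplifying assumptions that are not part of the proposal: a single biallelic cis-acting causal variant, and an error-free call of "imbalanced" or "balanced" expression for each F1. It is not the statistical method of Aim 2.2. The strain names, variant names, genotypes, and the function name `consistent_variants` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: the allele carried by each inbred strain
# (strains are assumed fully homozygous) at each candidate variant.
GENOTYPES: Dict[str, Dict[str, str]] = {
    "v1": {"A": "G", "B": "G", "C": "T", "D": "T"},
    "v2": {"A": "C", "B": "T", "C": "T", "D": "T"},
    "v3": {"A": "A", "B": "A", "C": "A", "D": "G"},
}

# Hypothetical F1 results: (parent strain 1, parent strain 2,
# True if differential allelic expression was observed in that F1).
F1_RESULTS: List[Tuple[str, str, bool]] = [
    ("A", "C", True),
    ("A", "B", False),
    ("B", "D", True),
    ("C", "D", False),
]


def consistent_variants(
    genotypes: Dict[str, Dict[str, str]],
    f1_results: List[Tuple[str, str, bool]],
) -> List[str]:
    """Return the variants whose strain pattern matches the F1 results.

    Assumption: a causal cis-acting variant is heterozygous in every F1
    that shows allelic imbalance and homozygous in every F1 that does not.
    """
    kept = []
    for variant, alleles in genotypes.items():
        matches = all(
            (alleles[s1] != alleles[s2]) == imbalanced
            for s1, s2, imbalanced in f1_results
        )
        if matches:
            kept.append(variant)
    return kept


if __name__ == "__main__":
    print(consistent_variants(GENOTYPES, F1_RESULTS))
```

In this toy panel only `v1` survives: `v2` is heterozygous in an F1 with balanced expression, and `v3` is homozygous in an F1 with imbalanced expression. Real data would need a tolerance for measurement error, which this filter does not model.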
Successful completion of these Specific Aims will simplify, accelerate, and reduce the cost of sorting through hundreds or thousands of neutral polymorphisms to identify causative cis-regulatory variants. The method proposed here may be applied to any autosomal gene subject to regulatory variation and allows prioritization of genes based on the probability of success. It will complement ongoing efforts to estimate the contribution of trans and cis regulation to phenotypic variation. Because the mapping method is independent of any prior knowledge of functional elements, it is likely to identify novel regulatory sequences. The data obtained may be integrated with predicted regulatory elements identified in comparative genomic analyses. Furthermore, the variants uncovered in this study may be used to generate highly informative and robust microarray-based assays to test allele-specific gene expression.
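
As a small illustration of two data-handling steps named in the Précis and in Aim 1.2 (finding positions that vary among strains, and computing nucleotide diversity), the sketch below works on toy sequences. It assumes the sequences are already aligned, of equal length, and one per inbred strain; gaps and ambiguous bases are not treated specially. Nucleotide diversity is computed here as the average number of pairwise differences per site. The function names and example sequences are hypothetical and do not describe the project's existing software.

```python
from itertools import combinations
from typing import List


def variable_sites(seqs: List[str]) -> List[int]:
    """Return 0-based positions where at least two sequences differ."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


def nucleotide_diversity(seqs: List[str]) -> float:
    """Average number of pairwise differences per site.

    Assumes equal-length, pre-aligned sequences; every character,
    including any gap symbol, is compared as an ordinary base.
    """
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")

    total_differences = 0
    for a, b in combinations(seqs, 2):
        total_differences += sum(1 for x, y in zip(a, b) if x != y)

    n_pairs = n * (n - 1) // 2
    return total_differences / (n_pairs * length)


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "ACTA"]  # hypothetical strain sequences
    print(variable_sites(toy))
    print(nucleotide_diversity(toy))
```

For the three toy sequences the pairwise differences are 1, 2, and 1 over 4 sites, so the diversity is 4 / (3 × 4) = 1/3, and the variable positions are the last two.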