Specific Aims

Studies by our group and others have established that the etiology of systemic lupus erythematosus (SLE) is not dominated by a few major genes; rather, numerous genes contribute to the pathogenesis of the disease, each with modest effect. Using genome-wide association studies (GWAS) in combination with a Bayesian gene selection tool developed to increase the power of genetic association studies, we have identified, replicated, and confirmed 9 new genes that are associated with SLE and belong to two major biological pathways, apoptosis and the NFkB inflammatory pathway. We argue that further fine mapping with denser SNP coverage of the candidate genes is unlikely to add much information to the established associations because of the strong linkage disequilibrium (LD) expected between the new and the previously used SNPs. Instead, we hypothesize that resequencing these candidate genes has a much higher chance of identifying the causal variant(s). Furthermore, we hypothesize that sequencing sufficient numbers of affected subjects (preferentially selecting subjects who carry disease-associated SNPs within these genes) will identify novel variants, including the causal variant(s). Therefore, the goal of this application is to identify the causal variants responsible for the associations by resequencing these candidate genes and re-evaluating the newly discovered variants in our extensive and well-characterized SLE population. The specific aims are as follows:

Specific Aim 1: Identification of causal genetic variants within the 9 candidate genes (already known from GWAS findings to be strongly associated with SLE) by the application of targeted DNA capture and massively parallel sequencing technologies.

Specific Aim 2: Selective genotyping of DNA samples from large well-phenotyped populations to validate the newly discovered genetic variants.
For the purpose of these studies we have already assembled an unparalleled population of 6,500 adult-onset and 1,000 childhood-onset cases of SLE and a large multidisciplinary collaborative team of investigators. This sequential strategy is designed to characterize a complete set of causal variants contributing to disease heritability and etiology. This approach will move the field much faster towards realistically being able to propose and implement new targets for early diagnostics and therapy.

2. Background

Studies by our group and many of our collaborators have established that the etiology of autoimmune diseases in general, and of the prototypic multi-organ systemic disease SLE in particular, is not dominated by a few major genes. Instead, numerous genes contribute to the pathogenesis of the disease, each with modest effect. Using genome-wide association studies (GWAS) in combination with a novel Bayesian gene selection tool developed to increase the power of genetic association studies, we have identified, replicated, and confirmed that a large number of SNPs within 9 previously unassociated genes are associated with SLE. These genes belong to two major biological pathways, apoptosis and the NFkB inflammatory pathway. The majority of these SNPs are located in intronic sequences, and thus are highly unlikely to be the causal variants responsible for the disease phenotype. Even where the variants are not intronic, it remains unclear whether these polymorphisms represent the causal alleles or are simply in linkage disequilibrium (LD) with the real causal variant(s). Until recently, the only method to further advance the search for the causal variants was to conduct fine mapping studies, narrowing the region of significant association by using additional SNPs from databases in a focused association study.
However, in many cases fine mapping within identified haplotype blocks is unlikely to add much information because of the strong LD between the additional SNPs. Therefore, in most cases this approach will not lead directly to the identification of the causal variant, as the causal variant may not yet be a known SNP (it would be absent from the databases, which were built by sequencing subjects unlikely to be afflicted with SLE). Without a systematic search for additional polymorphisms within a susceptibility gene, it is premature to suggest that a given variant is causal, even when functional consequences can be attributed to the variant. With the ongoing development of next generation sequencing technologies, however, it is now possible to evaluate a large number of the polymorphisms within a gene in a timely and cost-effective manner, as we propose here. Moreover, the technologies we will utilize make it very cost-efficient to investigate multiple genes and subjects simultaneously. We argue that identifying the actual disease-causing genetic variants in SLE requires large scale DNA sequencing of a significant number of individuals from well-phenotyped populations. This needs to be followed by selective genotyping of DNA samples from large well-phenotyped populations to validate the newly discovered genetic variants and determine the causal variants. For the purpose of these studies we have already assembled an unparalleled population of 6,500 adult-onset and 1,000 childhood-onset cases of SLE and a large multidisciplinary collaborative team of investigators.

3. Preliminary Studies:

The PI, Dr. Jacob, is a founding member of SLEGEN, the International Consortium for Genetics of Systemic Lupus Erythematosus. A SLEGEN GWAS published in 2008 provided results from the Illumina 317K platform in European American women with lupus (1). In addition, Dr.
Jacob’s research team developed a gene selection tool to increase the power of genetic association studies (2), and devised a microarray platform to discover susceptibility genes contributing to autoimmunity. The candidate pathway genotyping platform was applied to a sample of 753 subjects corresponding to 251 childhood-onset SLE trios (patients and both parents). Childhood-onset SLE presents a unique subgroup of patients for a genetic study because an earlier disease onset, a more severe disease course, and a greater frequency of family history of SLE imply an increased likelihood of expressing the genetic etiology. Most previous genetic studies used adult-onset disease. The results of our TDT study of ~1,200 genes and ~10,000 SNPs were published in December 2007 (3). We have since corroborated, replicated and extended the results in a case-control approach using a second cohort of ~800 childhood-onset and a very large cohort of ~5,500 adult-onset SLE cases with a much higher SNP density. In summary, our follow up and replication studies included 11,368 participants (6,066 independent SLE cases and 5,302 healthy subjects matched for sex, age and ethnicity). Table 1 depicts the demographics of the study population in the replication studies.

Ethnicity | Adult | Childhood | Control
AA | 989 | 147 | 1053
EA | 2912 | 216 | 3114
HA | 603 | 247 | 266
AsA | 793 | 159 | 869
Total | 5297 | 769 | 5302

Table 1: Demographic distribution of the replication studies. Abbreviations: AA, African American; EA, European American; HA, Hispanic American; AsA, Asian American.

A portion of the results obtained from these follow up studies was published in three manuscripts during 2009 (4-6). Based on these studies we propose 9 new genes that have been convincingly detected and replicated, with overall false discovery rate (FDR) q values < 0.05 from all independent cases and controls after correction for multiple testing and principal component analyses (PCA) for differences in population stratification between cases and controls. Table 2 shows examples of the most significantly associated SNPs within these 9 genes in childhood- and adult-onset SLE cases.

Gene | Adult FDR | Adult SNP | Childhood FDR | Childhood SNP
SELP | 5.24E-02 | rs6125 | 8.75E-06 | rs6125
TNFSF4 | 5.04E-11 | rs1234315 | 1.64E-05 | rs1234315
STAT4 | 2.46E-24 | rs7582684 | 1.01E-08 | rs10181656
FAIM | 2.70E-24 | rs13095734 | 1.91E-27 | rs13095734
NCF2 | 1.02E-22 | rs17849502 | 3.73E-25 | rs17849502
KLRG1 | 1.03E-02 | rs1121401 | 2.32E-03 | rs1121401
TLR8 | 2.02E-04 | rs17256081 | 7.86E-04 | rs17256081
IRAK1 | 4.76E-09 | rs763737 | 1.27E-03 | rs763737
IL16 | 4.38E-02 | rs7170924 | 1.04E-03 | rs7170924

Table 2. Most significant SNPs in adult- and childhood-onset SLE. The FDR values shown correspond to an estimation of the False Discovery Rate calculated using the Storey q value procedure from the p values of four ethnicities combined using the Fisher method, thus correcting for multiple testing effects. Populations were corrected for stratification using PCA prior to analysis.

Table 4 below depicts the disease-relevant functions of each of these genes, which make them very attractive candidates, since they belong to biological pathways likely to be involved in the pathogenesis of SLE. We would like to emphasize that even where significant SNPs are not within introns, it is unlikely that we have identified the causal variant. For example, Fig 1 presents the distribution of significant SNPs in the STAT4 gene. Significant SNPs are widely spread throughout the gene, and we do not see a single major SNP that is massively significant and present in all of the ethnicities that show association at a particular gene. Therefore, we believe that only parallel sequencing in regions surrounding these nine genes will ensure the identification of causal variants.
The decision to go directly to sequencing in these areas, rather than first further narrowing the area of significant association via fine mapping, is based on the practical argument that sequencing technology is now advanced enough, and cost-effective enough, to sequence DNA fragments that encompass the relevant neighboring genes. Furthermore, even if the association could be further narrowed by fine mapping, we would still need to resequence: because of the strong LD among the additional SNPs, SNPs drawn from databases are unlikely to add much to the significance unless the causal SNP happens to be among them. Our approach is to use 100 subjects from each of two ethnicities, Hispanic American and European American (for a total of 200 subjects). According to the power calculation shown below, this number of subjects is sufficient to provide 87% power to identify alleles that are present in 1% of the patient population. We are using two ethnicities because some disease-associated variants are more commonly found in one ethnicity and not the other. This problem is exemplified in Table 3. In order to locate the causal variant in linkage disequilibrium with the disease-associated variant, we need to sequence the regions containing the disease-associated variant, which is significantly associated only in specific ethnicities. For example, in the case of rs17256081 (TLR8), only Hispanic Americans show significant disease association with this allele, whereas in the case of rs17849502 (NCF2), only European Americans show significant disease association with this allele (Table 3). For reasons of cost-efficiency, coupled with the power necessary for a high likelihood of detecting causal variants, we do not propose at this time to sequence the two other ethnicities represented in our cohorts and used in our study. Doing so would require us to sequence more than 400 samples to attain equivalent power in those populations, at a prohibitive increase in cost.
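The power figures quoted for the proposed sample sizes can be checked with a few lines of arithmetic. The sketch below assumes the detection-probability formula given in the sample size considerations for Aim 1, Power = 1 - (1 - f)^(2n); the function name `detection_power` is illustrative only and is not taken from the cited reference.

```python
def detection_power(maf: float, n_subjects: int) -> float:
    """Probability of seeing at least one copy of an allele with minor
    allele frequency `maf` among the 2 * n_subjects chromosomes sequenced."""
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must lie between 0 and 1")
    if n_subjects < 0:
        raise ValueError("n_subjects must be non-negative")
    chromosomes = 2 * n_subjects  # diploid subjects
    return 1.0 - (1.0 - maf) ** chromosomes


# Tabulate the scenarios discussed in the proposal.
for n in (50, 100):
    for f in (0.01, 0.05):
        print(f"n={n:3d} subjects, MAF={f:.2f}: power={detection_power(f, n):.4f}")
```

With 100 subjects per ethnicity the calculation gives about 87% power for a 1% allele, matching the figure stated in the text; halving the sample to 50 subjects lowers this to about 63%.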
4. Experimental Design and Methods:

Overall Rationale: Recent genome-wide association and candidate pathway studies, followed by replication studies, clearly establish that the following 9 genes are significantly associated with SLE in multiple populations: SELP, TNFSF4, STAT4, FAIM, NCF2, KLRG1, TLR8, IRAK1, and IL16.

Specific Aim 1: Identification of causal genetic variants within 9 genes strongly associated with SLE by the application of targeted DNA capture and massively parallel sequencing technologies. The genes targeted are: SELP, TNFSF4, STAT4, FAIM, NCF2, KLRG1, TLR8, IRAK1, and IL16.

Sample size considerations for sequencing: Following Glatt et al. (7), we use the following equation for sample size estimation:

Power = 1 - (1 - f)^(2n)

where f is the minor allele frequency (MAF), 2n is the number of chromosomes, and Power is the probability of finding at least one copy of an allele with MAF = f. Based on this equation, if we sequence 50 subjects (namely 100 chromosomes) there should be 99% power to identify an allele with 5% allele frequency, and 63% for an allele with 1% frequency. Sequencing 100 subjects would give us 87% power to identify alleles that are present in 1% of the patient population and 99.9% power for alleles present at 5%. As is evident from our preliminary studies (5, 6), we appear to be dealing with allele frequencies higher than those used for the power calculation. Hence, sample size should not be a limiting factor.

Gene | SNP | Ethnicity | Adult FDR | Childhood FDR
TLR8 | rs17256081 | EA | 9.86E-2 | 2.78E-1
TLR8 | rs17256081 | HA | 5.07E-4 | 1.39E-3
NCF2 | rs17849502 | EA | 5.30E-24 | 3.44E-25
NCF2 | rs17849502 | HA | 1.17E-1 | 6.40E-2
NCF2 | rs10911363 | EA | 2.98E-5 | 9.81E-4
NCF2 | rs10911363 | HA | 1.04E-1 | 2.45E-2
NCF2 | rs34037871 | EA | 1.44E-2 | 2.92E-2
NCF2 | rs34037871 | HA | 4.58E-1 | 2.51E-1

Table 3. Different SNPs are significant in different ethnicities. FDR is the estimate of the False Discovery Rate as calculated by the Benjamini and Hochberg procedure (15). EA are European Americans, HA are Hispanic Americans.

Regions to be sequenced around our 9 genes are depicted in Table 4. For example, for IRAK1, the region from approximately base 152,881,000 to 153,027,000 of chromosome X (NCBI reference assembly build 36 version 3) will be sequenced. This region encompasses the neighboring genes of IRAK1, including MECP2 (which others have independently suggested as a candidate SLE gene and which probably reflects the same genetic association). We deliberately overestimate the region likely to contain a causative allele in LD with our associated SNPs: with high-throughput sequencing, the cost difference between a precisely defined region and an overestimated one is not worth the risk of missing a causative allele located outside the region of currently known SNPs in LD with the associated SNPs. Furthermore, the available information regarding LD in these regions is not necessarily accurate and is primarily available in Caucasians, and how well LD patterns extrapolate from Caucasians to the Hispanic population, which we also plan to sequence, is unknown. In order to optimize the probability of finding the causal mutations, we will preferentially select subjects with disease-associated SNPs within these genes, and specifically those who developed SLE in childhood, because childhood-onset SLE has stronger penetrance of the disease alleles. For the controls, we will utilize known sequences of these regions, including data from HapMap, Celera, JCVI, and NCBI. As starting material, at least 25 micrograms of genomic DNA prepared from peripheral blood is currently available from each subject for the purpose of these studies. The major bottleneck in next generation sequencing approaches is enriching the target DNA prior to sequencing (8). Until very recently the most common approach for target DNA isolation included short and long PCR.
As the size and number of regions of interest increase, other approaches have been developed that rely on parallel capture and enrichment (8, 9). These methods have significant advantages over PCR, including use of less input DNA per region, parallel capture of a large number of regions, less need for optimization, and quicker isolation of captured DNA. Accordingly, we will use such a method, available from NimbleGen, to capture targeted genomic regions for sequencing.

Gene | Chromosome Region | Start (kb) | Stop (kb) | Significant SNPs | Function
SELP | 1q24 | 167824 | 167877 | 6 | Adhesion receptor for neutrophils, monocytes and T-cells, responsible for the migration of these cells for the initiation and perpetuation of the NFkB inflammatory response.
TNFSF4 | 1q25 | 171419 | 171474 | 3 | Interaction between this gene and its receptor is involved in co-stimulation of T and B lymphocytes and in the adhesion of T cells to endothelial cells for the induction of the NFkB inflammatory pathway.
STAT4 | 2q32.2-3 | 191602 | 191725 | 30 | Signal Transducer and Activator of Transcription essential for Th1 and interferon activation, involved in essential cellular events of differentiation, proliferation and apoptosis following cytokine and growth factor signaling.
FAIM | 3q22 | 139790 | 139835 | 2 | FAS apoptosis inhibitor molecule; functions as an anti-apoptotic molecule and also has NFkB activating functions.
NCF2 | 1q25 | 181698 | 181837 | 13 | Essential component of the NADPH oxidase enzyme complex in phagocytic leukocytes. It is important for host innate immunity.
KLRG1 | 12p12-13 | 8993 | 9055 | 7 | Lectin-like receptor involved in differentiation, proliferation, and apoptosis of a T cell subset, including NK cells.
TLR7/TLR8 | Xp22 | 12785 | 12852 | 2 | A member of the Toll-like receptor (TLR) family which is intimately involved in the activation of NFkB.
IRAK1 | Xq28 | 152881 | 153027 | 6 | Serine/threonine protein kinase involved in the signalling cascade of the TOLL/IL-1 receptor family. Considered the ON/OFF switch of the receptor complex to the activator adaptor protein TRAF6 which is responsible for the activation of several inflammatory
IL16 | 15q26.3 | 79266 | 79403 | 3 | A pleiotropic cytokine that functions as a chemoattractant for immune cells and a modulator of T cell activation.

Table 4. Regions of nine genes proposed for sequencing in 100 individuals with SLE in each of two ethnicities (Hispanic American and European American) utilizing targeted DNA capture and massively parallel sequencing technologies. The Start and Stop columns give the region for sequencing. Significant SNPs are those which have an estimated False Discovery Rate less than 0.05 (after multi-test correction) in either adult- or childhood-onset populations. Regions are indicated in kilobases using the NCBI reference assembly build 36 version 3. Functions of the genes are summarized for convenience.

We will take advantage of the newly developed indexing system, which involves the addition of a specific “indexing” sequence to the end of the primer sequence. The sequence of this “indexing” segment will be determined at the initiation of each sequencing reaction, thus allowing the subsequently determined sequence to be assigned to a specific sample, even when fragments derived from several different “indexed” samples are being analyzed in the same sequencing run. For our specific application, assuming that we intend to capture about 0.81 Megabases of genomic DNA containing our candidate genes (Table 4), it is reasonable to predict that 10 samples could be individually fragmented, ligated with indexed sequencing primers, pooled and hybridized to a single custom NimbleGen genomic sequence capture chip, and that the eluted fragments could be sequenced as a single sample. Assuming a conservative estimate of ~50-fold sequence coverage of our selected genes, we should expect roughly 40.9 Megabases of sequence data for each sample, or 8.18 Gigabases across all the samples, for the identification of allelic sequence variations.
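The throughput estimate above is simple arithmetic and can be sanity-checked as follows. This sketch takes the region boundaries from Table 4 together with the stated ~50-fold coverage, 10 indexed samples per capture chip, and 200 subjects; the variable and function names are illustrative assumptions, and the pool count is derived here rather than stated in the text. Summing the Table 4 spans gives a target slightly above 0.8 Megabases, so the results land close to, but not exactly on, the 40.9 Megabase and 8.18 Gigabase figures quoted.

```python
import math

# Region boundaries in kilobases, copied from Table 4 (NCBI build 36.3).
REGIONS_KB = {
    "SELP": (167824, 167877),
    "TNFSF4": (171419, 171474),
    "STAT4": (191602, 191725),
    "FAIM": (139790, 139835),
    "NCF2": (181698, 181837),
    "KLRG1": (8993, 9055),
    "TLR7/TLR8": (12785, 12852),
    "IRAK1": (152881, 153027),
    "IL16": (79266, 79403),
}

COVERAGE = 50          # fold coverage assumed in the text
N_SUBJECTS = 200       # 100 subjects in each of two ethnicities
SAMPLES_PER_POOL = 10  # indexed samples hybridized to one capture chip


def target_size_mb(regions_kb):
    """Total captured target in megabases."""
    return sum(stop - start for start, stop in regions_kb.values()) / 1000.0


target_mb = target_size_mb(REGIONS_KB)
per_sample_mb = target_mb * COVERAGE                # sequence needed per sample
total_gb = per_sample_mb * N_SUBJECTS / 1000.0      # sequence needed overall
n_pools = math.ceil(N_SUBJECTS / SAMPLES_PER_POOL)  # capture chips / pooled runs

print(f"target: {target_mb:.3f} Mb")
print(f"per sample at {COVERAGE}x: {per_sample_mb:.2f} Mb")
print(f"all {N_SUBJECTS} samples: {total_gb:.2f} Gb in {n_pools} pools")
```

These are idealized lower bounds: the off-target reads and uneven capture efficiency discussed next would raise the amount of raw sequence actually required.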
Obviously, the efficiency of this system and the amount of raw sequence information required for a thorough analysis of both alleles in each sample from a pool of ten will be affected by several technical factors, including: 1) the proportion of sequence information obtained for each sample in the pool; 2) the relative efficiency of genome sequence capture for various segments in our selected genes during the hybridization to the NimbleGen chip; and 3) the complexity of the genetic polymorphisms revealed. Nevertheless, given the relative maturity of the technology, we expect no major problems in achieving our goal. Using the novel sequencing methodology and a large number of available subjects, we expect to find novel SNPs not yet present in publicly available SNP databases. To demonstrate the feasibility of this approach, we performed a single-sample experiment (on a Hispanic American childhood-onset SLE subject) using a NimbleGen custom array to capture targeted genomic regions, followed by sequencing. Roughly 1.2 million sequencing reads were obtained, of which 94% mapped uniquely to the human genome reference. 99% of the bases mapped to the genome, with 52% of the reads falling within the targeted region, corresponding to a significant enrichment over the result for shotgun sequencing of the full genome. 75% of the targeted bases exhibited >100-fold coverage. This sequencing run discovered approximately 18 as-yet undescribed SNP variants in the region sequenced. Beyond demonstrating feasibility, this preliminary study allows our support personnel to assess the technology.

Figure 1. -log10 of STAT4 p values versus the position on chromosome 2 in adult- and childhood-onset lupus, separated by ethnicity. Exons of STAT4 are shown below in green with exon numbers above.

Specific Aim 2: Selective genotyping of DNA samples from large well-phenotyped populations to validate the newly discovered genetic variants.
Rationale: Based on our resequencing experience, as well as that of many other investigators, it is very unlikely that no new SNPs will be identified. A more likely scenario is that many new SNPs will be identified, and the question will be which of them are relevant to SLE. Re-evaluation of the new SNPs is therefore an accepted and logical phase, unless an absolutely obvious causal SNP emerges in the previous step. In this aim we will re-evaluate the new SNPs identified through parallel sequencing in case-control association studies encompassing all 4 ethnicities, utilizing the entire population of adult- and childhood-onset SLE samples as well as an extensive collection of unrelated controls. It is important to emphasize that in deciding which SNPs to genotype, we will evaluate the regions sequenced to locate all new SNPs (that are not sequencing artifacts), prioritizing those new SNPs that have a variant found more commonly among the SLE subjects. In addition, we will genotype previously known SNPs (present in the databases) for which a single variant is found preferentially in the case populations sequenced and which have not previously been genotyped in our association studies.

Study population: Through elaborate collaborative agreements (successfully employed in the preliminary study and in additional collaborative undertakings), we will have available for the studies proposed here at least 6,500 adult-onset SLE cases (at least 1,000 SLE subjects from each of the 4 ethnic/racial populations) and over 1,100 childhood-onset cases [estimated as 300 subjects from each of the European American (EA), Hispanic American (HA), and Asian American (AsA) populations and 200 African American (AA) subjects]. In addition we will have approximately 6,450 controls (at least 1,000 controls for each of EA, AA, and AsA, and approximately 700 HA controls). Controls will be matched for gender, ethnicity and age in adult-onset cases.
For childhood SLE cases, the controls to be used will be matched for gender, ethnicity and geographic location, but will not be matched for age. In fact, all controls will be adults. We believe that adults who have no SLE or any other rheumatic disease are better controls than age-matched children who may develop disease later in life. Specifically, we have available the cohorts used for the replication studies (Table 1 above, and letter of collaboration from Dr. John Harley in the name of various OMRF collaborators). We now have available from UCLA an additional ~1,000 adult-onset SLE subjects and 816 controls which have not been used in any of the preliminary studies (letters of collaboration from Dr. Betty Tsao attached). The childhood-onset SLE resource, developed by the PI via a recruitment network for pediatric SLE patients, includes the Children’s Hospital of Los Angeles (CHLA), the Children’s Hospital of Orange County, CA (CHOC), the Children’s Memorial Medical Center in Chicago (CMMC), Texas Children’s Hospital of Houston (TCH) and the Hospital for Sick Children (HSC) in Toronto. Letters from participating pediatric rheumatologists are attached. All protocols were approved by the Institutional Review Boards at each respective institution. All patients met the revised 1997 ACR criteria for the classification of SLE. All procedures, methodologies and data collection were standardized across all participating sites. Self-reported ethnicity was verified by parental and grandparental ethnicity, when possible. Statistical analyses will be performed by Dr. Zidovetzki and Dr. Armstrong, both of whom have collaborated with the PI on this and related projects for a number of years. Dr. Langefeld will serve as consultant (letter of collaboration attached). Both publicly available software (e.g., PLINK, Haploview 4.0) and software specifically developed by Drs. Zidovetzki and Armstrong will be utilized.
Analyses will be done as described in detail in our recent studies (4-6).

Potential pitfalls

Confounding by Population Stratification: Genetic association studies are potentially susceptible to confounding by population stratification. In our studies, population heterogeneity may be most problematic in the Hispanic population, which may have Mestizo, European and African ancestry. However, our collection comes from California and Texas and consists of Mexican Americans (MA), which eliminates the additional complexity present when samples include Cubans, Puerto Ricans or different South American Hispanic populations. Similarly, the AA population is also an admixed population. Although population stratification has been the subject of vigorous debate in the epidemiologic literature, it is now generally accepted that potential confounding may be adequately controlled using a set of unlinked genetic markers to estimate the underlying population structure. The effectiveness of this approach depends on the ability of the set of SNPs to capture the underlying genetic substructure via the coefficient of ancestry (10). To ensure successful implementation of this method of control, we have identified 233 SNPs specifically selected to differentiate parental populations (African, European, American Indian, Mexican, and Asian) and to make precise ancestry estimates (11). These SNPs have previously been genotyped in a multiethnic sample. In our studies, we will use the set of 233 SNPs to estimate a coefficient of ancestry (12) and to identify key principal components (13) for each individual. These variables will then be used as covariates in the association model to ensure unbiased effect estimates and valid tests of association.
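As an illustration of how principal components can summarize ancestry from a marker panel, the following is a minimal sketch and not the pipeline used in our studies. It assumes genotypes coded as 0/1/2 allele counts, uses simulated data for two hypothetical populations (233 simulated markers, matching only the size of the panel described above), and extracts the first principal component by power iteration; all function names are hypothetical.

```python
import random


def standardize(genotypes):
    """Center and scale each SNP column; monomorphic SNPs are dropped."""
    n = len(genotypes)
    columns = []
    for col in zip(*genotypes):
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        if var > 0:
            sd = var ** 0.5
            columns.append([(x - mean) / sd for x in col])
    return [list(row) for row in zip(*columns)]


def first_pc_scores(genotypes, iterations=50, seed=0):
    """First principal-component score per subject, via power iteration
    on X^T X where X is the standardized subjects-by-SNPs matrix."""
    x = standardize(genotypes)
    m = len(x[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]  # random start vector
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]            # X v
        v = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(m)]  # X^T (X v)
        norm = sum(c * c for c in v) ** 0.5
        v = [c / norm for c in v]
    return [sum(a * b for a, b in zip(row, v)) for row in x]


def simulate(n_subjects, freqs, rng):
    """Simulated 0/1/2 genotypes: two independent allele draws per SNP."""
    return [[sum(rng.random() < f for _ in range(2)) for f in freqs]
            for _ in range(n_subjects)]


rng = random.Random(42)
n_snps = 233  # simulated markers, not the real AIM panel
freqs_a = [rng.uniform(0.05, 0.35) for _ in range(n_snps)]
freqs_b = [rng.uniform(0.65, 0.95) for _ in range(n_snps)]
genotypes = simulate(30, freqs_a, rng) + simulate(30, freqs_b, rng)

pc1 = first_pc_scores(genotypes)
group_a, group_b = pc1[:30], pc1[30:]
print(min(group_a), max(group_a), min(group_b), max(group_b))
```

In this simulated setting the two populations fall on opposite sides of the first component, which is the property that makes such scores usable as covariates. Real admixed samples would form a continuum rather than two clusters and would normally call for several components.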
Other potential confounding: As for potential concerns regarding heterogeneity of the study population (in diagnosis and classification) and the multi-site nature of the resources, we would like to emphasize that standardized procedures were followed by the investigators at all clinical sites to ensure that cases can be pooled between centers without the introduction of diagnostic heterogeneity. These included diagnostic, disease progression, and damage index forms filled out for every recruited SLE patient. The same ACR criteria for SLE have been used and rigorously applied in all cases. Diagnosis was confirmed in each case by at least one rheumatologist. Subjects used in our studies are recruited from several medical centers throughout the US. We acknowledge that different geographic sites of recruitment can introduce various environmental factors that may interact with genetic risk factors. We would argue, however, that in our specific case these factors may not play an important role. The environmental factors recognized in SLE, namely exposure to sunlight and to EBV, are ubiquitous. The successful identification and replication of several SLE susceptibility genes (for example IRF5) using cohorts from several continents supports our argument for a multi-center recruitment strategy.

Sample size: We acknowledge that our childhood-onset SLE cohorts, when subdivided into the various ethnicities, are of relatively moderate size. However, based on our experience in the preliminary studies, we would argue that subjects who develop disease in childhood should be enriched for genetic effects: first, because early disease onset may be an indication of increased genetic predisposition and penetrance, and second, because sex hormonal influences are less likely to play a significant role in the onset of disease in this age group.
Finally, childhood-onset SLE is often a more severe disease than adult-onset SLE and has been shown to have a more aggressive clinical course.

Multiple testing: The standards of statistical proof that are commonly used in the biomedical literature have been questioned when applied to large SNP-based genetic association studies. The problem of multiple testing pervades the discipline without a clear consensus about how it should be solved (14). The classical Bonferroni correction is overly strict for genetic studies because it treats each test as independent, whereas in actuality a complex and unknown mutual dependence is present among genes, and even more prominently among SNPs of the same gene. The false discovery rate (FDR) (15) approach is currently widely used in genetic microarray and association studies. We adapted a variation of FDR (16) for the multitest correction in our case, with q<0.05 (corresponding to an expected proportion of false positives among the reported associations of less than 5%) considered to be significant.

Once all (or at least most) polymorphisms have been identified within the genes of interest and re-evaluated in relevant patient populations, systematic functional studies will be the most appropriate route to discovering the mechanisms through which the genetic variants are involved in the causation or perpetuation of SLE. Thus, the completion of this work will significantly advance our understanding of the genetic foundation of SLE (and autoimmunity in general), and will provide a foundation for the development of diagnostics and therapy for this devastating disease.

Concluding remarks: We would like to emphasize some of the unique strengths of this proposal to significantly impact the field:
1) Solid and extensive preliminary work on the genetics of SLE, and integration of the previously separate approaches in childhood-onset and adult-onset SLE into one combined effort, which is a natural extension of our ongoing work on human SLE.
2) A unique feature of our application is unparalleled access to a very large SLE study population of over 6,500 adult-onset and 1,000 childhood-onset SLE cases and similar numbers of matched controls. Furthermore, our study sample includes a large population of ethnic minorities, which enables us to identify and characterize important variants predisposing to SLE that are common to multiple ethnic groups.
3) A great strength of the application is the integration of the talent of a large multidisciplinary team of clinical and basic research scientists from multiple institutions, each of whom brings a great depth of experience to the application.
4) Access to state-of-the-art technology in gene mapping, next generation sequencing and bioinformatics that is constantly being practiced and improved upon by our team of experts.

5. Timelines and Milestones

1. Sequence Samples (Months 1-12)
   a. Identify most appropriate subjects to sequence. (Subjects that are enriched for the SNP variant associated with SLE with enough DNA for sequencing.)
   b. Design NimbleGen capture array
   c. Design NimbleGen indexing for the sequencing of multiple subjects simultaneously
   d. Perform NimbleGen capture array
   e. Analyze results from capture array
   f. Perform next generation high-throughput sequencing
2. Analyze Sequence Results (Months 13-14)
   a. Identify novel SNP variants
   b. Determine which previously un-genotyped SNP variants (including novel and non-novel variants) are enriched in the patient populations sequenced.
   c. Contribute new SNP variants to NCBI
3. Genotype Large Well-Phenotyped Population (Months 15-20)
   a. Design array for genotyping, including new SNPs to be evaluated as well as a set of ancestry informative markers (AIMs)
   b. Perform genotyping
4. Analyze Genotyping Results (Months 21-24)
   a. Correct for population stratification
   b. Correct for multiple testing effects

The project title is Targeted DNA capture and parallel sequencing to identify causal mutations in SLE.
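The multiple-testing correction named in milestone 4b can be illustrated with the standard Benjamini and Hochberg step-up procedure (15). The sketch below is a generic implementation of that baseline procedure, not the variation of FDR (16) adopted in our analyses; the function name and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the input order.

    For the i-th smallest of m p values the raw adjustment is p * m / i;
    a running minimum taken from the largest p value downwards keeps the
    adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # largest p value first
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p values for four SNPs.
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in q_values]
print(q_values)
print(significant)
```

In this example only the first SNP passes the q<0.05 threshold, even though three of the four unadjusted p values are below 0.05.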
6. Significance

The sequential strategy of large scale targeted DNA capture and parallel sequencing, followed by selective genotyping of all relevant polymorphisms, has the highest probability of identifying the complete set of causal variants contributing to SLE heritability, and such an approach will move the field much faster towards realistically being able to evaluate new targets for early diagnosis and therapy. These studies will significantly advance our understanding of this disease and establish new key steps in the pathogenesis of SLE. The results will provide the immediate justification for the development of therapeutic approaches targeting these molecules or other molecules within their biochemical pathways. For diagnostic purposes, we will have the causal genetic markers that identify those SLE patients most likely to benefit from such therapeutic approaches. Furthermore, the proposed study can serve as a paradigm for studying other common diseases with complex genetic associations.

7. Relevance to mission of ALR

Autoimmune diseases are common chronic conditions which affect approximately 5% of the US population. Systemic lupus erythematosus (SLE) is the prototypical human multi-system autoimmune disease. It is a disorder of generalized autoimmunity characterized by an immune system attack against multiple organ systems. While SLE affects approximately 0.15% of the US population, there is increasing evidence that different inflammatory/autoimmune diseases share multiple predisposing genetic effects. Thus, the genes affecting SLE development are highly relevant to a very large population affected by a variety of inflammatory and autoimmune diseases. Although SLE is a global disease, it affects minorities such as African Americans and Hispanic Americans more severely than other ethnic populations.
Young women are most commonly affected, though SLE is also found in men, and as a chronic disease is responsible for significant morbidity and suffering of hundreds of thousands of patients resulting in massive expenditures on health care and severe impact on the quality of life of individuals with the disease.