alr_sequencing_narrative

1. Specific Aims
Studies by our group and others have established that the etiology of systemic lupus
erythematosus (SLE) is not dominated by a few major genes; rather, numerous genes
contribute to the pathogenesis of the disease, each with a modest effect. Using
genome-wide association studies (GWAS) in combination with a Bayesian methodology for the
development of a gene selection tool to increase the power of genetic association studies, we
have identified, replicated, and confirmed that 9 new genes are associated with SLE and
belong to two major biological pathways, apoptosis and the NFkB inflammatory pathway. We
argue that further fine mapping using a denser SNP coverage of the candidate genes is
unlikely to add much information to the already established association because of the strong
linkage disequilibrium (LD) expected between the new and the previously used SNPs. Instead,
we hypothesize that resequencing these candidate genes has a much higher chance of
identifying the causal variant(s). Furthermore, we hypothesize that using sufficient numbers of
affected subjects (preferentially selecting subjects with disease-associated SNPs within these
genes) will identify novel variants, including the causal variant(s). Therefore, the goal of this
application is to identify causal variants responsible for the associations by re-sequencing
these candidate genes and re-evaluating the newly discovered variants in our extensive and
well characterized SLE population.
The specific aims are as follows:
Specific Aim 1: Identification of causal genetic variants within the 9 candidate genes (already
known from GWAS findings to be strongly associated with SLE) by the application of targeted
DNA capture and massively parallel sequencing technologies.
Specific Aim 2: Selective genotyping of DNA samples from large well-phenotyped populations
to validate the newly discovered genetic variants. For the purpose of these studies we have
already assembled an unparalleled population of 6,500 adult-onset and 1,000 childhood-onset
cases of SLE and a large multidisciplinary collaborative team of investigators.
This sequential strategy will ensure the characterization of a complete set of causal variants
contributing to disease heritability and etiology. This approach will move the field more
quickly toward proposing and implementing new targets for early diagnostics and therapy.
2. Background
Studies by our group and many of our collaborators have established that the etiology of
autoimmune diseases in general—and the prototypic multi-organ systemic disease SLE in
particular—is not dominated by a few major genes. Instead, numerous genes contribute to the
pathogenesis of the disease, each with a modest effect. Using genome-wide
association studies (GWAS) in combination with a novel Bayesian methodology for the
development of a gene selection tool to increase the power of genetic association studies, we
have been successful in identifying, replicating, and confirming that a large number of SNPs
within 9 previously unassociated genes are associated with SLE. These genes belong to two
major biological pathways, apoptosis and the NFkB inflammatory pathway.
The majority of these SNPs are located in intronic sequences, and thus are highly unlikely to
be the causal variants responsible for the disease phenotype. Even in cases in which the
variants are not intronic, it remains unclear whether these polymorphisms represent the causal
alleles or rather are simply in linkage disequilibrium (LD) with the real causal variant(s). Until
recently, the only method to further advance the search for the causal variants was to conduct
fine mapping studies to narrow the region of significant association by using additional SNPs
from databases and conducting a focused association study. However, in many cases fine
mapping within identified haplotype blocks is unlikely to add much additional information
because of the strong LD between the additional SNPs. Therefore, in most cases this
approach will not lead directly to the identification of the causal variant, as the causal variant
may not yet be a known SNP variant (unavailable in the databases, since these databases
were obtained from sequencing subjects who are unlikely to be afflicted with SLE).
Without a systematic search for additional polymorphisms within a susceptibility gene it is
premature to suggest that a given variant is causal even when functional consequences can
be attributed to the variant. However, with the ongoing development of next generation
sequencing technologies it is now possible to evaluate a large number of the polymorphisms
within a gene in a timely and cost-effective manner as we suggest here. Moreover, the
technologies we will utilize make it very cost-efficient to investigate multiple genes and
subjects simultaneously. We argue that identifying the actual disease-causing genetic
variants in SLE requires large-scale DNA sequencing of a significant number of individuals
from well-phenotyped populations. This needs to be followed by selective genotyping of
DNA samples from large well-phenotyped populations to validate the newly discovered
genetic variants and determine the causal variants. For the purpose of these studies we
have already assembled an unparalleled population of 6,500 adult-onset and 1,000
childhood-onset cases of SLE and a large multidisciplinary collaborative team of investigators.
3. Preliminary Studies:
The PI, Dr. Jacob, is a founding member of SLEGEN, the International Consortium for Genetics
of Systemic Lupus Erythematosus. A SLEGEN GWAS published in 2008 provided results
from the Illumina 317K platform in European American women with lupus (1).
In addition, Dr. Jacob’s research team developed a gene selection tool to increase the power
of genetic association studies (2), and devised a microarray platform to discover susceptibility
genes contributing to autoimmunity. The candidate pathway genotyping platform developed
was applied to a sample of 753 subjects corresponding to 251 childhood-onset SLE trios
(patients and both parents). Childhood-onset SLE presents a unique subgroup of patients for a
genetic study because an earlier disease onset, a more severe disease course, and a greater
frequency of family history of SLE imply an increased likelihood of expressing the genetic
etiology. Most previous genetic studies used adult-onset disease.
The results of our TDT study of ~1,200 genes and ~10,000 SNPs were published in
December 2007 (3). We have since corroborated, replicated and extended the results in a
case-control approach using a second cohort of ~800 childhood-onset and a very large
cohort of ~5,500 adult-onset SLE cases with a much higher SNP density. In summary, our
follow up and replication studies included 11,368 participants (6,066 independent SLE
cases and 5,302 healthy subjects matched for sex, age and ethnicity). Table 1 depicts the
demographics of the study population in the replication studies. A portion of the results
obtained from these follow up studies have now been published in three manuscripts
during 2009 (4-6). Based on these studies we propose 9 new genes that have been
convincingly detected and replicated, with the overall false discovery rate (FDR) q values <
0.05 from all independent cases and controls after correction for multiple testing and principal
component analyses (PCA) for differences in population stratification between cases and
controls. Table 2 shows examples of the most significantly associated SNPs within these 9
genes in childhood- and adult-onset SLE cases. Table 4 below depicts the disease-relevant
functions of each of these genes, which make them very attractive candidates, since they
belong to biological pathways likely to be involved in the pathogenesis of SLE.

Ethnicity    Adult    Childhood    Control
AA             989          147       1053
EA            2912          216       3114
HA             603          247        266
AsA            793          159        869
Total         5297          769       5302
Table 1: Demographic distribution of the replication studies. Abbreviations: AA, African
American; EA, European American; HA, Hispanic American; AsA, Asian American.

Gene      Adult FDR    Adult SNP      Childhood FDR    Childhood SNP
SELP      5.24E-02     rs6125         8.75E-06         rs6125
TNFSF4    5.04E-11     rs1234315      1.64E-05         rs1234315
STAT4     2.46E-24     rs7582684      1.01E-08         rs10181656
FAIM      2.70E-24     rs13095734     1.91E-27         rs13095734
NCF2      1.02E-22     rs17849502     3.73E-25         rs17849502
KLRG1     1.03E-02     rs1121401      2.32E-03         rs1121401
TLR8      2.02E-04     rs17256081     7.86E-04         rs17256081
IRAK1     4.76E-09     rs763737       1.27E-03         rs763737
IL16      4.38E-02     rs7170924      1.04E-03         rs7170924
Table 2. Most significant SNPs in adult- and childhood-onset SLE. The FDR values shown
correspond to an estimation of the False Discovery Rate calculated using the Storey q value
procedure from the combined p values of four ethnicities using the Fisher method, thus
correcting for multiple testing effects. Populations were corrected for stratification using PCA
prior to analysis.
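The Table 2 caption refers to combining the p values of the four ethnicities with the Fisher method before computing q values. As a minimal illustration of that combination step only (not the authors' analysis code), the sketch below implements the standard Fisher statistic, assuming independent tests; the function name and the example p values are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the upper tail has a closed form, so only the standard
    library is needed.
    """
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need at least one p value in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # equals X / 2
    # P(chi2 with 2k df >= X) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


if __name__ == "__main__":
    # Hypothetical per-ethnicity p values for a single SNP.
    print(fisher_combined_p([0.04, 0.20, 0.01, 0.35]))
```

The combined p values for all SNPs would then be passed to an FDR procedure; that step is not shown here.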
We would like to emphasize that even in cases in which significant SNPs are not within introns
it is unlikely that we have identified the causal variant. For example, Fig 1 presents the
distribution of significant SNPs in the STAT4 gene. It is evident that significant SNPs are
widely spread throughout the gene, and we do not see a single major SNP which is massively
significant and present in all of the ethnicities which show association at a particular gene.
Therefore, we believe that only parallel sequencing in regions surrounding these nine genes
will ensure the identification of causal variants. The decision to go directly to sequencing in
these areas, rather than first further narrowing the area of significant association (via fine
mapping), is based on the practical argument that sequencing technology is now advanced
enough to sequence fragments of DNA that encompass the relevant neighboring
genes, and to do so cost-effectively. Furthermore, even if the association could be further narrowed by
fine mapping, we would still need to resequence given that additional SNPs (from databases)
are unlikely to add much to the significance, unless the causal SNP is identified, because of
the strong LD among the additional SNPs.
Furthermore, our approach is to use 100 subjects from each of two ethnicities, Hispanic
American and European American (for a total of 200 subjects). According to the power
calculation shown below this number of subjects is sufficient to provide 87% power to identify
alleles that are present in 1% of the patient population. We are using two ethnicities because
some disease-associated variants are more commonly found in one ethnicity and not the
other. This problem is exemplified in Table 3. In order to locate the causal variant in linkage
disequilibrium with the disease-associated variant, we need to sequence the regions
containing the disease-associated variant, which is significantly associated only in specific
ethnicities. For example, in the case of rs17256081 (TLR8), only Hispanic Americans show
significant disease-association with this allele, whereas in the case of rs17849502 (NCF2),
only European Americans show significant disease association with this allele (Table 3).
For reasons of cost-efficiency, coupled with the power necessary for a high likelihood of
detecting causal variants, we do not propose at this time to sequence the two other ethnicities
represented in our cohorts and used in our study. Doing so would require us to sequence more than
400 samples to attain equivalent power in those populations, at a prohibitive increase in cost.
4. Experimental Design and Methods:
Overall Rationale: Recent genome-wide association and candidate pathway studies, followed
by replication studies, clearly establish that the following 9 genes (SELP, TNFSF4, STAT4,
FAIM, NCF2, KLRG1, TLR8, IRAK1, and IL16) are significantly associated with SLE in
multiple populations.
Specific Aim 1: Identification of causal genetic variants within 9 genes strongly associated
with SLE by the application of targeted DNA capture and massively parallel sequencing
technologies. The genes targeted are: SELP, TNFSF4, STAT4, FAIM, NCF2, KLRG1, TLR8,
IRAK1, and IL16.
Sample size considerations for sequencing: According to Glatt et al. (7) we use the following
equation for sample size estimation: Power = 1-(1-f)^(2n), where f is the minor allele frequency
(MAF), 2n is the number of chromosomes, and Power is the probability of finding at least one copy
of an allele with MAF=f. Based on this equation, if we sequence 50 subjects (namely 100
chromosomes) there should be 99% power to identify an allele with 5% allele frequency, and
63% for an allele with 1% frequency. Sequencing 100 subjects would give us 87% power to
identify alleles that are present in 1% of the patient population and 99.9% power for alleles
present at 5%. As is evident from our preliminary studies (5, 6), we appear to be dealing with
allele frequencies that are higher than those used for the power calculation. Hence, sample
size should not be a limiting factor.

Gene    SNP           Ethnicity    Adult FDR    Childhood FDR
TLR8    rs17256081    EA           9.86E-2      2.78E-1
                      HA           5.07E-4      1.39E-3
NCF2    rs17849502    EA           5.30E-24     3.44E-25
                      HA           1.17E-1      6.40E-2
NCF2    rs10911363    EA           2.98E-5      9.81E-4
                      HA           1.04E-1      2.45E-2
NCF2    rs34037871    EA           1.44E-2      2.92E-2
                      HA           4.58E-1      2.51E-1
Table 3. Different SNPs are significant in different ethnicities. FDR is the estimate of the False
Discovery Rate as calculated by the Benjamini and Hochberg procedure (15). EA are European
Americans, HA are Hispanic Americans.

Regions to be sequenced around our 9 genes are depicted in Table 4. For example, for IRAK1,
the region from approximately base 152,881,000 to 153,027,000 of chromosome X (NCBI
reference assembly build 36 version 3) will be sequenced, which encompasses
the neighboring genes of IRAK1, including MECP2 (which was suggested independently of us
as a candidate SLE gene and probably describes the same genetic association). We
deliberately overestimate the region that is likely to contain a causative allele in LD with the
SNPs we have found to be associated. With high-throughput sequencing, the cost difference
between a more precisely defined region and an overestimated one is not worth the risk of
missing a causative allele located outside the region of currently known SNPs in LD with the
associated SNPs. Furthermore, the available information regarding LD in these regions is not
necessarily accurate and is primarily available in Caucasians; the extent to which LD can be
extrapolated from Caucasians to the Hispanic ethnicity, which we also plan to sequence, is
unknown.
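The power figures quoted in the sample size considerations can be checked directly from the equation Power = 1-(1-f)^(2n). The short sketch below is only an arithmetic check of that formula; the function name is hypothetical and nothing beyond the stated equation is assumed.

```python
def detection_power(maf, n_subjects):
    """Probability of observing at least one copy of an allele.

    Implements Power = 1 - (1 - f)^(2n), where f is the minor allele
    frequency and 2n is the number of chromosomes sequenced.
    """
    if not 0.0 <= maf <= 1.0 or n_subjects < 0:
        raise ValueError("maf must be in [0, 1] and n_subjects >= 0")
    return 1.0 - (1.0 - maf) ** (2 * n_subjects)


if __name__ == "__main__":
    for n in (50, 100):
        for f in (0.01, 0.05):
            power = detection_power(f, n)
            print(f"n={n:3d} subjects, MAF={f:.2f}: power={power:.3f}")
```

For 50 subjects this gives roughly 0.63 (MAF 1%) and 0.99 (MAF 5%), and for 100 subjects roughly 0.87 and above 0.999, in agreement with the values stated in the text.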
In order to optimize the probability of finding the causal mutations, we will preferentially select
subjects with disease-associated SNPs within these genes and specifically those who
developed SLE in childhood, because childhood-onset SLE has stronger penetrance of the
disease alleles. For the controls, we will utilize known sequences of these regions, including
data from HapMap, Celera, JCVI, and NCBI.
As starting material, at least 25 micrograms of genomic DNA prepared from peripheral blood
is currently available for each subject for the purpose of these studies.
The major bottleneck in the next generation sequencing approaches is enriching the target
DNA prior to sequencing (8). Until very recently the most common approach for target DNA
isolation included short and long PCR. As the size and number of regions of interest increase,
other approaches have been developed that rely on parallel capture and enrichment (8, 9).
These methods have significant advantages over PCR, including use of less input DNA per
region, parallel capture of a large number of regions, less need for optimization, and quicker
isolation of captured DNA. Accordingly, we will use such a method, available from NimbleGen,
to capture targeted genomic regions for sequencing.
Gene         Chromosome Region    Start (kb)    Stop (kb)    Significant SNPs
SELP         1q24                 167824        167877       6
    Function: Adhesion receptor for neutrophils, monocytes and T-cells, responsible for the
    migration of these cells for the initiation and perpetuation of the NFkB inflammatory
    response.
TNFSF4       1q25                 171419        171474       3
    Function: Interaction between this gene and its receptor is involved in co-stimulation of T
    and B lymphocytes and in the adhesion of T cells to endothelial cells for the
    induction of the NFkB inflammatory pathway.
STAT4        2q32.2-3             191602        191725       30
    Function: Signal Transducer and Activator of Transcription essential for Th1 and interferon
    activation, involved in essential cellular events of differentiation, proliferation and
    apoptosis following cytokine and growth factor signaling.
FAIM         3q22                 139790        139835       2
    Function: FAS apoptosis inhibitor molecule; functions as an anti-apoptotic molecule and also
    has NFkB activating functions.
NCF2         1q25                 181698        181837       13
    Function: Essential component of the NADPH oxidase enzyme complex in phagocytic
    leukocytes. It is important for host innate immunity.
KLRG1        12p12-13             8993          9055         7
    Function: Lectin-like receptor involved in differentiation, proliferation, and apoptosis of a T
    cell subset, including NK cells.
TLR7/TLR8    Xp22                 12785         12852        2
    Function: A member of the Toll-like receptor (TLR) family, which is intimately involved in the
    activation of NFkB.
IRAK1        Xq28                 152881        153027       6
    Function: Serine/threonine protein kinase involved in the signalling cascade of the TOLL/IL-1
    receptor family. Considered the ON/OFF switch of the receptor complex to the
    activator adaptor protein TRAF6, which is responsible for the activation of several
    inflammatory
IL16         15q26.3              79266         79403        3
    Function: A pleiotropic cytokine that functions as a chemoattractant for immune cells and a
    modulator of T cell activation.
Table 4. Regions of nine genes proposed for sequencing in 100 individuals with SLE in each of two
ethnicities (Hispanic American and European American) utilizing targeted DNA capture and massively
parallel sequencing technologies. Significant SNPs are those which have an estimated False Discovery
Rate less than 0.05 (after multi-test correction) in either adult- or childhood-onset populations. Regions
are indicated in kilobases using the NCBI reference assembly build 36 version 3. Functions of the genes
are summarized for convenience.
We will take advantage of the newly developed indexing system that involves the addition of a
specific “indexing” sequence to the end of the primer sequence. The sequence of this
“indexing” segment will be determined at the initiation of each sequencing reaction, thus
allowing the subsequently determined sequence to be assigned to a specific sample, even
when fragments derived from several different “indexed” samples are being analyzed in the
same sequencing run. For our specific application, assuming that we intend to capture about
0.81 Megabases of genomic DNA containing our candidate genes (Table 4), it is reasonable to
predict that 10 samples could be individually fragmented, ligated with indexed sequencing
primers, pooled and hybridized to a single custom NimbleGen genomic sequence capture chip,
and the eluted fragments could be sequenced as a single sample. Assuming a conservative
estimate of ~50-fold sequence coverage of our selected genes, we should expect a fairly
robust 40.9 Megabases of sequence data for each sample, or 8.18 Gigabases in total for the
identification of allelic sequence variations in all 200 samples. The efficiency of this system and
the amount of raw sequence information required for a thorough analysis of both alleles in
each sample from a pool of ten will be impacted by several technical factors, including: 1) the
proportion of sequence information obtained for each sample in the pool; 2) the relative
efficiency of genome sequence capture for various segments in our selected genes during the
hybridization to the NimbleGen chip; and 3) the complexity of the genetic polymorphisms
revealed.
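The data-volume estimate above is simple arithmetic and can be laid out as a short sketch. The quoted 40.9 Megabases per sample corresponds to a target of about 0.818 Megabases at 50-fold coverage; that target size is inferred here from the quoted totals and is an assumption, as are the function and parameter names.

```python
import math


def sequencing_plan(target_mb, fold_coverage, n_samples, samples_per_chip):
    """Back-of-envelope data volume and capture-chip count.

    Returns (Mb of sequence per sample, Gb of sequence overall,
    number of capture chips needed when samples are pooled per chip).
    """
    per_sample_mb = target_mb * fold_coverage
    total_gb = per_sample_mb * n_samples / 1000.0
    chips = math.ceil(n_samples / samples_per_chip)
    return per_sample_mb, total_gb, chips


if __name__ == "__main__":
    per_sample_mb, total_gb, chips = sequencing_plan(
        target_mb=0.818,      # assumed: inferred from 40.9 Mb at 50-fold
        fold_coverage=50,     # conservative coverage stated in the text
        n_samples=200,        # 100 subjects in each of two ethnicities
        samples_per_chip=10,  # indexed samples pooled per capture chip
    )
    print(f"{per_sample_mb:.1f} Mb per sample, {total_gb:.2f} Gb total, "
          f"{chips} capture chips")
```

Under these inputs the sketch reproduces the 40.9 Megabase and 8.18 Gigabase figures and implies 20 pooled captures; actual requirements would depend on the technical factors listed above.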
Nevertheless, given the relative maturity of the technology, we expect no major problems in
achieving our goal.
Using the novel sequencing methodology and a large number of available subjects, we expect
to find novel SNPs previously unidentified in publicly available SNP databases. To
demonstrate the feasibility of this approach, we performed a single-sample experiment (on a
Hispanic American childhood-onset SLE subject) in which a custom NimbleGen array was used
to capture the targeted genomic regions, followed by sequencing.
Roughly 1.2 million sequencing reads were obtained, of which 94% mapped uniquely to the
human genome reference. 99% of the bases mapped to the genome, with 52% of the reads
falling within the targeted region, corresponding to a significant enrichment over the result for
shotgun sequencing of the full genome. 75% of the targeted bases exhibited >100-fold
coverage. This sequencing run discovered approximately 18 as-of-yet undescribed SNP
variants in the region sequenced. Beyond demonstrating feasibility, this preliminary study
allows our support personnel to assess the technology.

Figure 1. -log10 of STAT4 p values versus the position on chromosome 2 in adult and
childhood-onset lupus separated by ethnicity. Exons of STAT4 are shown below in green with
exon numbers above.
Specific Aim 2: Selective genotyping of DNA samples from large well-phenotyped populations
to validate the newly discovered genetic variants.
Rationale: Based on our own resequencing experience, as well as that of many other
investigators, it is very unlikely that no new SNPs will be identified. A more likely scenario is
that many new SNPs will be identified, and the question will be which of the new SNPs are
relevant to SLE. For this purpose the re-evaluation of the new SNPs is an accepted and logical
phase, unless an absolutely obvious causal SNP emerges in the previous step.
Therefore, in this aim we will re-evaluate the new SNPs identified through parallel sequencing
in case-control association studies, encompassing all 4 different ethnicities utilizing the entire
population of adult- and childhood- onset SLE samples as well as an extensive collection of
unrelated controls. It is important to emphasize that in deciding which SNPs to use for
genotyping we will evaluate the regions sequenced to locate all new SNPs (that are not
sequencing artifacts) prioritizing those new SNPs that have a variant found more commonly
among the SLE subjects. In addition, we will genotype SNPs that were previously known (in
the databases) where a single variant is found preferentially in the case populations
sequenced and that have not previously been genotyped in our association studies.
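As a simple illustration of the case-control comparison used to prioritize and re-evaluate variants, the sketch below runs a basic allelic chi-square test on a 2x2 table of allele counts. It is a minimal example only: the function name and the counts are hypothetical, and it deliberately omits the stratification covariates and multiple-testing correction that the full analysis requires.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Returns (chi2, p_value, odds_ratio) for the alternate allele in
    cases versus controls.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    n = sum(row)
    if min(row) == 0 or min(col) == 0:
        raise ValueError("every row and column needs a non-zero total")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of chi-square with 1 df: P(X >= x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    if case_ref == 0 or ctrl_alt == 0:
        odds_ratio = float("inf")
    else:
        odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio


if __name__ == "__main__":
    # Hypothetical allele counts for one newly discovered SNP.
    chi2, p, oratio = allelic_chi_square(60, 40, 40, 60)
    print(f"chi2={chi2:.2f}, p={p:.4f}, OR={oratio:.2f}")
```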
Study population: Through elaborate collaborative agreements (successfully employed in the
preliminary study and in additional collaborative undertakings), we will have available for the
studies proposed here at least 6,500 adult-onset SLE cases (at least 1,000 SLE subjects from each
of the 4 ethnic/racial populations) and over 1,100 childhood-onset cases [estimated as 300
subjects from each of the European American (EA), Hispanic American (HA), and Asian American
(AsA) populations and 200 African American (AA) subjects]. In addition we will have approximately 6,450
controls (at least 1,000 controls for EA, AA, and AsA and approximately 700 HA controls). Controls
will be matched for gender, ethnicity and age in adult-onset cases. For childhood SLE cases,
the controls to be used will be matched for gender, ethnicity and geographic location, but will
not be matched for age. In fact, all controls will be adults. We believe that adults who have
neither SLE nor any other rheumatic disease are better controls than age-matched children, who
may develop disease later in life.
Specifically, we have available the cohorts used for the replication studies (Table 1 above, and
letter of collaboration from Dr. John Harley in the name of various OMRF collaborators). We
now have available from UCLA an additional ~1,000 adult-onset SLE subjects and 816 controls
which have not been used in any of the preliminary studies (letters of collaboration from Dr.
Betty Tsao attached). The childhood-onset SLE resource developed by the PI via a recruitment
network for pediatric SLE patients includes the Children’s Hospital of Los Angeles (CHLA), the
Children’s Hospital of Orange County, CA (CHOC), the Children’s Memorial Medical Center in
Chicago (CMMC), Texas Children’s Hospital of Houston (TCH) and the Hospital for Sick
Children (HSC) in Toronto. Letters from participating pediatric rheumatologists are attached.
All protocols were approved by the Institutional Review Boards at each respective institution.
All patients met the revised 1997 ACR criteria for the classification of SLE. All procedures,
methodologies and data collection were standardized across all participating
sites. Self-reported ethnicity was verified by parental and grandparental ethnicity, when
possible.
Statistical analyses will be performed by Dr. Zidovetzki and Dr. Armstrong, both of whom
have collaborated with the PI on this and related projects for a number of years. Dr. Langefeld will
serve as consultant (letter of collaboration attached). Both publicly available software (e.g.,
PLINK, Haploview 4.0) and software specifically developed by Drs. Zidovetzki and
Armstrong will be utilized. Analyses will be done as described in detail in our recent studies (4-6).
Potential pitfalls
Confounding by Population Stratification
Genetic association studies are potentially susceptible to confounding by population
stratification. In our studies population heterogeneity may be most problematic
for the Hispanic population, which may have Mestizo, European and African ancestry.
However, our collection, which comes from California and Texas, consists of Mexican Americans
(MA), thereby eliminating the additional complexity present when the samples include Cubans and
Puerto Ricans or different South American Hispanic populations. Similarly, the AA population is
also an admixed population.
Although population stratification has been the subject of vigorous debate in the epidemiologic
literature, it is now generally accepted that potential confounding may be adequately controlled
using a set of unlinked genetic markers to estimate the underlying population structure.
The effectiveness of this approach depends on the ability of the set of SNPs to capture the
underlying genetic substructure via the coefficient of ancestry (10). To ensure successful
implementation of this method of control, we have identified 233 SNPs specifically selected to
differentiate parental populations (African, European, American Indian, Mexican, and Asian)
and for making precise ancestry estimates (11). These SNPs have previously been genotyped
in a multiethnic sample.
In our studies, we will use the set of 233 SNPs to estimate a coefficient of ancestry (12) and
identify key principal components (13) for each individual. These variables will then be used as
covariates in the model to ensure unbiased effect estimates and valid tests of association.
Other potential confounding: As for potential concerns regarding heterogeneity of the study
population (in diagnosis and classification) and the multi-site nature of the resources, we would
like to emphasize that standardized procedures were followed by the investigators in all
clinical sites to ensure that cases can be pooled between centers without the introduction of
diagnostic heterogeneity. These included diagnostic, disease progression, and damage index
forms filled out for every recruited SLE patient. The same ACR criteria for SLE have been used
and rigorously applied in all cases. Diagnosis was confirmed in each case by at least one
rheumatologist. Subjects used in our studies are recruited from several medical centers
throughout the US. We acknowledge that different geographic sites of recruitment can
introduce various environmental factors that may interact with genetic risk factors. We would
like to argue, however, that in our specific case these factors may not play an important role.
The environmental factors recognized in SLE, namely exposure to sunlight and to EBV,
are ubiquitous. The successes in the identification and replication of several SLE susceptibility
genes using cohorts from several continents (for example IRF5) support our argument for
a multi-center recruitment strategy.
Sample size: We acknowledge that our childhood-onset SLE cohorts, when subdivided into the
various ethnicities, have relatively moderate size. However, based on our experience in the
preliminary studies, we would argue that subjects that develop disease in childhood should be
enriched for genetic effects: first, because early disease onset may be an indication for
increased genetic predisposition and penetrance, and second, because sex hormonal
influences are less likely to play a significant role in the onset of disease in this age group.
Finally, childhood-onset SLE is often a more severe disease than adult-onset and has been
shown to have a more aggressive clinical course.
Multiple testing: The standards of statistical proof that are commonly used in biomedical
literature have been questioned when applied to large SNP-based genetic association studies.
The problem of multiple testing pervades the discipline without a clear consensus about how it
should be solved (14). The classical Bonferroni correction is both too strict and inappropriate in
the case of genetic studies because it assumes that each test is independent, whereas in
actuality a complex and unknown mutual dependence is present among genes, and even more
prominently among SNPs of the same gene. The false discovery rate (FDR) (15) approach is
currently widely used in genetic microarray and association studies. We adapted a variation of
FDR (16) for the multiple-testing correction in our case, with q<0.05 (corresponding to an
expected proportion of less than 5% false positives among the results called significant)
considered to be significant.
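For readers less familiar with FDR control, the sketch below shows the standard Benjamini-Hochberg step-up adjustment (15). It is offered only as an illustration of the general idea: the variation of FDR (16) actually adopted in our analyses may differ in detail, and the function name and example p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order.

    Each p value is scaled by m / rank (rank 1 = smallest p value), and
    monotonicity is enforced by taking a running minimum from the largest
    p value downward.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p values for four SNPs.
    p_values = [0.01, 0.04, 0.03, 0.005]
    adjusted = benjamini_hochberg(p_values)
    significant = [q < 0.05 for q in adjusted]
    print(adjusted, significant)
```

A SNP would be called significant when its adjusted value falls below the 0.05 threshold stated above.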
Once all (or at least most) polymorphisms have been identified within the genes
of interest and re-evaluated in relevant patient populations, systematic functional studies
will be the appropriate next step toward discovering the mechanisms through which the genetic
variants are involved in the causation or perpetuation of SLE. Thus, the completion of this work
will significantly advance our understanding of the genetic foundation of SLE (and
autoimmunity in general), and will provide a foundation for development of diagnostics and
therapy for this devastating disease.
Concluding remarks: We would like to emphasize some of the unique strengths of this
proposal to significantly impact the field: 1) Solid and extensive preliminary work on the
genetics of SLE and integration of the previously separate approaches in childhood-onset and
adult-onset SLE into one combined effort which is a natural extension of our ongoing work on
human SLE. 2) A unique feature of our application is an unparalleled access to a very large
SLE study population of over 6,500 adult-onset and 1,000 childhood-onset SLE cases and
similar numbers of matched controls. Furthermore, our study sample includes a large
population of ethnic minorities which enables us to identify and characterize important variants
predisposing to SLE that are common to multiple ethnic groups. 3) A great strength of the
application is the integration of the talent of a large multidisciplinary team of clinical and basic
research scientists from multiple institutions, each of whom brings a great depth of experience
to the application; and 4) Access to state-of-the-art technology in gene mapping, next
generation sequencing and bioinformatics that are constantly being practiced and improved
upon by our team of experts.
5. Timelines and Milestones
1. Sequence Samples (Months 1-12)
a. Identify most appropriate subjects to sequence. (Subjects that are enriched for
the SNP variant associated with SLE with enough DNA for sequencing.)
b. Design NimbleGen capture array
c. Design NimbleGen indexing for the sequencing of multiple subjects
simultaneously
d. Perform NimbleGen capture array
e. Analyze results from capture array
f. Perform next generation high-throughput sequencing
2. Analyze Sequence Results (Months 13-14)
a. Identify novel SNP variants
b. Determine which previously un-genotyped SNP variants (including novel and
non-novel variants) are enriched in the patient populations sequenced.
c. Contribute new SNP variants to NCBI
3. Genotype Large well-phenotyped population (Months 15-20)
a. Design array for genotyping including new SNPs to be evaluated as well as a set
of markers informative for ancestry (AIMs)
b. Perform genotyping
4. Analyze genotyping results (Months 21-24)
a. Correct for population stratification
b. Correct for multiple testing effects
The project title is Targeted DNA capture and parallel sequencing to identify causal mutations
in SLE.
6. Significance
The sequential strategy of large-scale targeted DNA capture and parallel sequencing, followed
by selective genotyping of all relevant polymorphisms, has the highest probability of
identifying the complete set of causal variants contributing to SLE heritability. Such an approach
will move the field more quickly toward evaluating new targets for early
diagnosis and therapy. These studies will significantly advance our understanding of this
disease and establish new key steps in the pathogenesis of SLE. The results will provide the
immediate justification for the development of therapeutic approaches targeting these
molecules or other molecules within their biochemical pathways. For diagnostic purposes, we
will have the causal genetic markers which will identify those SLE patients most likely to
benefit from such therapeutic approaches. Furthermore, the proposed study can serve as a
paradigm for studying other common diseases with complex genetic associations in general.
7. Relevance to mission of ALR
Autoimmune diseases are common chronic conditions which affect
approximately 5% of the US population. Systemic lupus erythematosus (SLE) is the
prototypical human multi-system autoimmune disease. It is a disorder of generalized
autoimmunity characterized by an immune system attack against multiple organ systems.
While SLE affects approximately 0.15% of the US population, there is increasing evidence that
different inflammatory/autoimmune diseases share multiple predisposing genetic effects. Thus,
the genes affecting SLE development are highly relevant to a very large population affected by
a variety of inflammatory and autoimmune diseases.
Although SLE is a global disease, it affects minorities such as African Americans
and Hispanic Americans more severely than other ethnic populations. Young women are most
commonly affected, though SLE is also found in men. As a chronic disease, it is responsible
for significant morbidity and suffering among hundreds of thousands of patients, resulting in
massive health care expenditures and a severe impact on the quality of life of individuals with
the disease.