Association mapping approach to identify genomic regions associated with oil
content in sunflower based on SNPs.
Marcos Kaspar, María Valeria Paccapelo, Teresa Galella, Martín Grondona, Roberto Reid,
Alberto León, and Andrés Zambelli.
Balcarce Biotechnology Center, Advanta Semillas SAIC, Ruta 226 Km 60.5 (7620) Balcarce, Buenos
Aires, Argentina. E-mail: andres.zambelli@advantasemillas.com.ar





Seed oil yield is, together with grain yield, one of the main traits targeted in sunflower breeding
programs. Seed oil percentage has a medium to high heritability (ranging from 0.5 to 0.9 in
previous studies) and predominantly additive gene action, which facilitates selection in early
generations. To exploit an agronomic trait, it is important to know the location of the QTL
associated with it. Linkage analysis in biparental families typically localizes QTLs to 10 to 20 cM
intervals because of the limited number of recombination events that occur during the
construction of mapping populations and the cost of propagating and evaluating a large number
of lines.
Association mapping, also known as linkage disequilibrium mapping, is a relatively
new and promising method for the genetic dissection of complex traits. This approach
achieves higher mapping resolution by exploiting the historical recombination events
accumulated in a population of unrelated inbred lines. Association mapping takes
advantage of ancestral recombinations and natural genetic diversity within a population to
dissect quantitative traits, and it is built on the concept of linkage disequilibrium. Because of
their higher density in the genome, lower mutation rate, and better amenability to
high-throughput detection systems, SNPs (single nucleotide polymorphisms) are rapidly
becoming the markers of choice for dissecting complex genetic traits by association mapping.
The objective of the present study was to perform a whole genome association mapping
analysis for oil content in a set of sunflower inbred lines to identify the genomic regions
associated with this trait.
A set of 74 conventional lines from Advanta's germplasm was selected to represent most
of the total genetic variation of the Advanta Breeding Program. The lines were evaluated for oil
content in one environment by the Soxhlet method and genotyped using 6821 SNPs covering the 17
linkage groups (LGs) of the sunflower genome. Associations between oil content and each SNP
marker were analyzed using an additive linear regression model fitted with the mlreg function
of the GenABEL package for the R statistical software. To control for spurious associations,
the model accounts for the population structure inferred with the STRUCTURE software.
The statistical analysis of oil content identified association signals with SNPs in different
LG regions and with different degrees of significance. Some of the regions identified agree
with previous QTL analyses. Among all the regions identified, four were chosen for
their high significance. For each region, the most significant SNP was selected; each of these
SNPs explained between 15 and 21 % of the total phenotypic variation. Analysis of the haplotypes
of the 74 lines showed that lines with the unfavorable haplotype (unfavorable allele at the 4
selected SNPs) had an average oil content 10 percentage points lower than lines with the
favorable haplotype.
Using an association mapping (AM) approach, four regions associated with oil content were
identified. In AM, the accuracy of the phenotypic data is crucial for obtaining clear results, so
future work will focus on increasing the number of environments in which oil content is
measured.
Several studies have identified QTLs associated with oil content. This approach will contribute to
a better definition of those QTLs, or even to the identification of candidate genes associated
with oil content in sunflower, leading to a better understanding of this complex trait. The results
of this study will be useful not only as a source of information about the genetics of the trait
but also in marker-assisted breeding programs.
Keywords: sunflower, oil content, association mapping, QTL
INTRODUCTION
Seed oil yield is, together with grain yield, one of the two main traits targeted in
sunflower breeding programs. Seed oil percentage is a complex trait with medium to high
heritability (ranging from 0.5 to 0.9) and predominantly additive gene action, which facilitates
selection in early generations. Most traits of agricultural or evolutionary importance are
controlled by multiple quantitative trait loci (QTL). Genetic mapping and molecular
characterization of these functional loci facilitate genome-aided breeding for crop
improvement. Identification of QTL requires mapping them in the genome using molecular
markers. Association mapping (AM), also known as linkage disequilibrium (LD) mapping, has
emerged as a tool to resolve complex trait variation down to the sequence level by exploiting
historical and evolutionary recombination events at the population level (Zhu et al., 2008). As a
new alternative to traditional linkage analysis, association mapping offers three advantages: (i)
increased mapping resolution, (ii) reduced research time, and (iii) greater allele number.
In contrast to linkage based studies, linkage disequilibrium based genetic association
studies offer a potentially powerful approach for mapping causal genes with modest effects.
While linkage analysis is based upon detection of non-random association between a genotype
and a phenotype in well characterized pedigrees, association mapping focuses on associations
within populations of unrelated individuals (Ersoz et al., 2007).
Two AM methodologies are in use: candidate gene and whole genome scan. The first
assumes a good understanding of the biochemistry and genetics of the trait, while the genome
scan tests most segments of the genome for association with the trait. The
hypothesis under consideration is: ‘one (or more) of the genetic loci being considered is either
causal for the trait or in linkage disequilibrium with the causal loci’. The first step in
this process is therefore the discovery of a large number of genetic markers, typically single
nucleotide polymorphisms (SNPs), as a reference resource. AM uses pairwise LD statistics to
infer the predictive value of a marker locus for its association with the phenotype of interest.
The high-LD chromosomal region around a marker locus defines the predictive range of that
marker. If LD within this genomic range is complete, any polymorphism within the range will
have the same predictive value for the association with the phenotype. Hence, from a significant
marker-phenotype association it can be concluded that the causative polymorphism resides
within the high-LD region around the marker locus (Ersoz et al., 2007).
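As an illustration of the pairwise LD statistics mentioned above, the following minimal Python sketch computes the squared correlation r^2 between two biallelic markers. It assumes fully homozygous inbred lines, so each line carries a single allele per marker coded 0 or 1. The function name `ld_r2` and the toy marker vectors are hypothetical and are not taken from the study.

```python
def ld_r2(marker_a, marker_b):
    """Squared correlation (r^2) between two biallelic markers.

    Each marker is a list of 0/1 allele codes, one per inbred line.
    Lines are assumed homozygous, so one allele per line is enough.
    """
    n = len(marker_a)
    if n == 0 or n != len(marker_b):
        raise ValueError("markers must be non-empty and of equal length")
    p_a = sum(marker_a) / n                      # frequency of allele 1 at marker A
    p_b = sum(marker_b) / n                      # frequency of allele 1 at marker B
    p_ab = sum(a * b for a, b in zip(marker_a, marker_b)) / n  # haplotype 1-1 frequency
    d = p_ab - p_a * p_b                         # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both markers must be polymorphic")
    return d * d / denom


# Toy data for 8 lines (made up for illustration).
snp1 = [0, 0, 1, 1, 0, 1, 1, 0]
snp2 = [0, 0, 1, 1, 0, 1, 1, 0]   # identical to snp1: complete LD
snp3 = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of snp1 in this sample

print(ld_r2(snp1, snp2))
print(ld_r2(snp1, snp3))
```

Under complete LD (r^2 = 1) the two markers carry the same information about the phenotype, which is the situation described in the text for polymorphisms inside a high-LD region.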
AM exploits recombination events that occurred during evolutionary history, which provides
higher mapping resolution (Kolkman et al., 2007). However, the uncontrolled population design
can result in spurious signals of association in downstream analysis. False positive associations
between markers and traits can arise from population structure caused by selection, plant
improvement, etc. Thus, accounting for population structure is a critical
prerequisite in association analyses. Moreover, if pedigree data from inbred lines can be
reconstructed and used, control of type I and II error rates improves (Yu et al., 2006). Another
source of false positives is the occurrence of alleles at very low frequency.
There are five main stages in association studies: (1) selection of population samples, (2)
determination of the level and influence of population structure in the sample, (3) phenotyping
the population sample for the traits of interest, (4) genotyping the population, for either
candidate genes/regions or as a genome-wide scan, and (5) testing the genotypes and phenotypes
for their associations.
The objective of the present study was to perform a whole genome association mapping
analysis for oil content in a set of sunflower inbred lines to identify the genomic regions
associated with this trait.
MATERIAL AND METHODS
A subset of 74 conventional lines from Advanta's germplasm was selected to
represent most of the total genetic variation of the Advanta Breeding Program. The selection
was based on allele sharing distances obtained from SSR genotypes and on pedigree information.
Twenty plants of each line were sown in 2008 at Balcarce, Argentina. After
harvesting, 50 g of seeds per line were used to measure the oil content (Soxhlet method).
A set of SNPs was discovered and validated by a multi-company Consortium led by
The University of Georgia (Athens, GA, US). The 74 conventional inbred lines from Advanta's
germplasm were genotyped by the Consortium with 6984 SNPs using a high-throughput
genotyping system (Illumina GoldenGate).
Associations between oil content and each SNP marker were analyzed using an
additive linear regression model fitted with the mlreg function of the GenABEL package for
the R statistical software (Aulchenko et al., 2007). To control for spurious
associations, the model accounts for the population structure inferred with the
STRUCTURE software (Pritchard et al., 2000). The model has the following expression:
$$\mathrm{OilContent}_{ijk} = \mu + \mathrm{Structure}_i + \mathrm{SNP}_j + \varepsilon_{ijk}$$
where $\mu$ is the general mean oil content, $\mathrm{Structure}_i$ is the effect of the $i$th
subgroup present in the population, $\mathrm{SNP}_j$ is the effect of the $j$th genotype on the
mean oil content, and $\varepsilon_{ijk}$ is the residual error.
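The study fitted this model with GenABEL in R. As a language-neutral illustration only, the Python sketch below estimates the same kind of structure-adjusted additive SNP effect. It centers both oil content and SNP dosage within each subgroup, then regresses one on the other. By the Frisch-Waugh-Lovell theorem this gives the same slope as the joint model with a subgroup factor. The 0/2 dosage coding for homozygous lines, the function names and the toy data are assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def center_within_groups(values, groups):
    """Subtract from each value the mean of its own group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def snp_effect_adjusted(oil, dosage, subpop):
    """Additive SNP effect on oil content, adjusted for a subpopulation factor.

    Returns (beta, t_statistic, residual_df), where beta is the change in
    oil content per copy of the counted allele.
    """
    y = center_within_groups(oil, subpop)
    x = center_within_groups(dosage, subpop)
    sxx = sum(v * v for v in x)
    if sxx == 0:
        raise ValueError("SNP does not vary within subpopulations")
    beta = sum(a * b for a, b in zip(x, y)) / sxx
    rss = sum((b - beta * a) ** 2 for a, b in zip(x, y))
    # Parameters used: one mean per subpopulation plus one SNP slope.
    df = len(oil) - len(set(subpop)) - 1
    se = sqrt(rss / df / sxx)
    return beta, beta / se, df


# Toy data (made up): 8 lines, 2 subpopulations, dosage 0 or 2.
oil = [38.0, 41.0, 39.0, 43.0, 45.0, 49.0, 46.0, 50.0]
dosage = [0, 2, 0, 2, 0, 2, 0, 2]
subpop = ["A", "A", "A", "A", "B", "B", "B", "B"]

beta, t_stat, df = snp_effect_adjusted(oil, dosage, subpop)
print(beta, t_stat, df)
```

Converting the t statistic to a p-value requires the t distribution with `df` degrees of freedom, which the Python standard library does not provide, so the sketch stops at the test statistic.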
Once regions were detected, the SNP with the strongest association with the trait
within each region was selected. For every selected SNP, the effect of the
favorable/unfavorable allele was studied, and the percentage of total variation explained by the
SNP was determined with the following model:
$$\mathrm{OilContent}_{ij} = \mu + \mathrm{SNP}_i + \varepsilon_{ij}$$
where $\mu$ is the mean oil content, $\mathrm{SNP}_i$ is the effect of the $i$th genotype on the
mean oil content, and $\varepsilon_{ij}$ is the residual error. Additionally, models combining 2,
3 or 4 SNPs were studied.
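For a model with a single genotype factor, the proportion of variation explained equals the between-genotype sum of squares divided by the total sum of squares. The short Python sketch below shows that calculation. The function name and the four toy observations are hypothetical and only illustrate the arithmetic.

```python
def r_squared_by_genotype(oil, genotypes):
    """Share of total oil content variation explained by a genotype factor."""
    grand_mean = sum(oil) / len(oil)
    by_genotype = {}
    for y, g in zip(oil, genotypes):
        by_genotype.setdefault(g, []).append(y)
    ss_total = sum((y - grand_mean) ** 2 for y in oil)
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in by_genotype.values()
    )
    return ss_between / ss_total


# Toy data (made up): two homozygous genotype classes.
oil = [30.0, 32.0, 40.0, 42.0]
genotypes = ["AA", "AA", "GG", "GG"]

r2 = r_squared_by_genotype(oil, genotypes)
print(round(r2, 4))
```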
RESULTS
Oil content evaluation
Seed oil percentage was measured in the set of 74 inbred lines and ranged from
20.8 % to 53.1 %, with a mean of 39.9 % and a median of 40.8 %. The distribution of values is
shown in Fig. 1.
Fig. 1. Distribution of seed oil percentage among the 74 inbred lines analyzed.
SNP genotyping
The seventy-four Advanta inbred lines were genotyped with all the SNPs generated by
the Consortium. SNPs were filtered for quality based on their rates of missing (no data)
and heterozygous genotypes. After this filtering, 4076 SNPs were retained and used for the
genotype-phenotype association analysis.
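A filter of this kind can be sketched as follows. The text does not report the thresholds that were applied, so the default limits below (10 % missing, 5 % heterozygous) are purely illustrative assumptions, as are the function names, SNP names and toy genotype calls.

```python
def call_rates(calls):
    """Return (missing_rate, het_rate) for one SNP.

    `calls` holds one genotype per line: a two-letter string such as "AA"
    or "AG", or None when the assay produced no data.
    """
    missing = sum(1 for c in calls if c is None)
    called = [c for c in calls if c is not None]
    het = sum(1 for c in called if c[0] != c[1])
    missing_rate = missing / len(calls)
    het_rate = het / len(called) if called else 0.0
    return missing_rate, het_rate


def filter_snps(genotypes, max_missing=0.10, max_het=0.05):
    """Keep SNPs whose missing and heterozygous rates are within the limits.

    The default thresholds are assumptions chosen for illustration.
    """
    kept = []
    for name, calls in genotypes.items():
        missing_rate, het_rate = call_rates(calls)
        if missing_rate <= max_missing and het_rate <= max_het:
            kept.append(name)
    return kept


# Toy calls for 10 lines (made-up data, hypothetical SNP names).
genotypes = {
    "snp_ok": ["AA"] * 5 + ["GG"] * 5,
    "snp_missing": ["AA"] * 4 + [None] * 3 + ["GG"] * 3,
    "snp_het": ["AA"] * 4 + ["AG"] * 3 + ["GG"] * 3,
}
print(filter_snps(genotypes))
```

In inbred lines a high heterozygous rate usually points to an unreliable assay rather than true heterozygosity, which is one reason to filter on it.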
Population Structure
Population structure was taken from a previous analysis in which the STRUCTURE
software was applied to a larger set of lines genotyped with SSRs. That analysis found
four sub-populations among the Advanta inbred lines. The proportion of membership
of each line in every sub-population was used to classify the lines, and the resulting
sub-population factor was included as a covariate in the association mapping model.
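One simple way to turn membership proportions into a classification factor is to assign each line to the sub-population in which it has the largest membership. The text does not state the exact rule that was used, so the Python sketch below is an assumption. The line names and proportions are made up.

```python
def assign_subpopulation(membership):
    """Return the 1-based index of the sub-population with the largest proportion."""
    best = max(range(len(membership)), key=lambda k: membership[k])
    return best + 1


# Toy membership proportions for four sub-populations (made-up values).
memberships = {
    "line_01": [0.85, 0.05, 0.05, 0.05],
    "line_02": [0.10, 0.15, 0.70, 0.05],
    "line_03": [0.30, 0.40, 0.20, 0.10],
}

structure_factor = {line: assign_subpopulation(q) for line, q in memberships.items()}
print(structure_factor)
```

A line such as `line_03`, with no dominant membership, is still forced into one class by this rule. Using the proportions themselves as covariates would be an alternative design.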
Association Mapping Analysis
The statistical analysis of oil content identified association signals with SNPs
located in different linkage group (LG) regions and with different degrees of significance
(Fig. 2). The LG map was constructed on the basis of the previously published SSR consensus
map (Tang et al., 2002). Among all the regions identified, four were chosen for their higher
significance (LGs 10, 11, 13 and 14). For each region, the most significant SNP was selected.
All possible models were fitted considering 1, 2, 3 and 4 of the selected SNPs. The
coefficient of determination ($R^2$) of each possible model is shown in Fig. 3. The model
with the highest $R^2$ is the one that includes all four selected SNPs.
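Four SNPs give 15 non-empty combinations. The sketch below enumerates them and computes the coefficient of determination of an ordinary least-squares fit for each, using only the Python standard library. It assumes a 0/1 coding of the favorable allele in homozygous lines. The SNP names, the toy data and the helper functions are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(y, columns):
    """R^2 of a least-squares fit of y on an intercept plus the given columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(rows, y)) for p in range(k)]
    beta = solve(xtx, xty)  # normal equations; fails if columns are collinear
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def all_subset_r2(y, snps):
    """Map every non-empty combination of SNP names to its model R^2."""
    names = sorted(snps)
    results = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            results[subset] = r_squared(y, [snps[name] for name in subset])
    return results


# Toy data for 8 lines (made up); 1 marks the favorable allele.
oil = [35.0, 38.5, 39.0, 40.5, 41.0, 42.5, 43.0, 47.5]
snps = {
    "snp_LG10": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp_LG11": [0, 0, 1, 1, 0, 0, 1, 1],
    "snp_LG13": [0, 1, 0, 1, 0, 1, 0, 1],
    "snp_LG14": [0, 1, 1, 0, 1, 0, 0, 1],
}

results = all_subset_r2(oil, snps)
for subset, r2 in sorted(results.items(), key=lambda item: item[1]):
    print(len(subset), subset, round(r2, 3))
```

Because every smaller model is nested in the four-SNP model, the full model always has the highest $R^2$ in a comparison of this kind. A criterion that penalizes model size, such as adjusted $R^2$, would be needed to judge whether the extra SNPs are worth including.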
Fig. 2. Results of the AM analysis of Oil Content across the genome.
Fig. 3. Coefficient of determination from models with combinations of the selected SNPs.
Each of these SNPs explained between 15 and 21 % of the total phenotypic variation. The
analysis of the haplotypes of the 74 lines showed that lines with the favorable haplotype
(favorable allele at the 4 selected SNPs) had an average oil content 10 percentage points higher
than lines with the unfavorable haplotype (Fig. 4).
Fig. 4. Increase of mean seed oil content as favorable SNP alleles accumulate up to 4
homozygous alleles.
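The summary behind a plot of this kind can be sketched as follows: count, for each line, how many of the four selected SNPs are homozygous for the favorable allele, then average oil content within each count. The function name and all values below are made up for illustration and are not the study's data.

```python
from collections import defaultdict


def mean_oil_by_favorable_count(oil, favorable):
    """Mean oil content grouped by the number of favorable SNP alleles per line.

    `favorable` holds, for each line, one 0/1 flag per selected SNP
    (1 means the line is homozygous for the favorable allele).
    """
    groups = defaultdict(list)
    for y, flags in zip(oil, favorable):
        groups[sum(flags)].append(y)
    return {count: sum(vals) / len(vals) for count, vals in sorted(groups.items())}


# Toy data for 8 lines and 4 SNPs (made-up values).
oil = [31.0, 33.0, 36.0, 38.0, 40.0, 42.0, 44.0, 46.0]
favorable = [
    (0, 0, 0, 0), (0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 1, 0),
    (1, 1, 0, 0), (1, 1, 1, 0), (1, 1, 1, 1), (1, 1, 1, 1),
]

means = mean_oil_by_favorable_count(oil, favorable)
print(means)
```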
DISCUSSION
Advances in high-throughput genotyping (particularly SNPs) and sequencing
technologies have markedly reduced the cost per data point of molecular markers, and
researchers are therefore moving toward genome-wide association analyses of complex traits
(Zhu et al., 2008). Even though seed oil percentage has a medium to high heritability and
predominantly additive gene action, which facilitates selection in early generations, the
identification of genomic regions associated with the trait is important not only for breeding
purposes but also for understanding the genetics of this complex trait. Using an AM approach,
we identified several regions of the sunflower genome associated with seed oil content. Of
these, four were notable for their higher significance. A model with SNPs located in the
selected regions (one SNP per region) explains 55.8 % of the phenotypic variation. Future work
will focus on increasing the number of environments in which oil content is measured in order
to detect possible genotype by environment interaction.
REFERENCES
Aulchenko, Y.S., Ripke, S., Isaacs, A. and van Duijn, C.M. 2007. GenABEL: an R package for
genome-wide association analysis. Bioinformatics 23:1294-6.
Ersoz, E.S., Yu, J. and Buckler, E.S. 2007. Applications of linkage disequilibrium and association
mapping in crop plants. In: R.K. Varshney and R. Tuberosa (eds.), Genomics Assisted Crop
Improvement Vol. 1: Genomics Approaches and Platforms. pp. 97-119.
Kolkman, J.M., Berry, S.T., Leon, A.J., Slabaugh, M.B., Tang, S., Gao, W., Shintani, D.K.,
Burke, J.M. and Knapp, S.J. 2007. Single nucleotide polymorphisms and linkage disequilibrium
in sunflower. Genetics 177:457-468.
Pritchard, J.K., Stephens, M. and Donnelly, P. 2000. Inference of population structure using
multilocus genotype data. Genetics 155:945-959.
Tang, S., Yu, J.K., Slabaugh, M.B., Shintani, D.K. and Knapp, S.J. 2002. Simple sequence
repeat map of the sunflower genome. Theor. Appl. Genet. 105:1124-1136.
Yu, J., Pressoir, G., Briggs, W.H., Vroh-Bi, I., Yamasaki, M., Doebley, J.F., McMullen, M.D.,
Gaut, B.S., Nielsen, D.M., Holland, J.B., Kresovich, S. and Buckler, E. 2006. A unified
mixed-model method for association mapping that accounts for multiple levels of relatedness.
Nat. Genet. 38:203-208.
Zhu, C., Gore, M., Buckler, E.S. and Yu, J. 2008. Status and prospects of association mapping
in plants. Plant Genome 1:5-20.