Association mapping approach for analysis of QTL determining fatty acid composition and oil content in sunflower seeds

Zambelli, A.¹, Kaspar, M.¹, Grondona, M.¹, Reid, R.¹ and León, A.¹

¹ Biotech Research Center, Advanta Semillas-Nutrisun Business Unit, Ruta 226 Km 60.5 (7620) Balcarce, Argentina. andres.zambelli@advantasemillas.com.ar

Most traits of agricultural importance are controlled by multiple quantitative trait loci (QTL). The aim of many genetic mapping studies is to identify the QTL responsible for phenotypic variation, facilitating genome-aided breeding for crop improvement. Association mapping (AM) is a powerful genetic mapping tool based on the analysis of statistical associations between genotypes and phenotypes in a collection of individuals. Genetic mapping has usually been done in specific populations, such as the progeny of parents that differ for the trait of interest. By contrast, AM can be applied to a set of individuals such as natural populations or germplasm collections. In the present work, an AM approach was assayed in a collection of sunflower inbred lines that were genotyped with a set of about 4000 SNPs and phenotyped for oilseed fatty acid composition and oil content. This allowed the identification of two main QTL associated with stearic and oleic acid, in agreement with previous data, validating the use of genome-wide association (GWA) techniques for QTL analysis of this quality trait. The same mapping approach also allowed the identification of QTL associated with oil content. The use of AM will contribute to a better definition of QTL, or even to the identification of candidate genes associated with fatty acid composition and oil content in sunflower, leading to a better understanding of these complex traits and to the development of new oil qualities.
Key words: genetic diversity; genome-wide association; oil quality; mapping

INTRODUCTION

Even though regular sunflower oil has traditionally been appreciated, emerging oilseed markets are demanding new oil qualities for both food and non-food applications. Several sunflower mutants with modified fatty acid composition have been generated by treatment with ionizing radiation or by chemical mutagenesis. Among them, high stearic, high oleic and high stearic-high oleic sunflower mutants appear to be the most important (Fernández-Martínez et al., 2007; Garcés et al., 2009). Seed oil yield and grain yield are the two main targets of sunflower breeding programs. Seed oil percentage has a medium to high heritability and predominantly additive gene action, which facilitates selection in early generations.

Most traits of agricultural importance are controlled by multiple quantitative trait loci (QTL). The aim of many genetic mapping studies is to identify the QTL responsible for phenotypic variation, facilitating genome-aided breeding for crop improvement. A new and powerful genetic mapping tool is association mapping (AM) (Abdurakhmonov and Abdukarminov, 2008; Myles et al., 2009). AM refers to the analysis of statistical associations between genotypes and phenotypes determined in a collection of individuals. Genetic mapping has usually been done in specific populations, such as the progeny of parents that differ for the trait of interest. By contrast, AM can be applied to a set of individuals such as natural populations or germplasm collections, which is an advantage over linkage or family mapping (Myles et al., 2009).

Two AM methodologies are in use: candidate gene and genome-wide association (GWA). The former assumes a good understanding of the biochemistry and genetics of the trait, while GWA involves testing most segments of the genome for association with the trait.
The hypothesis under consideration is: ‘one (or more) of the genetic loci being considered is either causal for the trait or in linkage disequilibrium with the causal loci’. The strategy of a GWA study is to genotype enough markers across the genome so that functional alleles will likely be in linkage disequilibrium (LD) with at least one of the genotyped markers. The first step in this process is therefore the discovery of a large number of genetic markers, typically single nucleotide polymorphisms (SNPs), as a reference resource.

LD is the basis of genetic mapping, as it is required to detect markers closely linked to QTL. LD is the association between loci and refers to the non-random combination of specific alleles at different loci. The key difference between association and family mapping is the control the experimenter has over recombination. In AM, genotype and phenotype data are collected from a population in which relatedness is not manipulated, so there is no control over LD (Myles et al., 2009). Investigations carried out in different crops showed that LD decreases with the distance between loci and that the rate of decay is slower in inbred lines than in wild populations. It was proposed that the decay in modern sunflower is sufficient for very high density genetic mapping and high-resolution AM, which can be achieved with marker densities lower than those usually reported in the literature (Kolkman et al., 2007; Fusari et al., 2008).

AM exploits the recombination events that occurred over the evolutionary history of the population, which provides higher mapping resolution (Myles et al., 2009). However, the uncontrolled population design can result in spurious signals of association in downstream analysis. False positive associations between markers and traits can arise from population structure caused by selection, plant improvement, etc. Thus, taking population structure into consideration is a critical prerequisite in association analyses.
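The notion of LD between a marker and a nearby locus can be made concrete with the standard r² statistic for two biallelic loci. The sketch below is illustrative only: it assumes fully homozygous inbred lines, so that each line contributes a single haplotype coded 0/1, and the function name `ld_r2` and the toy allele vectors are hypothetical, not data from this study.

```python
def ld_r2(locus_a, locus_b):
    """Squared allelic correlation (r^2) between two biallelic loci.

    Each list holds one allele code per inbred line: 0, 1, or None (missing).
    Assumption: lines are homozygous, so a genotype doubles as a haplotype.
    """
    # Keep only the lines genotyped at both loci.
    pairs = [(a, b) for a, b in zip(locus_a, locus_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no lines genotyped at both loci")
    n = len(pairs)
    p_a = sum(a for a, _ in pairs) / n           # frequency of allele 1 at locus A
    p_b = sum(b for _, b in pairs) / n           # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n  # haplotype 1-1
    d = p_ab - p_a * p_b                         # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


# Hypothetical allele codes for eight inbred lines.
marker = [1, 1, 1, 1, 0, 0, 0, 0]
linked = [1, 1, 1, 0, 0, 0, 0, 0]      # differs from the marker in one line
unlinked = [1, 1, 0, 0, 1, 1, 0, 0]    # alleles combine at random with the marker

r2_linked = ld_r2(marker, linked)
r2_unlinked = ld_r2(marker, unlinked)
print(round(r2_linked, 3), round(r2_unlinked, 3))  # prints 0.6 0.0
```

An r² near 1 means the marker is a good proxy for the second locus, whereas an r² near 0 means that an association test at the marker carries no information about it.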
If pedigree data from inbred lines can be reconstructed and used, type I and type II error rates are better controlled, which further improves the analysis (Yu et al., 2006). Another source of false positives is the occurrence of alleles at very low frequency. Successful application of AM also requires comprehensive phenotypic data. This turns into another benefit of AM, because the analysis can be based on historical breeding trials, which abound in companies and crop improvement centers. The methods for marker-trait association may differ for discrete and quantitative traits, and different statistical analyses have been applied: contingency tables, ANOVA, general linear models and mixed linear models, among others. In this way, AM is a complementary approach to linkage analysis in terms of providing prior knowledge, cross-validation and statistical power for QTL investigation.

The objective of the present study was to perform a whole-genome association mapping analysis for oil quality and oil content in a set of sunflower inbred lines in order to identify the genomic regions associated with these traits.

MATERIAL AND METHODS

Phenotype data

A set of 89 inbred lines (high linoleic conventional sunflower) was selected to represent the total genetic variability present in the Advanta sunflower breeding program. The selection was based on allele sharing distances obtained from SSR genotypes and on pedigree information. For oil quality analysis, twenty plants from each line were sown in 2008 in Balcarce (two planting dates) and Venado Tuerto, Argentina. Seeds from each line were harvested in bulk and the fatty acid composition was evaluated by gas chromatography. Of the 89 lines sown at the second planting date in Balcarce, 74 produced enough seed (50 g) to measure oil content by the Soxhlet method in duplicate.

Genotype data

A set of SNPs was discovered and validated by a multi-company Consortium led by The University of Georgia (Athens, GA, US).
The 89 conventional inbred lines from Advanta's germplasm were genotyped by the Consortium with 6984 SNPs using a high-throughput genotyping system (Illumina GoldenGate).

Population structure

With the aim of controlling for spurious associations, population structure was investigated using STRUCTURE software (Pritchard et al., 2000). The analysis showed that there were four sub-populations among the Advanta inbred lines. The proportion of membership of each line in every sub-population was used to classify the lines, and the resulting sub-population factor was included as a covariate in the association mapping model.

Association analysis

Associations between both traits and all SNP markers were analyzed using an additive linear regression model estimated by the mlreg function of the GenABEL package (http://www.genabel.org/) for R statistical software. For oil content in particular, once regions were detected, the SNP with the highest association with the trait was selected within each region. For every selected SNP, the effect of the favorable/unfavorable allele was studied and the percentage of total variation explained by the SNP was determined with the following model:

OilContent_ij = µ + SNP_i + ε_ij

where µ is the mean oil content, SNP_i is the effect of the i-th genotype on the mean oil content, and ε_ij is the residual error. Additionally, models combining 2, 3 or 4 SNPs were studied.

RESULTS

SNP genotyping

Eighty-nine Advanta inbred lines were genotyped with all the SNPs generated by the Consortium. SNP quality was assessed from the rates of missing (no data) and heterozygous genotypes (see table below). After this analysis, 4076 SNPs were selected and used for the genotype-phenotype association analysis (Table 1).

Table 1. Selection of SNPs used for genotyping 89 inbred lines for AM.
SNPs were retained if they mapped to a single locus, were heterozygous in fewer than 10% of the lines, had missing data in fewer than 10% of the individuals genotyped, and had a minor allele frequency higher than 10%.

Category                                                   Number of SNPs
SNPs mapped in one position                                6984
SNPs not genotyped (missing data for all the lines)        163 (-2.33%)
SNPs with 10% of missing data or more                      391 (-5.6%)
SNPs with less than 10% of missing data                    6430
SNPs with 50% or more in heterozygosis                     774 (-12.04%)
SNPs with 25-50% in heterozygosis                          612 (-9.52%)
SNPs with 10-25% in heterozygosis                          409 (-6.36%)
SNPs with less than 10% in heterozygosis                   4635
SNPs with one genotype in 95-100% of the lines             334 (-7.2%)
SNPs with one genotype in 90-95% of the lines              225 (-4.85%)
SNPs effectively used for the AM analysis                  4076

Oil quality

Statistical analyses for stearic and oleic acid content showed signals of association with markers located on different linkage groups (LG), and these were always the same across the environments assayed. In particular, strong signals were found on LGs 1 and 14, in agreement with previously reported candidate genes and QTL (Fig. 1). On LG 1, a chromosome region was identified with a strong signal of association with stearic acid content, located close to the mapped stearoyl-ACP desaturase locus (Pérez-Vich et al., 2002); this enzyme is involved in the conversion of stearic acid into oleic acid. The AM approach also identified a chromosome region on LG 14 with a strong signal for oleic acid content, coinciding with the mapped oleoyl-PC desaturase (Hongtrakul et al., 1998), which catalyzes the desaturation of oleic acid into linoleic acid.

Fig. 1. Association analysis between SNPs located on LG 1 (a) and LG 14 (b) and stearic and oleic fatty acid content, expressed as the minus logarithm of the p-value (-log(p-value)) of the marker effect in individual environment analyses. SNP chromosome position is indicated in centiMorgans (cM). The -log(p-value) scale is indicated in the colored right column.
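The per-marker scores summarized in Fig. 1 come from an additive regression of each trait on each SNP with sub-population as a covariate. A minimal sketch of that idea follows; it is not the GenABEL mlreg implementation. It assumes allele dosage coded 0/2 for homozygous lines, treats sub-population as a fixed factor by centering trait and dosage within each group, and replaces the t distribution with a normal approximation so that only the Python standard library is needed. The function names and the eight-line toy data set are hypothetical.

```python
import math
from collections import defaultdict


def center_within_groups(values, groups):
    """Subtract from each value the mean of its own group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def snp_score(dosage, trait, subpop):
    """Additive effect of one SNP and its -log10(p), adjusted for sub-population.

    Centering within groups gives the same slope as a regression that
    includes sub-population as a fixed factor.
    """
    x = center_within_groups(dosage, subpop)
    y = center_within_groups(trait, subpop)
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("SNP does not vary within sub-populations")
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    df = len(x) - len(set(subpop)) - 1       # residual degrees of freedom
    if df <= 0 or rss == 0:
        raise ValueError("not enough residual variation to test the SNP")
    se = math.sqrt(rss / df / sxx)
    t = beta / se
    # Assumption: two-sided p-value from a normal approximation to t.
    p = max(math.erfc(abs(t) / math.sqrt(2)), 1e-300)
    return beta, -math.log10(p)


# Hypothetical data: two sub-populations that differ in mean trait value.
subpop = ["A"] * 4 + ["B"] * 4
trait = [10, 11, 14, 15, 20, 21, 24, 25]
causal = [0, 0, 2, 2, 0, 0, 2, 2]        # tracks the trait within each group
structured = [0, 2, 0, 0, 2, 0, 2, 2]    # tracks group membership only

beta_causal, score_causal = snp_score(causal, trait, subpop)
beta_adj, score_adj = snp_score(structured, trait, subpop)
beta_naive, score_naive = snp_score(structured, trait, ["all"] * 8)
print(beta_causal, beta_adj, beta_naive)
```

In this toy example the second SNP shows an apparent effect when all lines are pooled into a single group, and that effect disappears once sub-population is accounted for, which is the kind of spurious signal the covariate is meant to control.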
Oil content

Statistical model analysis for oil content identified signals of association with SNPs in different LG regions and with different degrees of significance. Some of the regions identified are in agreement with previous QTL analyses. Among all the regions identified, four were chosen because of their high significance. For each region, the most significant SNP was selected. Each of these SNPs explained between 15 and 21% of the total phenotypic variation. The analysis of the haplotypes of the 74 lines showed that lines with the unfavorable haplotype (unfavorable allele at all 4 selected SNPs) had an average oil content 10% lower than lines with the favorable haplotype (Fig. 2).

Fig. 2. Mean seed oil content increases as favorable SNP alleles accumulate, up to 4 homozygous alleles.

DISCUSSION

The AM approach allowed the identification of two main QTL associated with stearic and oleic acid, in agreement with previous data, validating the use of GWA techniques for QTL analysis of fatty acid composition in sunflower oil. Some minor QTL were also detected, and these should be validated. In addition, the AM analysis identified QTL associated with oil content.

As large-scale genotyping is becoming affordable, the collection of high-quality phenotype data is likely to become the main bottleneck of any given mapping study. It is highly recommended that experimental design begin by selecting germplasm with appropriate levels of relatedness and by generating high-quality phenotype data, as these factors will be major determinants of the power to identify QTL.

The use of the AM approach will contribute to a better definition of QTL, or even to the identification of candidate genes associated with fatty acid composition and oil content in sunflower, leading to a better understanding of these complex traits. The results of this study will be useful not only as a source of information about the genetics of the traits but also in marker-assisted breeding programs.
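The single-SNP model used for oil content (OilContent = µ + SNP + ε) and the favorable-allele counts behind Fig. 2 can be sketched in a few lines. The block below is an illustration under stated assumptions: the share of variation explained is computed as the between-genotype sum of squares over the total sum of squares of a one-way model, and all names, genotype codes and oil values are hypothetical rather than taken from the 74 lines analyzed.

```python
def variance_explained(genotypes, trait):
    """Fraction of total trait variation explained by genotype class.

    One-way model: trait = mean + genotype effect + residual.
    """
    groups = {}
    for g, y in zip(genotypes, trait):
        groups.setdefault(g, []).append(y)
    grand_mean = sum(trait) / len(trait)
    ss_total = sum((y - grand_mean) ** 2 for y in trait)
    if ss_total == 0:
        raise ValueError("trait does not vary")
    ss_between = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2
                     for v in groups.values())
    return ss_between / ss_total


def favorable_count(line_genotypes, favorable):
    """Number of selected SNPs at which a line carries the favorable genotype."""
    return sum(1 for snp, genotype in favorable.items()
               if line_genotypes.get(snp) == genotype)


# Hypothetical oil content (%) for six lines at one SNP.
snp = ["AA", "AA", "AA", "BB", "BB", "BB"]
oil = [44.0, 46.0, 45.0, 48.0, 50.0, 49.0]
r2 = variance_explained(snp, oil)

# Hypothetical favorable genotypes at four selected SNPs, and one line.
favorable = {"snp1": "BB", "snp2": "AA", "snp3": "BB", "snp4": "AA"}
line = {"snp1": "BB", "snp2": "AA", "snp3": "AA", "snp4": "AA"}
n_fav = favorable_count(line, favorable)
print(round(r2, 3), n_fav)
```

Grouping lines by their favorable-genotype count and averaging oil content within each group would give the kind of summary shown in Fig. 2.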
REFERENCES

Abdurakhmonov, I.Y. and Abdukarminov, A. 2008. Application of association mapping to understanding the genetic diversity of plant germplasm resources. Int. J. Plant Genomics 2008:574927.

Fernández-Martínez, J.M., Pérez-Vich, B., Velasco, L. and Domínguez, J. 2007. Breeding for specialty oil types in sunflower. Helia 30:75-84.

Fusari, C.M., Lia, V.V., Hopp, H.E., Heinz, R.A. and Paniego, N.B. 2008. Identification of single nucleotide polymorphisms and analysis of linkage disequilibrium in sunflower elite inbred lines using the candidate gene approach. BMC Plant Biol. 8:7.

Garcés, R., Martínez-Force, E., Salas, J.J. and Venegas-Calerón, M. 2009. Current advances in sunflower oil and its applications. Lipid Technol. 21:79-82.

Hongtrakul, V., Slabaugh, M.B. and Knapp, S.J. 1998. A seed specific delta-12 oleate desaturase gene is duplicated, rearranged and weakly expressed in high oleic acid sunflower lines. Crop Sci. 38:1245-1249.

Kolkman, J.M., Berry, S.T., Leon, A.J., Slabaugh, M.B., Tang, S., Gao, W., Shintani, D.K., Burke, J.M. and Knapp, S.J. 2007. Single nucleotide polymorphisms and linkage disequilibrium in sunflower. Genetics 177:457-468.

Myles, S., Peiffer, J., Brown, P.J., Ersoz, E.S., Zhang, Z., Costich, D.E. and Buckler, E.S. 2009. Association mapping: critical considerations shift from genotyping to experimental design. Plant Cell 21:2194-2202.

Pérez-Vich, B., Fernández-Martínez, J.M., Grondona, M., Knapp, S.J. and Berry, S.T. 2002. Stearoyl-ACP and oleoyl-PC desaturase genes cosegregate with quantitative trait loci underlying high stearic and high oleic acid mutant phenotypes in sunflower. Theor. Appl. Genet. 104:338-349.

Pritchard, J.K., Stephens, M. and Donnelly, P. 2000. Inference of population structure using multilocus genotype data. Genetics 155:945-959.

Yu, J., Pressoir, G., Briggs, W.H., Vroh-Bi, I., Yamasaki, M., Doebley, J.F., McMullen, M.D., Gaut, B.S., Nielsen, D.M., Holland, J.B., Kresovich, S. and Buckler, E. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38:203-208.