Association mapping approach to identify genomic regions associated with oil content in sunflower based on SNPs

Marcos Kaspar, María Valeria Paccapelo, Teresa Galella, Martín Grondona, Roberto Reid, Alberto León, and Andrés Zambelli

Balcarce Biotechnology Center, Advanta Semillas SAIC, Ruta 226 Km 60.5 (7620) Balcarce, Buenos Aires, Argentina. E-mail: andres.zambelli@advantasemillas.com.ar

Seed oil yield is, together with grain yield, one of the main traits targeted in sunflower breeding programs. Seed oil percentage has a medium to high heritability (ranging from 0.5 to 0.9 in previous studies) and predominantly additive gene action, which facilitates selection in early generations. To exploit an agronomic trait in breeding, it is important to know the location of the QTL associated with it. Linkage analysis in biparental families typically localizes QTLs to 10 to 20 cM intervals because of the limited number of recombination events that occur during the construction of mapping populations and the cost of propagating and evaluating a large number of lines. Association mapping, also known as linkage disequilibrium mapping, is a relatively new and promising method for the genetic dissection of complex traits. It achieves higher mapping resolution by exploiting the historical recombination events accumulated in a population of unrelated inbred lines. Association mapping takes advantage of ancestral recombination and natural genetic diversity within a population to dissect quantitative traits, and it is built on the concept of linkage disequilibrium. Owing to their higher genome density, lower mutation rate, and better amenability to high-throughput detection systems, SNPs (single nucleotide polymorphisms) are rapidly becoming the markers of choice for dissecting complex genetic traits by association mapping.
The objective of the present study was to perform a whole-genome association mapping analysis for oil content in a set of sunflower inbred lines in order to identify the genomic regions associated with this trait. A set of 74 conventional lines from Advanta's germplasm was selected to represent most of the total genetic variation of the Advanta Breeding Program. The lines were evaluated for oil content in one environment by the Soxhlet method and genotyped using 6821 SNPs covering the 17 linkage groups (LGs) of the sunflower genome. Associations between oil content and each SNP marker were analyzed using an additive linear regression model fitted with the mlreg function of the GenABEL package for the R statistical software. To control for spurious associations, the model takes into account the population structure, which was investigated using the STRUCTURE software. The analysis identified association signals with SNPs in regions of different LGs and with different degrees of significance. Some of the regions identified agree with previous QTL analyses. Among all the regions identified, four were chosen for their high significance. For each region, the most significant SNP was selected; each of these SNPs explained between 15 and 21 % of the total phenotypic variation. Analysis of the haplotypes of the 74 lines showed that lines with the unfavorable haplotype (unfavorable allele at the 4 selected SNPs) had an average oil content 10 % lower than lines with the favorable haplotype. Using an association mapping (AM) approach, four regions associated with oil content were identified. In AM, the accuracy of the phenotypic data is crucial for obtaining clear results, so future work will focus on increasing the number of environments in which oil content is measured. Several studies have identified QTLs associated with oil content.
This approach will contribute to a better definition of the QTL or even to the identification of candidate genes associated with oil content in sunflower, leading to a better understanding of this complex trait. The results of this study will be useful not only as a source of information about the genetics of the trait but also in marker-assisted breeding programs.

Keywords: sunflower, oil content, association mapping, QTL

INTRODUCTION

Seed oil yield is, together with grain yield, one of the two main traits targeted in sunflower breeding programs. Seed oil percentage is a complex trait with a medium to high heritability (ranging from 0.5 to 0.9) and predominantly additive gene action, which facilitates selection in early generations. Most traits of agricultural or evolutionary importance are controlled by multiple quantitative trait loci (QTL). Genetic mapping and molecular characterization of these functional loci facilitate genome-aided breeding for crop improvement. Identification of QTL requires mapping them in the genome using molecular markers. Association mapping (AM), also known as linkage disequilibrium (LD) mapping, has emerged as a tool to resolve complex trait variation down to the sequence level by exploiting historical and evolutionary recombination events at the population level (Zhu et al., 2008). As a new alternative to traditional linkage analysis, association mapping offers three advantages: (i) increased mapping resolution, (ii) reduced research time, and (iii) greater allele number. In contrast to linkage-based studies, LD-based genetic association studies offer a potentially powerful approach for mapping causal genes with modest effects. While linkage analysis is based upon detection of non-random association between a genotype and a phenotype in well-characterized pedigrees, association mapping focuses on associations within populations of unrelated individuals (Ersoz et al., 2007).
Two AM methodologies are in use: candidate gene and whole-genome scan. The first assumes a good understanding of the biochemistry and genetics of the trait, while a genome scan tests most segments of the genome for association with the trait. The hypothesis under consideration is: ‘one (or more) of the genetic loci being considered is either causal for the trait or in linkage disequilibrium with the causal loci’. The first step in this process is therefore the discovery of a large number of genetic markers, typically single nucleotide polymorphisms (SNPs), as a reference resource. AM uses pairwise LD statistics to infer the predictive value of a marker locus for its association with the phenotype of interest. The high-LD chromosomal region around a marker locus defines the predictive range of that marker. If LD within this genomic range is complete, any polymorphism within the range will have the same predictive value for the association with the phenotype. Hence, from a significant marker-phenotype association it can be concluded that the causative polymorphism resides within this high-LD region around the marker locus (Ersoz et al., 2007). AM exploits the recombination events that have occurred over evolutionary history, which provides higher mapping resolution (Kolkman et al., 2007). However, the uncontrolled population design can result in spurious signals of association in downstream analysis. False positive associations between markers and traits can arise from population structure caused by selection, plant improvement, etc. Thus, taking population structure into account is a critical prerequisite in association analyses. Moreover, if pedigree data from inbred lines can be reconstructed and used, control of type I and type II error rates improves (Yu et al., 2006).
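As an illustration of the pairwise LD statistics mentioned above, the following minimal sketch computes r², one commonly used LD measure, between two biallelic loci. The text does not state which LD statistic is meant, so the choice of r² is an assumption, as are the function name `ld_r2` and the 0/1 allele coding. The sketch further assumes fully homozygous inbred lines, so that each line contributes one observed haplotype, and it skips lines with missing calls (coded `None`).

```python
def ld_r2(locus_a, locus_b):
    """r^2 between two biallelic loci, one 0/1 allele call per inbred line.

    Hypothetical helper: assumes homozygous lines, so the two-locus
    haplotype of each line is observed directly. None marks a missing call.
    """
    # Keep only lines genotyped at both loci.
    pairs = [(a, b) for a, b in zip(locus_a, locus_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no lines with calls at both loci")
    p_a = sum(a for a, _ in pairs) / n           # frequency of allele 1 at locus A
    p_b = sum(b for _, b in pairs) / n           # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n  # haplotype 1-1
    d = p_ab - p_a * p_b                         # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                      # a locus is monomorphic
    return d * d / denom


# Toy data (invented): complete LD versus no LD.
complete = ld_r2([0, 0, 1, 1], [0, 0, 1, 1])
independent = ld_r2([0, 0, 1, 1], [0, 1, 0, 1])
print(complete, independent)
```

Under complete LD the two loci carry the same information, which is the situation in which any polymorphism in the high-LD range has the same predictive value for the phenotype.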
Another source of false positives is the occurrence of alleles at very low frequency. There are five main stages in association studies: (1) selection of population samples, (2) determination of the level and influence of population structure on the sample, (3) phenotyping the population sample for the traits of interest, (4) genotyping the population, either for candidate genes/regions or as a genome-wide scan, and (5) testing the genotypes and phenotypes for associations. The objective of the present study was to perform a whole-genome association mapping analysis for oil content in a set of sunflower inbred lines in order to identify the genomic regions associated with this trait.

MATERIAL AND METHODS

A subset of 74 conventional lines from Advanta's germplasm was selected to represent most of the total genetic variation of the Advanta Breeding Program. The selection was based on allele-sharing distances obtained from SSR genotypes and on pedigree information. Twenty plants of each line were sown in 2008 at Balcarce, Argentina. After harvest, 50 g of seeds per line were used to measure the oil content (Soxhlet method). A set of SNPs was discovered and validated by a multi-company Consortium led by The University of Georgia (Athens, GA, US). The 74 conventional inbred lines from Advanta's germplasm were genotyped by the Consortium with 6984 SNPs using a high-throughput genotyping system (Illumina, Golden Gate). Associations between oil content and each SNP marker were analyzed using an additive linear regression model fitted with the mlreg function of the GenABEL package for the R statistical software (Aulchenko et al., 2007). To control for spurious associations, the model takes into account the population structure, which was investigated using the STRUCTURE software (Pritchard et al., 2000).
The model has the following expression:

$$\mathrm{OilContent}_{ijk} = \mu + \mathrm{Structure}_i + \mathrm{SNP}_j + \varepsilon_{ijk}$$

where $\mu$ is the general mean oil content, $\mathrm{Structure}_i$ is the effect of the $i$th subgroup present in the population, $\mathrm{SNP}_j$ is the effect of the $j$th genotype on the mean oil content, and $\varepsilon_{ijk}$ is the residual error. Once regions were detected, the SNP with the strongest association with the trait was selected within each region. For every selected SNP, the effect of the favorable/unfavorable allele was studied, and the percentage of total variation explained by the SNP was determined with the following model:

$$\mathrm{OilContent}_{ij} = \mu + \mathrm{SNP}_i + \varepsilon_{ij}$$

where $\mu$ is the mean oil content, $\mathrm{SNP}_i$ is the effect of the $i$th genotype on the mean oil content, and $\varepsilon_{ij}$ is the residual error. Additionally, models combining 2, 3 or 4 SNPs were studied.

RESULTS

Oil content evaluation

Seed oil percentage was measured in the set of 74 inbred lines and ranged from 20.8 % to 53.1 %, with a mean of 39.9 % and a median of 40.8 %. The distribution of values is shown in Fig. 1.

Fig. 1. Distribution of seed oil percentage among the 74 inbred lines analyzed.

SNP genotyping

The seventy-four Advanta inbred lines were genotyped with all the SNPs generated by the Consortium. SNP quality filtering was based on the rates of missing (no data) and heterozygous genotypes. After this analysis, 4076 SNPs were selected and used for the genotype-phenotype association analysis.

Population Structure

Population structure was taken from a previous analysis in which the STRUCTURE software was applied to a larger subset of lines genotyped with SSRs. That analysis found four sub-populations among the Advanta inbred lines. The proportion of membership of each line in every sub-population was used to classify the lines, and the resulting sub-population factor was included as a covariate in the association mapping model.
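The two regression models given in Material and Methods can be illustrated with a minimal sketch. It is not the mlreg implementation from GenABEL: it is a plain ordinary least squares fit written with the Python standard library, and the helper names (`solve`, `ols`, `design`), the 0/2 dosage coding for homozygous inbred lines, and the toy data are all assumptions made for illustration. The sub-population factor enters as dummy columns, with the first sub-population as the reference level.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(X, y):
    """Least squares fit of y on the columns of X; returns (coefficients, R^2)."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)                            # normal equations
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


def design(subgroups, dosage, use_structure=True):
    """Rows of [intercept, sub-population dummies (optional), SNP dosage]."""
    levels = sorted(set(subgroups))[1:] if use_structure else []
    return [[1.0] + [1.0 if g == lv else 0.0 for lv in levels] + [float(d)]
            for g, d in zip(subgroups, dosage)]


# Toy data (invented, noise-free): six lines, two sub-populations, one SNP.
subgroups = ["A", "A", "A", "B", "B", "B"]
dosage = [0, 2, 2, 0, 0, 2]          # copies of one allele; inbreds are 0 or 2
oil = [40.0, 43.0, 43.0, 42.0, 42.0, 45.0]

# Scan model: OilContent = mu + Structure + SNP + error
beta_scan, r2_scan = ols(design(subgroups, dosage), oil)
# Variance-explained model: OilContent = mu + SNP + error
beta_snp, r2_snp = ols(design(subgroups, dosage, use_structure=False), oil)
print(beta_scan, r2_scan, r2_snp)
```

In a genome scan the first fit would be repeated for every SNP, keeping the structure columns fixed; the second fit gives the share of phenotypic variation attributed to one selected SNP, and adding further dosage columns extends it to the 2-, 3- and 4-SNP models.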
Association Mapping Analysis

The statistical models for oil content identified association signals with SNPs located in regions of different linkage groups (LGs) and with different degrees of significance (Fig. 2). The LG map was constructed on the basis of the previously published SSR consensus map (Tang et al., 2002). Among all the regions identified, four were chosen for their higher significance (LGs 10, 11, 13 and 14). For each region, the most significant SNP was selected. All possible models were fitted using 1, 2, 3 and 4 of the selected SNPs. The coefficient of determination (R²) of each possible model is shown in Fig. 3. The model with the highest R² is the one containing all four selected SNPs.

Fig. 2. Results of the AM analysis of oil content across the genome.

Fig. 3. Coefficient of determination from models with combinations of the selected SNPs.

Each of these SNPs explained between 15 and 21 % of the total phenotypic variation. Analysis of the haplotypes of the 74 lines showed that lines with the favorable haplotype (favorable allele at the 4 selected SNPs) had an average oil content 10 points higher than lines with the unfavorable haplotype (Fig. 4).

Fig. 4. Increase of mean seed oil content as favorable SNP alleles accumulate up to 4 homozygous alleles.

DISCUSSION

Advances in high-throughput genotyping (particularly of SNPs) and sequencing technologies have markedly reduced the cost per data point of molecular markers, so researchers are moving toward genome-wide association analyses of complex traits (Zhu et al., 2008). Even though seed oil percentage has a medium to high heritability and predominantly additive gene action, which facilitates selection in early generations, the identification of genomic regions associated with the trait is important not only for breeding purposes but also for understanding the genetics of this complex trait.
Using an AM approach, we identified several regions of the sunflower genome associated with seed oil content. Of these, four were notable for their higher significance. A model with SNPs located in the selected regions (one SNP per region) explains 55.8 % of the phenotypic variation. Future work will focus on increasing the number of environments in which oil content is measured in order to detect possible genotype by environment interaction.

REFERENCES

Aulchenko, Y.S., Ripke, S., Isaacs, A. and van Duijn, C.M. 2007. GenABEL: an R package for genome-wide association analysis. Bioinformatics 23:1294-1296.

Ersoz, E.S., Yu, J. and Buckler, E.S. 2007. Applications of linkage disequilibrium and association mapping in crop plants. In: R.K. Varshney and R. Tuberosa (eds.), Genomics Assisted Crop Improvement Vol. 1: Genomics Approaches and Platforms. pp. 97-119.

Kolkman, J.M., Berry, S.T., Leon, A.J., Slabaugh, M.B., Tang, S., Gao, W., Shintani, D.K., Burke, J.M. and Knapp, S.J. 2007. Single nucleotide polymorphisms and linkage disequilibrium in sunflower. Genetics 177:457-468.

Pritchard, J.K., Stephens, M. and Donnelly, P. 2000. Inference of population structure using multilocus genotype data. Genetics 155:945-959.

Tang, S., Yu, J.K., Slabaugh, M.B., Shintani, D.K. and Knapp, S.J. 2002. Simple sequence repeat map of the sunflower genome. Theor. Appl. Genet. 105:1124-1136.

Yu, J., Pressoir, G., Briggs, W.H., Vroh-Bi, I., Yamasaki, M., Doebley, J.F., McMullen, M.D., Gaut, B.S., Nielsen, D.M., Holland, J.B., Kresovich, S. and Buckler, E. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38:203-208.

Zhu, C., Gore, M., Buckler, E.S. and Yu, J. 2008. Status and prospects of association mapping in plants. Plant Genome 1:5-20.