Genome wide association study (GWAS) for morphological and yield-related traits in oil palm hybrids (Elaeis oleifera x Elaeis guineensis)

Abstract Background The genus Elaeis has two species of economic importance for the oil palm agroindustry: Elaeis oleifera (O), native to the Americas, and Elaeis guineensis (G), native to Africa. The work presented herein provides, to our knowledge, the rst association mapping study in an interspecic OxG hybrid population of oil palm which presents tolerance to pests and diseases, high oil quality, and acceptable fruit bunch production. Results Using genotyping-by-sequencing (GBS), we identied a total of 3,776 single nucleotide polymorphisms (SNPs) that were used to perform a genome-wide association analysis (GWAS) in 378 OxG hybrids for 10 agronomic traits. Twelve genomic regions were located near candidate genes implicated in multiple functional categories, such as tissue growth, cellular tracking, and physiological processes. Conclusions We provide new insights on candidate genes that mapped on genomic regions involved in plant architecture and yield; however, these potential candidate genes need to be conrmed for future targeted functional analysis. The associated markers may be valuable resources for the development of marker-assisted selection in oil palm breeding. located at the STYK gene, which is involved in the control of stomatal movement in response to CO 2 [49]. Recent studies also showed the role of STYK in stem diameter by increasing the number of xylary bers in species such as Bambusa balcooa [50]. Moreover, Pootakham et al. [20] found a region that controls the phenotypic variance of HT near this region on chromosome 15. The other two regions were observed on chromosomes 5 and 10 for yield-related traits BN and BW. The gene p5.00_sc00003_p0367 potentially associated with BN, codes for a cation/H(+) antiporter. Antiporter proteins function as regulators of monovalent ions, pH homeostasis, and developmental processes in plants [51]. On chromosome 10, the gene p5.00_sc00004_p0097, potentially associated with BW, encodes a zinc nger protein (ZFP). The ZFP is a large protein family, involved in plant development, regulation of plant height, root development, ower development, seed germination, secondary wall thickening, anther development, and fruit ripening [52]. Studies conducted by Wu et al. [53] demonstrated that silencing a gene related to ZFP hampered fruit development in Nicotiana benthamiana [53]. This ZFP gene might play an important role on yield-related traits in oil palm, as shown in other plants, where overexpression of zinc nger proteins are related with higher yields [54].

HT was correlated with morphological-related traits such as FA, LA, LDW and trunk diameter (TD) (r ³ 0.6), while yield and bunch number (BN) showed a high correlation between them (r = 0.91), but a lower correlation (r = 0.57) with bunch weight (BW).
To evaluate the phenotypic similarity among the 378 OxG hybrids, a hierarchical cluster analysis was performed (Figure 2; supplementary Table S1). We found phenotypic differences between the two clusters in agreement to the variability of the morphological-related traits. Overall, the Group II showed the highest mean values for all the morphological-related traits (supplementary Figure S1). For example, OxG from Group II were signi cantly taller (HT = 269 ± 21 cm) compared to OxG from Group I (HT = 238 ± 28 cm) (p ≤ 0.0001). However, yield-related traits did not show any signi cant differences between groups.

SNP calling
A total of 1,058,182,456 raw Illumina sequencing reads from seven Illumina HiSeq lanes were generated for 471 accessions (62 E. oleifera (O), 31 E. guineensis (G), and 378 F 1 OxG). The genotyping of the collection detected 131,825 SNPs covering the 16 oil palm chromosomes. After ltering, 3,776 SNPs with an average of 236 SNPs per chromosome were retained (supplementary Table S2).

Cluster and association analyses
The neighbor-joining (NJ) of the entire population (471 accessions) ( Figure 3A) showed two main groups containing E. oleifera and E. guineensis, as well as three groups within the hybrid population; one group more similar to E. guineensis, another one more similar to E. oleifera, and a largest group that showed intermediate similarity between the two parental species. The PCA analysis of the OxG population (378 hybrids) showed that the rst three components comprised approximately 15.47% of the total variation and categorized the population in three groups supporting the results observed in the NJ tree in accordance with the hybrid nature of our population and its genetic similarity with each parental species ( Figure 3B).
We performed the association analysis with the 378 OxG hybrids and the 3,776 SNPs for seven morphological traits and three yield-related traits (Table 1).
Twelve SNPs were the most signi cantly associated with the traits measured based on p-values across different genomic regions of the oil palm genome before the false discovery rate (FDR) correction (Table 2). Common SNPs for rachis length (RL) and lea et per leaf (LXL) were observed, as well as for HT and LA and between yield and BN, in accordance with the results from the phenotypic correlations between traits. The Q-Q plots ( Figure 4) signi cantly supported the evidence of SNP associations with the traits (p ≤ 0.005) and suggested that population strati cation in the GWAS model was adequately controlled.
The availability of the oil palm genome sequence [29] enabled to associate genomic regions with speci c QTLs into the physical map and to explore potential candidate genes and their possible function. On chromosomes 3, 13, and 15, we identi ed 10 signi cant SNPs located on genomic regions harboring genes associated with the morphological-related traits before FDR correction ( Figure 4 - Table 2). For yield-related traits, we observed two SNPs into two candidate genes on chromosomes 5 and 10, which were non-signi cant after FDR correction ( Figure 4, Table 2). We evaluated if the SNPs found in association with traits were in chromosomes with a larger number of markers, to assess if our results could have been obtained by biases in genotyping. The associated SNPs found in this study (chromosomes 3, 5, 10, 13 and 15) were not located in the chromosomes with the higher number of SNPs identi ed by the GBS approach (supplementary Table S2).
The pair-wise linkage disequilibrium (LD) between the SNPs for the chromosomes that presented genomic regions associated to the evaluated traits is illustrated in supplementary Figure S2. The LD blocks were small for all chromosomes presented, which was expected, considering the out-crossing nature of the species and the hybrid population. The R 2 at or near signi cant SNPs showed that only the genomic region associated on chromosome 15 was in a relatively high LD group (R 2 = 0.58).

Discussion
Improving oil quality and increasing yield per hectare in oil palm is a major concern in the oil processing industry. Agrosavia's palm breeding program has focused on developing interspeci c hybrids of OxG which present heterosis in traits such as resistance to diseases, fruit number, fruit weight, leaf length, and trunk diameter [30]. To our knowledge, this study represents the rst GWAS analysis of an OxG hybrid population.

Phenotypic data
Correlation analysis results among yield-related traits indicated that BN could be a potential and better selection criterion for production than BW in the hybrid population. It is known that the leaf emission rate determines the BN in a palm, which is negatively correlated with the number of leaves [31]. In our study, no signi cant correlations between yield and leaf-related traits (FA, LA, LDW, LXL, RL) were found, however, a previous study in E. oleifera and OxG hybrids found that BN can be higher than the number of leaves just when oil palms produced multiple in orescences [32]. Increases in BN and BW are also expected to correlate with increased mesocarp and kernel oil yields, as shown in other oil palm germplasm studies [33]. Future studies directed to improve oil yields for the OxG hybrid population should be conducted considering their importance in oil palm breeding. GBS technique has been used for identifying candidate genes related to oil bunch [37], bunch weight [20, 37], stem height [21] and oil quality traits studies [38] with the enzymes ApeKI and PstI-MspI. Here, we used the enzyme PstI for the library construction, allowing us to identify relatively high number of SNPs with probably novel genetics variants, which according to Chung et al. [39] is one of the most important advantages of GBS.
In the present study, association mapping resulted in the identi cation of 12 regions related to 10 morphological and yield-related traits (Table 2). However, only ve regions associated with LDW, TD, RL, and LXL remained signi cant (p ≤ 0.05) after the FDR correction. The small LD blocks in the heatmap analysis could suggest that the causal regions are nearby to the most signi cant SNPs; however, a more comprehensive study must be conducted to determine the actual SNPs and or genes controlling the trait of interest. Therefore, we describe the ve most signi cant regions and the genes located within those regions which might represent potential candidate genes involved in the expression of the phenotypic traits evaluated in this study. For morphological traits, a signi cant association was found for LDW on chromosome 3, explaining 10% of the phenotypic variation. The most signi cant SNP in this region was located in a mechanosensitive (MS) ion channel protein 10-like (MSL10) gene. In plants, MS ion channels have been proposed to play a wide array of roles, from the perception of touch and gravity to the osmotic homeostasis of intracellular organelles [40]. Besides, mechanoperception genes are essential for normal cell and tissue growth and development as well as for the proper response to an array of biotic and abiotic stresses [41]. A second signi cant region associated with TD on chromosome 15 was identi ed. Within this region a gene involved in nucleic acid binding and has a C2H2-type Zinc nger domain is located. The C2H2-ZF gene family has been proposed to be involved in the formation of wood, shoot and cambium development in species such as Poplar, as well as playing a role in stress and phytohormone response [42].
For RL and LXL traits, QTLs have been reported on chromosomes 4, 2, 10, and 16 [33]. In our study, three SNPs were associated with three different candidate genes for RL on chromosome 13. The SNP S13_20856724 is the closest to the AGC3 gene and encodes different G proteins. G proteins had been reported to be involved in a wide range of developmental and physiological processes, having a high potential for yield improvement in crops such as rice [43]. The other signi cant association was found with the SNP S13_23674227, which is located in an extracellular ribonuclease gene (RNase gene). RNase genes have been studied for years in plants, playing an important role in plant defense mechanism [44] or plant development due to their ability to modify RNA levels, and thereby in uence protein synthesis [45]. The SNP S13_25522088 was also signi cantly associated with RL and LXL, but further studies are necessary to determine their role, if any, in regulating these traits.
Seven SNPs were not signi cant after FDR correction, possibly due to reduced sample size. The QTL and association studies are limited by the relatively small mapping population sizes resulting in low statistical power, thus rendering small or even medium effect QTL statistically non-signi cant and di cult to detect.
Such statistically underpowered populations may also suffer from severe in ation of effect size estimates (so-called Beavis effect) [46]. Increasing the population size and marker density is required to enable unbiased beavis effect estimation and to achieve higher statistical power [46][47][48], yet for perennial populations (long generation time) and limited offspring numbers this would require a considerable investment.
Although these seven regions were non-signi cant after FDR correction, three of them were located near or alongside genes whose function could be related with the expression of the traits under study. The rst of these regions for HT harbors the SNPs S15_22553489 and S15_22553493 located at the STYK gene, which is involved in the control of stomatal movement in response to CO 2 [49]. Recent studies also showed the role of STYK in stem diameter by increasing In oil palm, harvesting of fruit bunches after a certain age is an arduous labor due to their tallness. For this reason, genotypes with less HT and TD are preferred among oil palm farmers. Likewise, larger foliar (RL and LDW) is related to a greater photosynthetic production which could be involved in higher productivity. But most importantly, increasing the number and weight of fruits means higher productivity per palm and therefore higher incomes for farmers.
For this reason, leveraging QTLs or genes related to these traits, as the ones we identify in this study, could contribute to develop plant breeding strategies such as marker-assisted selection, that helps to select promising accessions in earlier stages (greenhouse conditions), and therefore, reduce the breeding cycle. Further work needs to focus on the biological functions of the set of potential candidate genes found in our research, yet the correlations identi ed in association studies cannot be dubbed as causations.

Conclusions
This is the rst study reporting ve signi cant genomic regions associated with morphological and yield-related traits based upon GWAS on an interspeci c OxG oil palm population. Genes whose functional annotation are potentially related to the corresponding traits are located within those regions and therefore might represent potential candidate genes for these QTLs. Our results will provide the groundwork for the development of marker-assisted breeding in oil palm and will serve as a strong base for future functional studies to dissect high yield production.

Plant material
A total of 471 diverse oil palm accessions (62 E. oleifera (O), 31 E. guineensis (G), and 378 F 1 OxG) were collected from the research centers El Mira and La Libertad of the Colombian Corporation for Agricultural Research (Agrosavia) [55]. The OxG hybrid populations were generated through eight different crossings (eight different E. oleifera accessions as female progenitors were crossed with one E. guineensis accession as male progenitor); however, the parents of these crossings are already deceased. Details of the crosses and origin of accessions are given in supplementary Table S1. The plant material belongs to the National Germplasm Collection of Colombia, maintained by Agrosavia. All samples were collected following national regulations.

Phenotyping
Oil palm seedlings of the hybrid population were planted in a quincunx or triangular system with 10 m between the plants at the Agrosovia's research center El Mira, Tumaco, Colombia. Plants were randomly distributed using a randomized complete block design with four blocks.
Phenotypic data was collected for the subset of 378 F 1 OxG hybrids. A total of 10 traits distributed into two categories, morphological and yield-related, were evaluated ( Table 1)

Statistical analysis of phenotypic data
The correlations among traits were calculated using the Pearson's correlation coe cient (r) at p £ 0.05. To assess the relationship among the studied traits, a principal component analysis (PCA) was calculated. Finally, a hierarchical cluster analysis using the Ward's method was carried out to analyze the relationship among accessions. Differences between clusters by trait were determined using a t-test at p £ 0.0001. All statistical analysis were performed in R v3.42 software [58].

Genotyping
Genomic DNA of the 471 accessions was extracted from leaf tissues using the DNeasy Plant Mini Kit (QIAGEN, Germany). The DNA quality was estimated using the HindIII enzyme and visualized by electrophoresis on 2% agarose gels. The GBS libraries were constructed with the methylation-sensitive restriction enzyme PstI (CTGCAG). Sequencing was performed with 100-bp single-end reads using the Illumina HiSeq 2000 platform (Illumina Inc., United States) at the Institute of Genomic Diversity (Cornell University, Ithaca, NY, United States).

SNP discovery and data processing
Illumina reads were demultiplexed using the standard pipeline from the Tassel v4.5.9 software [59]. Then, reads were mapped to the oil palm reference genome of E. guineensis [60] using Bowtie2 [61] with the very-sensitive option. SNP calling was performed using the following parameters: minor allele frequency (MAF) < 5%, minimum locus coverage (mnLCov) of 0.9, minimum site coverage (mnScov) of 0.7 and minimum taxon coverage (mnTCov) of 0.5.

Cluster and marker-trait association analyses
The clustering analysis for all 471 oil palm accessions was performed by a neighbor-joining algorithm using Tassel  To avoid any possible bias caused by population structure, we included the rst ve principal components of the PCA and a relatedness (kinship) matrix from GAPIT in the mixed linear model. Quantile-quantile (Q-Q) plots from the observed −log 10 p-values and the expected −log 10 p-values were generated to study the appropriateness of the GWAS model. A false discovery rate (FDR) [66] was used to correct for spurious associations.
The heatmap of linkage disequilibrium (LD) was generated with a custom script by plotting pairwise R 2 values against the physical distance (base pairs) between markers on the same chromosome.

Declarations
Ethics approval and consent to participate Not applicable.

Consent to publish
Not applicable.

Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare they have no competing interests.

Funding
Publication of this article has been funded by TV-17 supported by the Ministry of Agriculture and Rural Development of the Republic of Colombia. The funding entities had no role in study design, data collection and analysis, interpretation, decision to publish, or preparation of the manuscript.
Authors' contributions LSB, FEER and SBP conceptualized and conceived the project and its components. SBP, LPD, GAGM, FEER, JAOG, and LPM collected the accessions and the trait data. JAOG and GAGM carried out the genotypic and phenotypic analysis with the supervision of OEC. JAOG and GAGM wrote the manuscript and LSB, OEC and FEER corrected and edited it. All authors reviewed and contributed to draft the manuscript as well as read and approved the nal manuscript. * SNP position relative to the closest candidate gene: SNPs upstream and downstream of candidate genes are specified with "-" and "+", respectively. 0 indicates that SNPs are located within candidate gene.

Additional Files
Additional File Table S1: List of oil palm OxG and parental lines accessions used in the present study.
Additional File Table S2: SNPs identi ed per chromosome using E. guineensis as reference genome.
Additional File Figure S1: Box plots of the two cluster groups for all morphological and yield-related traits. * = signi cant at p £ 0.0001, ns = non-signi cant.
Additional File Figure

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.