Identification of candidate genes responsible for morphological and yield-related traits via genome-wide association analysis (GWAS) in oil palm (Elaeis oleifera x Elaeis guineensis)

Background The genus Elaeis has two species of economic importance for the oil palm agroindustry: Elaeis oleifera (O), native to the Americas, and Elaeis guineensis (G), native to Africa. The breeding program in Colombia relies on interspecific OxG crossing populations with tolerance to pests and diseases, high oil quality, and acceptable fruit bunch production. The identification of loci associated to morphological and yield-related traits and the dissection of their genetic architecture will provide essential insights for oil palm breeding strategies. Results The genotypes of 471 oil palms, including 62 E. oleifera (O), 31 E. guineensis (G) and 378 OxG samples were analyzed in this study. A total of 3,776 single nucleotide polymorphisms (SNP) were detected across the 16 oil palm chromosomes using the genotyping-by-sequencing (GBS) technique. The genetic variation and population structure analyses grouped the samples into two clades according to the parental relatedness. A genome wide association analysis (GWAS) was conducting using the OxG hybrid population, resulting in 12 SNPs significantly associated with ten different morphological and yield-related traits. Conclusions The work presented herein provides to our knowledge the first association mapping study in an interspecific OxG hybrid population of oil palm. We provide new insights on candidate genes involved in tissue development and plant architecture associated to traits such as: rachis length, trunk diameter, bunch number, and bunch weight. The genes identified in our analysis are putative candidates for future targeted functional analysis. They are valuable resources for the development of marker-assisted selection in oil palm breeding. The present study is the first one to report significant SNPs associated with morphological and yield-related traits based on GWAS on an interspecific OxG population. Nine of these SNPs are located within chromosomes reported in previous mapping studies, however, the set of genes presented in our analysis could be of value to locate with more precision the intervals of the reported QTLs on chromosomes 13 and 15. Also, candidate genes discovered on chromosomes 3, 5 and 10 have not been reported for the studied traits. The findings from the present study will provide the groundwork for the development of marker assisted breeding in oil palm and will serve as a strong base for future functional studies to dissect high yield production in oil palm.


Background
The oil palm is an important crop with a higher quality oil and yield potential compared to other vegetable oil crops [1]. Colombia is the fourth-largest producer worldwide with 3.8 tons of palm oil per hectare, which positions Colombia above the world's average yield [2]. Within the Arecaceae family, the oil palm species (Elaeis guineensis), native to West Africa, is the main source of most vegetable oil [3]. However, another oil palm species native to the tropical Central and South America, known as Elaeis oleifera, is recognized for its high yield production [3]. The palm is a perennial monocot with a lifespan of approximately 25 years [4], which results in a slow breeding progress. The Colombian Corporation of Agricultural Research (Agrosavia) breeding program has focused on developing OxG interspecific crosses (E. oleifera x E. guineensis). OxG hybrids are characterized by a slow trunk growth [5], as well as tolerance to bud rot [6][7][8][9] and red ring diseases [10] in comparison to their parents. Additionally, OxG hybrids inherit the parthenocarpic fruit development of E. oleifera species, which allows the production of seedless fruits [11].
Over the last 20 years, a genetic map of the oil palm have been constructed [12].
Saturated genetic linkage maps are important for the identification of genomic regions associated with major genes and quantitative trait loci (QTLs) controlling agronomic traits.
The first marker-based genetic maps for oil palm were generated using restriction fragment length polymorphisms (RFLPs) and amplified fragment length polymorphisms (AFLPs) [13,14]. Dense genetic maps were subsequently constructed using simple sequence repeats (SSRs) and single nucleotide polymorphism (SNP) markers, which have also been used for QTL identification. Thus, Jeennor and Volkaert [15] identified a QTL associated with bunch weight using a mapping population of 69 samples and a genetic map constructed with 89 SSRs and 101 SNPs. Billotte et al. [16], used a multi-parent linkage map generated with 251 SSRs and reported QTLs associated to bunch traits. Similar approaches have enabled the identification of 164 QTLs associated with 21 oil yield component using SSR, AFLP, and RFLP markers [17].
In recent years, advances in next-generation sequencing technology have lowered the cost of DNA sequencing to the point that we can obtain millions of SNPs compared with other technologies [18,19]. The technique genotyping by sequencing (GBS) is a rapid, low-cost, and robust approach to screen breeding populations using hundreds or thousands of SNPs [20]. Pootakham et al. [21] constructed an oil palm map using an F 2 population and 1,085 SNPs derived from GBS and were able to identify QTLs for height and fruit bunch weight. Similarly, a genome-wide association analysis (GWAS), using a larger number of SNPs (4,031) derived from GBS across a diverse panel of E. guineensis, allowed to identify novel QTLs associated with the increasing in the trunk height [22].
GWAS have been proposed as a robust approach over QTL linkage mapping [23]. The use of a wide range of genetic background in GWAS analyses increases the possibility to detect QTL regions associated with traits of interest, compared to the limited genetic variation of a bi-parental mapping population [24]. However, GWAS limitations such as the effect of population structure can lead to spurious associations between a candidate marker and a phenotype [25]. To solve this, the mixed linear model incorporates the structure data (Q) and the relative kinship effects (K), resulting in the reduction of false positive associations [26].
Given the food, industrial, and medical purposes, palm oil has experienced rapid growth in economic importance and nowadays is considered the second most traded vegetable oil crop in the world after soybean [27,28]. The increasing demand of this crop is caused by a shift away from trans-fats to healthier alternatives [29], and because its residues can be processed to produce biofuel [28]. For these reasons, the identification of specific genes involved in morphological traits such as height, foliar area, and its relationship with productivity, is becoming more critical for this crop.
Although previous studies have identified QTLs controlling morphological and yield-related traits in oil palm, these QTLs were detected using F 1 intraspecific populations. Our study is the first report in which molecular markers were mapped through association analysis in an interspecific OxG hybrid populations. Our study aims to: (i) genotype an OxG oil palm mapping population; (ii) analyze genetic diversity, population structure and linkage disequilibrium; (iii) perform GWAS to identify loci or candidate genes involved in yield and other morphological traits for future use in breeding programs.

Plant material
The plant materials used for current GWAS study came from the National Germplasm  [54]. Details of the plant materials can be found in Table S1.

Genotyping and SNP calling
Genomic DNA was extracted from young leaf tissue using the DNeasy Plant Mini Kit (QIAGEN, Germany). All 471 samples were genotyped using the GBS protocol following the procedure cited by Elshire et al. [20]. GBS library preparation and sequencing were performed at the Institute of Genomic Diversity (Cornell University, Ithaca, NY, United States). Briefly, samples were digested with the methylation-sensitive restriction enzyme PstI, which has a six base pair recognition site (CTGCAG). Sequencing was performed with 100-bp single-end reads using the Illumina HiSeq 2000 platform (Illumina Inc., United States).
The raw data was demultiplexed using the standard pipeline from the Tassel v4.5.9 software [55]. Then, reads were mapped to the oil palm reference genome of E. guineensis [56] using Bowtie2 [57] with the very-sensitive option. SNP calling was performed using the following parameters: minor allele frequency (MAF) < 5%, minimum locus coverage (mnLCov) of 0.9, minimum site coverage (mnScov) of 0.7 and minimum taxon coverage (mnTCov) of 0.5. Finally, SNPs were filtered to retain 5% of missing data and biallelic SNPs with VCFtools v0.1.13 software [58]. SNP data was integrated with phenotypic data to perform the GWAS analysis. The physical positions of SNP markers used in the association analysis were obtained from the Genomsawit Website of the International Malaysian Oil Palm Genome Programme (http://gbrowse.mpob.gov.my/fgb2/gbrowse/Eg5_1/).

Genetic diversity and population structure
High quality SNP markers were used to assess oil palm genetic diversity in 471 samples with breeding coefficient (F) values and germplasm relatedness through a Neighbor-Joining tree according to the Nei's genetic distance matrix. Genetic diversity (π), Tajima's D, and population pairwise F-statistics (F ST ) were calculated using VCFtools v0.1.13 [58].
Similarly, the population structure of the 471 samples was estimated using the Admixture v1.3.0 software [59] in both unsupervised and supervised modes.

Phenotypic analysis
The OxG hybrid population (378 samples) has been planted in field and randomly distributed using a randomized complete block design with four blocks. The experimental field is located in the Research Center El Mira. The field was planted in a quincunx or triangular system with 10 meters between the plants; this planting system allowed a density of 115 oil palm trees per hectare. All plants were grown under the same standard agronomic practices.
Phenotypic data of seven morphological measurements and three yield-related traits were used in this study (Table 1). Each trait was measured according to the methodology proposed by Corley et al. [60] and Breure [61]. To assess the relationship among the studied traits, a principal component analysis (PCA) and a Pearson's correlation were calculated. A hierarchical cluster analysis using Ward's method was carried out to analyze the genetic relationship among samples by the use of all phenotypic variables. All analyses were performed using the R statistical package [62].

Marker-trait association analysis
Marker-trait association analysis was performed on morphological and yield-related traits using the OxG hybrid population. A unified mixed linear model with a kinship matrix and PCA results were used in the R package GAPIT (Genome Association and Prediction Integrated Tool) [63]. To remove any possible bias caused by population structure, we included the first five principal components calculated using the R package SNPrelate  the quantile-quantile (QQ) plot supported the appropriateness of the GWAS model (Fig. 3a).
The heatmap of LD was investigated with a custom script by plotting pairwise R 2 values against the physical distance (base pairs) between markers on the same chromosome.

Candidate genes identification
Genes annotations under the candidate gene regions were determined using published genome information of E. guineensis [56]. To assign putative biological functions of significant SNP markers associated with the traits, the flanking sequences of SNPs were queried against databases, such as: HMMER (https://www.ebi.ac.uk/Tools/hmmer/), NCBI

Results
Improving oil quality and increasing yield per hectare in oil palm is a major concern in the oil processing industry. Agrosavia's palm breeding program has focused on interspecific GBS and its combination with GWAS has allowed the genetic dissection of variation in complex traits in many plant species [35][36][37]. Specifically, in oil palm the GBS technique has been used for identifying candidate genes in intraspecific populations related to oil bunch [38], average bunch weight [21,38] and stem height [22] with the enzyme ApeKI, meanwhile the enzymes PstI-MspI have been used for oil quality traits studies [39]. We used GBS with the enzyme PstI in morphological and yield-related traits in an interspecific OxG hybrid population. The use of this enzyme allowed the discovery of new genetic variants, which according to Chung et al. [40] is one of the most important advantages of GBS.
In the present study, association mapping resulted in the identification of 12 SNPs related to 10 morphological and yield-related traits ( Table 2). For morphological traits, a significant association was found for LDW on chromosome 3 explaining 10% of the phenotypic variation. This SNP was located in a mechanosensitive (MS) ion channel protein 10-like (MSL10) gene. In plants, MS ion channels have been proposed to play a wide array of roles, from the perception of touch and gravity to the osmotic homeostasis of intracellular organelles [41]. Besides, mechanoperception genes are essential for normal cell and tissue growth and development as well as for the proper response to an array of biotic and abiotic stresses [42]. A second gene associated with TD was identified on chromosome 15. This gene is involved in nucleic acid binding and has a C2H2-type Zinc finger domain. The C2H2-ZF gene family has been proposed to be involved in the formation of wood and shoot and cambium development in species such as Poplar, as well as playing a role in stress and phytohormone response [43].
For HT trait, different studies have reported associated QTLs in chromosomes 2, 6, 7 and 9 [22,34]. In our study, we reported three candidate genes on chromosome 15, which is similar to the results reported by Pootakham et al. [21]. However, our candidate genes were positioned in the vicinity of the ones reported by Pootakham et al. [21], which highlights the importance of this region (from 19.3 to 23.6 Mbp) in the phenotypic variance of HT ( Table 2). The closest gene to the SNPs S15_22553489 and S15_22553493 SNPs, corresponds to a STYK gene, which is involved in the control of stomatal movement in response to CO 2 [44]. Recent studies also showed the role of STYK gene in stem diameter by increasing the number of xylary fibers in species such as Bambusa balcooa [45].
For RL and LXL traits QTLs have been reported on chromosomes 4, 2, 10 and 16 [34]. In our study, four SNPs were associated with four different candidate genes for RL on chromosome 13. The SNP S13_20856724 is the closest to the AGC3 gene and encodes different G proteins. G proteins had been reported to be involved in a wide range of developmental and physiological processes, having a high potential for yield improvement in crops such as rice [46]. The other significant association was found with the SNP S13_23674227, which is located in an extracellular ribonuclease gene (RNase gene).
RNase genes have been studied for years in plants, playing an important role in plant defense mechanism [47] or plant development due to their ability to modify RNA levels, and thereby influence protein synthesis [48]. Other candidate genes were also found for RL and LXL, but further studies are necessary to determine their role in regulating these traits.
For yield-related traits, previously studies reported associated SNPs in chromosomes 1, 3, chromosomes 5 and 10, explaining 11% of the phenotypic variance. A significant SNP related to Yield and BN was located in the gene p5.00_sc00003_p0367, coding for a cation/H(+) antiporter gene. Antiporter proteins function as regulators of monovalent ions, pH homeostasis, and developmental processes in plants [50]. On chromosome 10, the gene p5.00_sc00004_p0097, associated with BW, encodes the zinc finger protein 8. The zinc finger proteins (ZFP) are a large protein family, involved in plant development, regulation of plant height, root development, flower development, seed germination, secondary wall thickening, anther development, and fruit ripening [51]. Studies conducted by Wu et al. [52] demonstrated that silencing a gene related to ZFP hampered fruit development in Nicotiana benthamiana [52]. This ZFP gene might play an important role on yield-related traits in oil palm, as shown in other plants, where overexpression of zinc finger proteins are related with higher yields in crops [53] although, further analysis are needed to determine its role in bunch weight and yield in oil palm.
In oil palm, harvesting of fruit bunches after certain age is a very difficult labor due to their tallness. For this reason, genotypes with less HT and TD are preferred among oil palm farmers. Likewise, larger foliar (RL and LDW) is related to a greater photosynthetic production which could be involve in higher productivity. But most importantly, increasing the number and weight of fruits means higher productivity per palm and therefore higher incomes for farmers. For this reason, leveraging QTLs or genes related to these traits could contribute to develop plant breeding strategies such as marker-assisted selection (MAS), that helps to select promising accessions in in earlier stages (greenhouse conditions), and therefore reduce the breeding cycle. Further work needs to be focus on the biological functions of the set of candidate genes found in our research, though the correlations identified in association studies cannot be dubbed as causations.

Conclusions
The present study is the first one to report significant SNPs associated with morphological and yield-related traits based on GWAS on an interspecific OxG population. Nine of these SNPs are located within chromosomes reported in previous mapping studies, however, the set of genes presented in our analysis could be of value to locate with more precision the intervals of the reported QTLs on chromosomes 13 and 15. Also, candidate genes discovered on chromosomes 3, 5 and 10 have not been reported for the studied traits. The findings from the present study will provide the groundwork for the development of marker assisted breeding in oil palm and will serve as a strong base for future functional studies to dissect high yield production in oil palm.

Ethics approval and consent to participate
Not applicable.

Consent to publish
Not applicable.

Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare they have no competing interests.     * SNP position relative to the candidate gene: SNPs upstream and downstream of candidate genes are specified with "-" and "+", respectively. 0 indicates that SNPs are located within candidate gene.  Genetic differentiation of 471 oil palm accessions using 3,776 SNP markers. A.
Clustering of samples (k=2) using the supervised mode on Admixture software, B.
Principal coordinate analysis (PCA) separating the mapping population into three groups according to their breeding origin.  Hierarchical cluster analysis. Samples were clustered using the Ward's method and the squared Euclidean distance.