Genome-wide association study and its applications in the non-model crop Sesamum indicum

Background Sesame is a rare example of a non-model, minor crop for which numerous genetic loci and candidate genes underlying traits of interest have been identified at relatively high resolution. This progress has been achieved through the application of the genome-wide association study (GWAS) approach. GWAS has benefited from the availability of high-quality genomes, re-sequencing data from thousands of genotypes, extensive transcriptome sequencing, and the development of a haplotype map and web-based functional databases in sesame. Results In this paper, we review the GWAS methods, the underlying statistical models and their applications for the genetic discovery of important traits in sesame. A novel online database, SiGeDiD (http://sigedid.ucad.sn/), has been developed to provide access to all genetic and genomic discoveries made through GWAS in sesame. We also tested, for the first time, the application of several new multi-locus GWAS models in sesame. Conclusions Collectively, this work outlines the steps and provides guidelines for efficient GWAS implementation in sesame, a non-model crop. Supplementary Information The online version contains supplementary material available at 10.1186/s12870-021-03046-x.

productivity, however, face different constraints, including limited numbers of improved varieties, shattering of capsules at maturity, non-synchronous maturity, poor stand establishment, profuse branching, low harvest index, drought stress, waterlogging and diseases [10][11][12]. To accelerate sesame improvement, genomics-assisted breeding has been adopted as an efficient approach for developing superior varieties in a short time [13]. To this end, the reference genome sequence of sesame, together with numerous essential genomic resources, was delivered to the scientific community [14]. The haplotype map of the sesame genome was constructed from a re-sequencing project of 705 diverse cultivars from around the world, and two representative genomes were further assembled de novo [15]. These resources are vital to the rapid advancement of sesame research, as they expedite the detection of genetic loci that control important agronomic traits using the genome-wide association study (GWAS) approach. To date, hundreds of causative genetic variants associated with important traits such as oil quality, abiotic stress resistance and seed yield have been discovered. These findings facilitate the use of marker-assisted selection and genomic selection to advance the genetic improvement and overall productivity of sesame. This makes sesame a rare case of a non-model, minor crop for which genomic studies, particularly GWAS, have been very successful.
In this review paper, we first present the GWAS approach and underlying statistical models. Then, the ongoing efforts of genetic discovery through applications of GWAS in sesame are presented in detail. We conclude this paper with important guidelines for better applications of GWAS in sesame.

GWAS approach, underlying statistical models and applications in plants
GWAS approach
Genome-wide association study (GWAS), also known as association mapping or linkage disequilibrium (LD) mapping, takes full advantage of the high phenotypic variation within a species and the large number of historical recombination events in natural populations. It has become an alternative to conventional quantitative trait locus (QTL) mapping for identifying the genetic loci underlying traits at relatively high resolution [15]. In general, GWAS is used to study the association between single-nucleotide polymorphisms (SNPs) and target phenotypic traits. SNP identification has become much easier with advanced high-throughput genotyping techniques. GWAS relies on LD and is carried out by genotyping and phenotyping many individuals in a natural population panel. Unlike traditional QTL mapping, which uses bi-parental segregating populations, GWAS identifies candidate causal genes for traits of interest in natural populations. A key advantage of GWAS is that the same genotyping data and the same population can be reused for different traits.
GWAS has been successfully applied to identify associations at high resolution, detect candidate genes and dissect quantitative traits in humans, animals, and plants [16,17]. In economically valuable crops, GWAS has been used to gain insight into the genetic architecture of important traits, including days to heading, days to flowering, panicle architecture, resistance to rice yellow mottle virus, fertility restoration, and agronomic traits in rice [18][19][20][21]; pattern of genetic change and evolution [22,23], compositional and pasting properties [24], stalk biomass [25] and leaf cuticular conductance [26] in maize; plant height components and inflorescence architecture [27], grain size [28] and grain quality [29] in sorghum; harvest index in maize [30]; flowering time in canola [31]; stress tolerance, oil content and seed quality [32] in brassica; and oil yield and quality [15], yield-related traits [33,34], drought tolerance [35] and vitamin E [36] in sesame.

Statistical models underlying GWAS approach
Single-locus models
Marker-trait associations in GWAS have mostly been detected using one-dimensional genome scans of the population [19,37,38,39]. In this method, one SNP is evaluated at a time. The first model used was the general linear model (GLM), described as $Y = \beta_0 + \beta_1 X$ [40], where $Y$ is the dependent (response) variable, $\beta_0$ is the intercept, $\beta_1$ is a weight or slope (coefficient) and $X$ is the explanatory variable. Subsequently, a popular model referred to as the mixed linear model (MLM; Q+K method) was developed to control multiple testing effects and the bias due to population stratification in GWAS. It is described as $Y = X\beta + Zu + e$ [41], where $Y$ is the vector of observed phenotypes; $\beta$ is an unknown vector containing fixed effects, including the genetic marker, population structure (Q), and the intercept; $u$ is an unknown vector of random additive genetic effects from multiple background QTL for individuals/lines; $X$ and $Z$ are known design matrices; and $e$ is the unobserved vector of residuals. With the MLM, the accuracy of association mapping was reported to be partially improved [17,42,43]. Numerous advanced statistical methods based on the MLM have since been suggested to resolve certain limitations such as false-positive rates, heavy computational burden, and inaccurate predictions [44]. Efficient mixed model association (EMMA) [45], compressed mixed linear model (CMLM) and population parameters previously determined (P3D) [46], and random-SNP-effect mixed linear model (MRMLM) [47] are some of the improved MLM-based single-locus genome-scan approaches proposed so far. Such advanced statistical models are powerful, flexible, and computationally efficient.
EMMA was proposed to reduce the computational load of the MLM likelihood functions by treating the quantitative trait nucleotide (QTN) effect as a fixed effect [17,44,45], while CMLM was proposed to handle large genotype datasets by clustering individuals into groups, so that the kinship matrix is derived among groups rather than among individuals [46]. Overall, despite its limited efficiency in estimating marker effects for complex traits, the single-locus approach is able to handle many markers [47], which is one of its valued features.
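As an illustration of the one-SNP-at-a-time scan described above, the following minimal Python sketch fits the simple linear model Y = β0 + β1X to each SNP in turn. It is only a sketch under stated assumptions: genotypes are coded as 0/1/2 allele counts, the toy data and the names single_snp_test and genome_scan are hypothetical, no population structure (Q) or kinship (K) term is included, and the p-value uses a normal approximation rather than the exact t distribution. It is therefore a plain GLM scan, not the MLM, EMMA or CMLM.

```python
import math

def single_snp_test(genotypes, phenotypes):
    """Fit Y = b0 + b1*X by ordinary least squares for one SNP.

    Returns (b0, b1, z, p): intercept, slope, slope divided by its
    standard error, and a two-sided p-value (normal approximation).
    """
    n = len(genotypes)
    if n < 3:
        raise ValueError("at least three individuals are needed")
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    if sxx == 0:  # monomorphic SNP: no slope can be estimated
        return mean_y, 0.0, 0.0, 1.0
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotypes, phenotypes))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((y - (b0 + b1 * x)) ** 2
              for x, y in zip(genotypes, phenotypes))
    if rss == 0:  # perfect fit
        return b0, b1, math.inf, 0.0
    se = math.sqrt(rss / (n - 2) / sxx)
    z = b1 / se
    p = math.erfc(abs(z) / math.sqrt(2))  # assumption: normal approximation
    return b0, b1, z, p

def genome_scan(snp_matrix, phenotypes):
    """Test every SNP independently, one at a time."""
    return {snp_id: single_snp_test(geno, phenotypes)
            for snp_id, geno in snp_matrix.items()}

# Hypothetical toy data: 8 individuals, 3 SNPs coded as allele counts.
phenotypes = [10.1, 12.0, 13.9, 10.3, 12.2, 14.1, 9.8, 11.9]
snp_matrix = {
    "snp_a": [0, 1, 2, 0, 1, 2, 0, 1],  # tracks the phenotype closely
    "snp_b": [0, 1, 0, 1, 0, 1, 0, 1],  # weakly related to the phenotype
    "snp_c": [1, 1, 1, 1, 1, 1, 1, 1],  # monomorphic
}
results = genome_scan(snp_matrix, phenotypes)
for snp_id, (b0, b1, z, p) in results.items():
    print(snp_id, round(b1, 3), p)
```

Because every SNP is tested independently, the number of tests equals the number of markers, which is what makes the multiple-testing correction discussed next so stringent.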
Although single-locus analysis has been the common approach for testing the association between each SNP and the phenotype in GWAS, earlier reports suggested that it has limitations in resolving potential effects caused by multiple tests, historical genotype effects and pleiotropic effects [17,48]. These reports noted that interactions among genetic variants across the genome are not fully explored when only one SNP is tested at a time. Similarly, the Bonferroni correction used to control false positives due to multiple testing is very stringent in this approach; hence, a significant number of important loci may not be identified by single-locus models, particularly when phenotypic data carry large errors or when multi-locus effects are present [49,50]. Thus, it has been suggested that single-locus genome-scan methods are not well suited to quantitative traits regulated by a few genes with large effects and/or many genes with minor effects [17,49]. Besides, epistatic effects among closely located genes cannot be explored with single-locus methods [51].
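To show how stringent the Bonferroni correction becomes at the marker densities used in sesame, the short Python sketch below computes the corrected threshold for the 1,805,413 SNPs of the 705-accession panel [15]. The significance level of 0.05 and the function name bonferroni_threshold are assumptions made for illustration only.

```python
import math

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

n_snps = 1_805_413  # SNP count reported for the 705-accession panel [15]
threshold = bonferroni_threshold(n_snps)  # alpha = 0.05 is an assumption
neg_log10 = -math.log10(threshold)

print(f"p-value cutoff: {threshold:.2e}")
print(f"-log10(p) cutoff: {neg_log10:.2f}")
```

Under these assumptions the cutoff is about 2.8 x 10^-8, that is, -log10(p) of roughly 7.56, which is well above the empirical threshold of -log10(p) > 6 used for the model comparison later in this review.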

Haplotype-based models
To address some of the limitations of single-locus analysis, haplotype-based models were developed and implemented for some major crops such as wheat, rice, and soybean [52,53]. They are based on a random-SNP-effect mixed linear model (MRMLM) described as $Y = X\beta + Z_k y_k + u + e$, where $Y$ is a vector of estimated genotypic values for all lines; $X$ is an incidence matrix for fixed effects such as population structure; $\beta$ is a vector of fixed effects; $Z_k$ is a vector of genotype indicators for the $k$th SNP; $y_k$ is the random effect of marker $k$, with $y_k \sim N(0, K\sigma_k^2)$; $u$ is the vector of polygenic effects described by the kinship matrix ($K$), with $u \sim N(0, \sigma_a^2)$; and $e$ is the vector of residual errors, with $e \sim N(0, I\sigma_e^2)$. In this multivariate method, several neighboring markers in high LD are clustered into a single multi-locus haplotype, so that haplotypes rather than individual SNPs are evaluated in a multiple GLM framework and tested for association with the traits under selection [48,52,54]. The haplotype-based model is more efficient and reliable than traditional single-locus models in GWAS, as it helps to capture allelic diversity accurately, optimize the use of high-density marker data, enhance the power to discover epistatic interactions and reduce multiple testing [51,52].
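The clustering of neighboring markers in high LD into haplotype blocks can be sketched as follows. This is a simplified illustration, not the procedure of any cited study: LD is approximated by the squared correlation (r²) between 0/1/2 genotype codes, a SNP joins the current block when its r² with the previous SNP reaches a cutoff of 0.8, and the function names, cutoff and toy genotypes are all hypothetical.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors (0/1/2 codes)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    if saa == 0 or sbb == 0:  # a monomorphic SNP carries no LD information
        return 0.0
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return sab * sab / (saa * sbb)

def ld_blocks(ordered_snps, min_r2=0.8):
    """Group position-ordered SNPs into blocks of adjacent markers in high LD.

    ordered_snps: list of (snp_id, genotype_list) sorted by physical position.
    Assumption: a SNP is compared only with the last SNP of the current block.
    """
    blocks = []
    for snp_id, geno in ordered_snps:
        if blocks and r_squared(blocks[-1][-1][1], geno) >= min_r2:
            blocks[-1].append((snp_id, geno))
        else:
            blocks.append([(snp_id, geno)])
    return [[snp_id for snp_id, _ in block] for block in blocks]

# Hypothetical genotypes for four consecutive SNPs in six individuals.
ordered_snps = [
    ("snp1", [0, 0, 1, 1, 2, 2]),
    ("snp2", [0, 0, 1, 1, 2, 2]),
    ("snp3", [2, 0, 1, 0, 2, 1]),
    ("snp4", [2, 0, 1, 0, 2, 1]),
]
blocks = ld_blocks(ordered_snps)
print(blocks)
```

Each block would then be tested as a single multi-allelic marker, so four SNP tests become two haplotype tests in this toy case, which is how the approach reduces the multiple-testing burden.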

Multi-locus models
Multi-locus models are newly developed alternative methods in GWAS based on two-stage algorithms [55][56][57]: a single-locus scan of the entire genome first detects all potentially associated SNPs (QTNs), and these SNPs are then tested jointly in a multi-locus GWAS model to detect true QTNs. These models are well suited to complex quantitative traits regulated by multiple genes/loci and are less influenced by population structure. Advantages of multi-locus models over single-locus models include the detection of multiple genes governing a given trait with high power and efficiency, a low false-positive rate, and no need for the Bonferroni correction for multiple testing, which is known to potentially exclude important loci [17,47,58,59]. Multi-locus models have also substantially improved the quality and depth of association results in GWAS [17,42,53,57,60,61]. The models currently widely implemented in GWAS include the multi-locus mixed model (MLMM) [57], multi-locus random-SNP-effect mixed linear model (mrMLM) [47], integrative sure independence screening expectation-maximization Bayesian least absolute shrinkage and selection operator model (ISIS EM-BLASSO) [50], fast multi-locus random-SNP-effect efficient mixed model association (FASTmrEMMA) [17], polygene-background-control-based least angle regression plus empirical Bayes (pLARmEB) [62], Kruskal-Wallis test with empirical Bayes under polygenic background control (pKWmEB) [58] and fast multi-locus random-SNP-effect mixed linear model (FASTmrMLM) [59,63]. Among the numerous multi-locus models reported to date, the MLMM method proposed by Segura et al. [57] has advantages over other existing multi-locus methods, including penalized logistic regression [64], stepwise regression [65] and Bayesian-inspired penalized maximum likelihood, in terms of computational efficiency, false discovery rate and handling of population structure in GWAS.
Similarly, Korte et al. [66] proposed a mixed-model method, the multi-trait mixed model (MTMM), which detects causal loci for multiple correlated phenotypic traits and simultaneously handles both intra-trait and inter-trait variance components. Likewise, Klasen et al. [61] proposed the Quantitative Trait Cluster Association Test (QTCAT), which analyzes multi-locus associations without population correction techniques; this model performed better in limiting the false-positive and false-negative associations that arise from correction strategies used to mitigate confounding. Multi-Trait Analysis of GWAS (MTAG) is another specific approach, developed by Turley et al. [67], to analyze GWAS summary statistics (meta-analysis). Zhan et al. [68] proposed the Dual Kernel Association Test (DKAT), which uses two separate kernel matrices to describe phenotype and genotype similarities. The advantages of DKAT over existing methods include the ability to test the relationship between multiple traits and multiple SNPs without parametric assumptions, control of Type I error rates, high statistical efficiency and computational scalability [60,68].
Recently, several comparative studies have assessed the capacity of these GWAS models to detect marker-trait associations in different plant species. Overall, multi-locus models were found to be more efficient and powerful than single-locus models in detecting highly significant associations for the traits of interest (Table 1). However, integrating single-locus and multi-locus models has been shown to enhance the power and validity of association analysis of complex traits in GWAS, because single-locus models can detect some loci that multi-locus models fail to identify [54,70].
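The two-stage logic shared by the multi-locus models in this section can be illustrated with a deliberately simple stand-in. The Python sketch below is not mrMLM, FASTmrEMMA or MLMM: stage 1 is a lenient single-locus screen on the fraction of phenotypic variance explained, and stage 2 is plain forward stepwise regression with no kinship or polygenic background control. The function names, both cutoffs and the toy data are hypothetical.

```python
def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residualize(v, on):
    """Remove from v its least-squares projection on `on` (both centered)."""
    denom = dot(on, on)
    if denom < 1e-9:
        return list(v)
    coef = dot(v, on) / denom
    return [x - coef * o for x, o in zip(v, on)]

def gain(x, y, total_ss):
    """Share of the total phenotypic sum of squares explained by x."""
    sxx = dot(x, x)
    if sxx < 1e-9:
        return 0.0
    return dot(x, y) ** 2 / sxx / total_ss

def two_stage_scan(snps, phenotype, screen=0.1, min_gain=0.05):
    y = center(phenotype)
    total_ss = dot(y, y)
    if total_ss == 0:
        return [], []
    # Stage 1: lenient single-locus screen, one SNP at a time.
    centered = {sid: center(g) for sid, g in snps.items()}
    candidates = {sid: x for sid, x in centered.items()
                  if gain(x, y, total_ss) >= screen}
    stage1 = sorted(candidates)
    # Stage 2: joint model built by forward selection (stand-in technique).
    selected = []
    while candidates:
        best = max(candidates, key=lambda s: gain(candidates[s], y, total_ss))
        if gain(candidates[best], y, total_ss) < min_gain:
            break
        x = candidates.pop(best)
        selected.append(best)
        # Condition the phenotype and the remaining SNPs on the chosen SNP.
        y = residualize(y, x)
        candidates = {s: residualize(v, x) for s, v in candidates.items()}
    return stage1, selected

# Hypothetical toy data: one causal SNP, one SNP in LD with it, one null SNP.
snps = {
    "causal": [0, 0, 1, 1, 2, 2, 0, 2],
    "linked": [0, 0, 1, 1, 2, 2, 0, 1],
    "null":   [0, 2, 0, 2, 0, 2, 1, 1],
}
phenotype = [5.1, 4.9, 7.05, 6.95, 9.1, 8.9, 5.0, 9.0]
stage1, selected = two_stage_scan(snps, phenotype)
print("passed stage 1:", stage1)
print("kept in joint model:", selected)
```

In this toy case both the causal SNP and the SNP in LD with it pass the single-locus screen, but only the causal SNP is retained once the markers are considered jointly. This is the behaviour that lets multi-locus models separate true QTNs from linked signals.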

Use of pan-genome vs single reference genome for GWAS
The common approach to studying the genetic variation of a given population relies on the interpretation of genes and variants annotated from the sequence of an existing reference genome [74]. Currently, reference genome sequences of many crops, including rice [75][76][77], sorghum [78], maize [79], Brassica rapa [80], barley [81,82], millet [83], potato [84], tomato [85], and sesame [14] have been reported. Following the generation of high-quality reference genome sequences, several GWAS have been carried out to discover the natural variation among diverse populations. However, the reference-genome-based GWAS approach may not be sufficient to capture differences between or within populations when certain relevant genes are inactive in the reference genome but expressed in the studied populations [86].
Since the discovery of the pan-genome in Streptococcus agalactiae [87], pan-genomes have been constructed by comparing multiple genomes derived from de novo sequence assemblies of various individuals of the same species, including rice [88,89], maize [90], soybean [91], B. napus [92], wheat [93] and, recently, sesame [94] (Table 2). Unlike the reference-genome-based GWAS approach, which depends on SNPs across the entire panel under investigation, the pan-genome approach is more inclusive and can detect abundant variation, including structural variation (SV), copy number variation (CNV), presence/absence variation (PAV), inversions and translocations [30,86]. In this regard, Song et al. [96] reported the direct detection of causal structural variation for target traits (silique length, seed weight and flowering time) in Brassica napus using a PAV-based genome-wide association study (PAV-GWAS) and a pan-genome assembled from eight high-quality genomes. They also reported that SNP-GWAS based on a single reference genome did not detect the causal structural variation in the same population. Their results indicate that pan-genome-based association analysis is a powerful approach that can complement the single-reference-genome approach in detecting new SNP-trait associations. Likewise, the physical position of the sugarcane mosaic virus resistance gene (ZmTrxh) in maize was discovered using a pan-genome assembled from three different genotypes, but not with a single reference genome [90]. Other pan-genome-based GWAS have been conducted in important crops such as rice and pigeon pea [89,97].
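The idea behind PAV-GWAS can be sketched by treating the presence (1) or absence (0) of each dispensable gene as the marker, instead of a SNP genotype. The Python sketch below is a simplified illustration and not the method of Song et al. [96]: it compares carriers and non-carriers with a Welch-type t statistic, ignores population structure, and uses hypothetical gene names, phenotype values and a hypothetical function name.

```python
import math
from statistics import mean, variance

def pav_association(pav_calls, phenotypes):
    """Compare phenotypes of accessions carrying a gene (1) or lacking it (0).

    Returns (mean difference, Welch t statistic), or None when either group
    has fewer than two accessions and no variance can be estimated.
    """
    present = [y for call, y in zip(pav_calls, phenotypes) if call == 1]
    absent = [y for call, y in zip(pav_calls, phenotypes) if call == 0]
    if len(present) < 2 or len(absent) < 2:
        return None
    diff = mean(present) - mean(absent)
    se = math.sqrt(variance(present) / len(present)
                   + variance(absent) / len(absent))
    if se == 0:
        return diff, (math.copysign(math.inf, diff) if diff else 0.0)
    return diff, diff / se

# Hypothetical phenotype values for eight accessions.
phenotypes = [5.0, 5.2, 4.9, 7.1, 7.0, 6.8, 5.1, 7.2]
# Hypothetical presence/absence calls for three dispensable genes.
pav_matrix = {
    "gene_A": [0, 0, 0, 1, 1, 1, 0, 1],  # presence tracks the high values
    "gene_B": [1, 0, 1, 0, 1, 0, 1, 0],  # no clear relationship
    "gene_C": [1, 1, 1, 1, 1, 1, 1, 0],  # absent in a single accession
}
pav_results = {gene: pav_association(calls, phenotypes)
               for gene, calls in pav_matrix.items()}
for gene, result in pav_results.items():
    print(gene, result)
```

A gene that is missing from the single reference genome would never appear in a SNP-based marker set, which is why such a presence/absence marker can reveal associations that reference-based GWAS cannot.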

Diversity and development of GWAS populations in sesame
Morphological and genetic diversity
Sesame is a diploid species that belongs to the division Spermatophyta, subdivision Angiospermae, class Dicotyledoneae, order Tubiflorae, family Pedaliaceae, and genus Sesamum. Pedaliaceae is a small family of 16 genera and 60 species, of which 37 species belong to the genus Sesamum; Sesamum indicum L. is the most commonly cultivated species [10,39,98,99,100]. A large number of varieties and ecotypes adapted to various ecological conditions have been reported worldwide. There are three cytogenetic groups in Sesamum: the 2n = 26 group consists of the cultivated S. indicum along with S. alatum, S. capense, S. schenckii and S. malabaricum; the 2n = 32 group consists of S. prostratum, S. laciniatum, S. angolense and S. angustifolium; and S. radiatum, S. occidentale and S. schinzianum belong to the 2n = 64 group [101][102][103]. So far, extensive morphological variation has been reported in cultivated sesame, including plant height, height to the first capsule, height to the first branch, number of branches, flowering period, flower color, number of flowers per axil, number of capsules per axil, capsule edge number, days to maturity, number of seeds per capsule, number of capsules per plant, seed coat color, seed size, seed oil content, seed yield, and branching habit [11,14,104,105,106,107].

Development of GWAS populations
In China, over 8,000 sesame accessions are deposited in the National Mid-term Gene Bank of China, located at the Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences (OCRI-CAAS) [14]. Similarly, about 4,500 sesame accessions are conserved in the National Long-term Gene Bank in Beijing [107] (Fig. 1). Based on these large collections, strategies to build a sesame core collection started early in the year 2000 using morphological descriptors and, later, molecular tools [14,15,106,107,137]. Ultimately, a sesame core collection encompassing 705 diverse accessions, including 405 landraces and 95 cultivars from China and 205 accessions from 28 other countries, was established at OCRI (Fig. 2). This panel shows ideal characteristics for the implementation of GWAS, including high phenotypic variability, low population structure and genetic differentiation among groups, and a moderate decline in LD (~88 kb) [15]. However, most of the accessions (70.1%) included in this panel represent only one country while the other  [135]. Overall, to explore the genetic bases of economically important agronomic traits and identify possible causative genes, these GWAS panels need to be updated with more materials reflecting diverse agro-ecological backgrounds worldwide.

Advantages and limitations for GWAS implementation in sesame
Advantages
Implementing GWAS on high-quality genome sequences generally results in more accurate prediction and mining of potential causative genes. The high-resolution positioning of SNPs along entire chromosomes can unravel the genetic architecture of target traits; hence, GWAS can detect more significant associations, candidate genes, and genomic locations with high power and efficiency. Since 2014, the availability of a high-quality draft genome of the sesame genotype 'Zhongzhi13' [14] has opened the door to genomic research in sesame. Sesame has a small diploid genome estimated at 350 Mb, of which a 274 Mb draft genome was assembled, with 27,148 predicted protein-coding genes. Another genome sequence, from the modern cultivar 'Yuzhi1', was published during the same period [138]. Progress in genome sequencing technologies, together with falling sequencing costs, has created opportunities for additional genome sequencing projects in sesame. The reference genome was updated to a higher resolution [39], and the genome sequences of sesame landraces including 'Baizhima' and 'Mishuozhima' [15] and of a modern cultivar, 'Swetha' [139], were also published. Furthermore, the assembly of a sesame pan-genome from five different genomes identified 15,890 dispensable genes, providing a rich resource for comprehensive gene discovery and superior allele mining through GWAS [94]. Similarly, the availability of tremendous transcriptome data from diverse sesame tissues, various growth conditions and from wild Sesamum species such as S. radiatum and S. mulayanum (  [15,105,140]. To further facilitate the exploitation of GWAS results as well as all genetic discoveries available in sesame, we have developed a novel database named Sesamum indicum Genetic Discovery Database (SiGeDiD) (http://sigedid.ucad.sn/).
SiGeDiD is a flexible online catalog of all genetic and genomic discoveries including, candidate genes, QTLs and functional molecular markers in sesame (Fig. 3). It is an essential platform for comparative analysis of GWAS projects in sesame and facilitates gene discovery, particularly the identification of pleiotropic genomic regions/genes that have been identified from different GWAS and other genetic/genomic studies.
The website is user-friendly, and we integrated a module that allows researchers to upload their findings directly to SiGeDiD. Currently, the BLAST functionality is unavailable, but SiGeDiD will be updated to make it more interactive and fully functional.
Collectively, the availability of enormous genomic resources, the small genome size of sesame, comprehensive GWAS panels, diverse mapping populations, high genetic diversity, low population structure, and relatively low LD are advantageous for GWAS implementation in sesame.

Limitations
While GWAS provides an opportunity to investigate a range of novel genes associated with important agronomic traits, this method does not necessarily identify causal variants and genes [141]. Once a GWAS is completed, additional steps are often needed to investigate the functional and causal variants and their target genes, for which transgenic experiments may ultimately be required. Sesame, however, is recalcitrant to genetic transformation, so few GWAS-identified SNPs have been validated using a transgenic approach. Besides, although the LD decay distance in sesame (~88 kb) is shorter than that of other self-pollinating crops, including rice (~100-350 kb) [142,143], soybean (~574 kb) [144,145] and brassica (~405 kb) [146], it is longer than that of cross-pollinating species such as maize (~5.39-15.53 kb) [147]. Consequently, the moderate LD decay distance reported in sesame suggests that GWAS may not easily resolve to the causative gene unless a high marker density is used. GWAS could therefore have limited efficiency in detecting trait-associated QTL regions or causative genes in the absence of high marker density. Another limitation of GWAS in sesame is that many sesame cultivars are highly photosensitive, which makes field phenotyping and the collection of reliable data in various regions of the world difficult.
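The link between LD decay and marker density can be made concrete with a back-of-the-envelope calculation. The Python sketch below divides the 274 Mb assembled genome into windows of 88 kb and reports the average number of SNPs per window for the marker sets quoted elsewhere in this review. It assumes, unrealistically, that SNPs are evenly spread along the genome, and the function name is hypothetical, so the figures are only indicative.

```python
ASSEMBLED_GENOME_BP = 274_000_000  # assembled draft genome size quoted in the text
LD_DECAY_BP = 88_000               # LD decay distance reported for sesame [15]

def snps_per_ld_window(n_snps, genome_bp=ASSEMBLED_GENOME_BP, ld_bp=LD_DECAY_BP):
    """Average number of SNPs per LD-sized window, assuming even spacing."""
    n_windows = genome_bp / ld_bp
    return n_snps / n_windows

# SNP counts as reported in the studies summarized in this review.
marker_sets = {
    "re-sequencing panel [15]": 1_805_413,
    "seed coat color GWAS [136]": 42_781,
    "phytophthora blight GWAS [135]": 8_883,
    "vitamin E GWAS [36]": 5_962,
}
density = {name: snps_per_ld_window(n) for name, n in marker_sets.items()}
for name, value in density.items():
    print(f"{name}: {value:.1f} SNPs per {LD_DECAY_BP // 1000} kb window")
```

Under these assumptions, the re-sequencing panel places several hundred SNPs in each window, whereas the smallest GBS-based sets place only about two, which is consistent with the concern that low-density studies may miss loci.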

GWAS applications in sesame
Since 2015, several GWAS projects have been successfully implemented in sesame to uncover the genetic bases of key agronomic traits such as oil content, oil nutrient composition, seed yield and yield-related components, seed coat color, morphological characteristics, disease resistance, salt tolerance, waterlogging resistance, drought tolerance, root traits and nutritional value [15,33-36,135,136,148]. To our knowledge, all GWAS projects conducted so far in sesame were based on a single-locus method (EMMA), and the majority were implemented on the GWAS panel developed at OCRI-CAAS. In this work, we summarize all GWAS results reported by different groups of sesame researchers (Table 6 and Fig. 4). A large-scale GWAS was conducted by investigating the natural variation of 705 sesame accessions based on 169 sets of phenotypic data, including oil content, nutrient composition, yield components, morphological characteristics, growth cycle, coloration and disease resistance. In total, 1,805,413 SNPs were used, which led to the identification of 446 SNPs significantly associated with phenotypic variation. Following in-depth analyses of the major loci, a total of 46 causative genes, including genes related to flower lip color (SiGL3), petiole color (SiMYB113 and SiMYB23), oil content (SiPPO), fatty acid biosynthesis (CXE17 and GDSL-like lipase) and yield (SiACS), were identified [15]. Similarly, a GWAS of 39 yield-related traits was conducted [34] using the same population as the previous study [15]. In total, 646 loci associated with the traits of interest and 48 potential genes significantly associated with the functional loci were identified. The authors reported several candidate homologous genes involved in seed formation and some novel candidate genes (SiLPT3 and SiACS8) that may control capsule length and capsule number [34].
Likewise, variation in tolerance to PEG-induced drought stress and salt stress was investigated by GWAS in 490 diverse sesame accessions (representing 33 countries in Asia, Africa, America and Europe) [33]. For drought stress, a total of 132 significant SNPs were resolved to nine QTLs and 151 genes, of which SiEMF1, SiGRV2, SiCYP76C7, SiGRF5, SiCCD8, SiGPAT3, SiGDH2 and SiRABA1D were detected as potential regulatory genes; for salt tolerance, a total of 120 significant SNPs were resolved to 15 QTLs and 241 genes, of which SiLHCB6, SiMLP31, SiPOD, SiHSFA1, SiDUF538, SiCC-NBS-LRR, SiUDG, SiGPAT3, SiNAC43, SiGDH2, SiCP24, SiWRKY14, SiXXT5, SiXTH15, and SiG6PD1 were detected as potential genes [33]. Later, GWAS was conducted to investigate genetic variants governing drought tolerance in 400 sesame accessions [35]. A total of 140 reliable and stable QTLs were identified and resolved to 10 QTLs. Similarly, 120 genes were identified, of which SiABI4, SiTTM3, SiGOLS1, SiNIMIN1, and SiSAM have high potential to modulate drought tolerance in sesame [35]. This study was the first to validate the function of a candidate gene from GWAS using a transgenic approach. The authors demonstrated that sesame accessions originating from drought-prone agro-ecological regions have fixed several drought-tolerance alleles, although alleles contributing to high yield under drought conditions are far from being fixed. Hence, sesame is mostly considered a resilient crop because of its long-term adaptation to drought-prone agro-ecological regions. Additional GWAS results were also reported recently [36,135,136] (Table 6). Using the genotyping-by-sequencing (GBS) method, the authors of [36] conducted a GWAS on vitamin E and identified eight strongly linked SNPs and 12 genes with various regulatory functions, including transcription regulator HTH, zinc ion binding protein, glycosylphosphatidylinositol (GPI)-anchor biosynthesis and ribosome protein.
They also identified two loci, LG_03_13104062 containing seven genes (SIN_1022039-SIN_1022045) and LG_08_6621957 containing five genes (SIN_1001936-SIN_1001940), which were detected on LGs 3 and 8, respectively, by two different models (GLM and MLM). Hence, the authors suggested that these two co-detected loci have high potential to control vitamin E in sesame. However, owing to the limited number of SNPs (5,962) and the small panel size used in this GWAS, potential loci for this important trait may have been missed. The study in [136] used genotype data from 42,781 SNPs and seed coat color data from an association-mapping panel of 366 sesame germplasm accessions to identify 224 significantly associated SNPs. Based on the four most stable peaks/SNPs significantly associated with sesame seed coat color, the authors retained 92 candidate genes. Of these genes, SIN_1016759 (encoding a predicted PPO) was also reported in a previous GWAS [15] and a QTL mapping study [39]. Using an association-mapping panel of 87 sesame accessions and 8,883 SNPs, a GWAS on phytophthora blight resistance was conducted [135]. The results suggested that SIN_1019016 is one of the candidate genes closely associated with phytophthora blight resistance in sesame. The limited number of SNPs called with the GBS approach and the relatively small number of sesame accessions used in this study could have affected the GWAS output for the trait under investigation. More recently, a comprehensive GWAS conducted by Dossa et al. [148] unraveled the genetic basis of seven root-related traits. They reported 409 significant signals and 19 QTLs containing 32 candidate genes associated with sesame root traits. More importantly, they discovered an orphan gene named 'Big Root Biomass' (SIN_1025576), which modulates sesame root biomass through the auxin pathway [148].
In addition to the published GWAS findings, the OCRI-CAAS sesame research group also has several unpublished GWAS outputs on various agronomic traits, including waterlogging, chlorophyll and salt stress at the seedling stage; notably, a metabolite-based GWAS has also been completed. These results will illuminate the genetic basis of variation in important metabolites such as sesamin and sesamolin in sesame. All candidate genes, QTLs and SNPs will be regularly loaded into SiGeDiD (http://sigedid.ucad.sn/) for further use in sesame breeding projects.

Potential of new statistical models to improve the accuracy and power of GWAS in sesame
To our knowledge, multi-locus models have not yet been employed in sesame GWAS research, and no previous study has compared different GWAS models (single-locus and multi-locus) in sesame. Herein, we tested the application of new GWAS models in sesame on a quantitative trait (root length) and a qualitative trait (seed coat color). Natural variation in the root length of 350 sesame accessions was recorded in a field experiment following the methodology developed by Su et al. [149], and the genotypic data consisted of 1,000,000 common SNPs. For the seed coat color GWAS, 600 sesame accessions and 1,000,000 common SNPs were used [15]. To investigate natural phenotypic variation in seed coat color, mature seeds from five capsules per genotype were collected and photographed with a high-resolution digital camera, and seed coat color, based on red, green, and blue (RGB) values, was recorded following the approach adopted by Zhang et al. [150]. Subsequently, three GWAS models, including two multi-locus models (FASTmrEMMA and mrMLM) and one single-locus model (EMMAX), were selected (mainly because they do not require extensive formatting of the phenotypic and genotypic data) and applied to the phenotypic and genotypic data. We then compared the results of these three models to evaluate their potential to reveal a higher number of marker-trait associations and discover more candidate genes. A total of 190, 181 and 162 significant SNPs (-log10(p) > 6) associated with root length were detected by FASTmrEMMA, mrMLM and EMMAX, respectively. Similarly, 67, 492 and 143 significant SNPs associated with seed coat color were detected by FASTmrEMMA, mrMLM and EMMAX, respectively (Fig. 5a-f; Table 7; Table S1).
Fig. 5 Application of new statistical multi-locus models in sesame. a and b Negative log10 P-values for association with root length (Y-axis) plotted against SNP positions (X-axis) using the multi-locus models mrMLM and FASTmrEMMA, respectively; c Negative log10 P-values for association with root length (Y-axis) plotted against SNP positions (X-axis) using the single-locus model EMMAX; d and e Negative log10 P-values for association with seed coat color (Y-axis) plotted against SNP positions (X-axis) using the multi-locus models mrMLM and FASTmrEMMA, respectively; f Negative log10 P-values for association with seed coat color (Y-axis) plotted against SNP positions (X-axis) using the single-locus model EMMAX. For both traits, the horizontal dash-dot line indicates the significance threshold (P = 10^-6), significant SNPs are highlighted in red, and vertical lines mark the most significant peaks that overlap in at least two models; g Venn diagram showing the shared and uniquely detected significant SNPs for each model in the root length GWAS; h Venn diagram showing the shared and uniquely detected significant SNPs for each model in the seed coat color GWAS. The phenotypic and genotypic data for this analysis were obtained from 350 sesame accessions and 1,000,000 common SNPs for root length, and from 705 sesame accessions and 1,805,413 common SNPs for seed coat color.
Of the significant SNPs associated with root length, 163 were identified by all three models; all the SNPs identified by EMMAX were also identified by both multi-locus models, while 18 SNPs were detected only by FASTmrEMMA and mrMLM (Fig. 5g). For seed coat color, 67 SNPs were detected by all three models and 27 SNPs by two models (mrMLM and EMMAX) (Fig. 5h). By considering all SNPs co-clustered with peak SNPs within a window of 200 kb as QTLs [35], a total of 19 and 34 QTLs were detected by all three models for root length and seed coat color, respectively (Table S1). Within these QTLs, we retrieved 26 and 47 genes for root length and seed coat color, respectively. Based on the robust QTLs co-detected by the different models for root length, nine potential candidate genes, namely SIN_1017810, SIN_101781, SIN_1017812, SIN_1017815, SIN_1017843, SIN_1007064, SIN_1007065, SIN_1020072 and SIN_1017818, are proposed for further functional studies to pinpoint the causative gene(s). For seed coat color, the potential candidate genes identified in our study include SIN_1007188, SIN_1007221, SIN_1023226, SIN_1023227 and SIN_1023228. Interestingly, three genes detected in this study were previously reported by Mei et al. [136].
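The 200 kb window rule used to define QTLs can be sketched as follows. This is only one possible reading of the rule: it assumes a greedy procedure in which the most significant unassigned SNP becomes a peak and every significant SNP on the same linkage group within 200 kb of that peak joins its QTL. The `Snp` record, the function `group_into_qtls` and the toy positions are hypothetical.

```python
from collections import namedtuple

# Hypothetical record for one significant SNP:
# linkage group, position in bp, and -log10(p).
Snp = namedtuple("Snp", ["lg", "pos", "neg_log10_p"])


def group_into_qtls(snps, window_bp=200_000):
    """Greedily group significant SNPs into QTLs around peak SNPs."""
    # Strongest associations first, so each new peak is the top remaining SNP.
    remaining = sorted(snps, key=lambda s: s.neg_log10_p, reverse=True)
    qtls = []
    while remaining:
        peak = remaining[0]
        members = [
            s for s in remaining
            if s.lg == peak.lg and abs(s.pos - peak.pos) <= window_bp
        ]
        qtls.append({"peak": peak, "members": members})
        remaining = [s for s in remaining if s not in members]
    return qtls


# Toy data (hypothetical positions and scores).
toy_snps = [
    Snp(1, 1_000_000, 8.2),
    Snp(1, 1_150_000, 6.5),  # within 200 kb of the first SNP
    Snp(1, 1_900_000, 7.1),  # same linkage group, but too far away
    Snp(4, 50_000, 6.3),
]
qtls = group_into_qtls(toy_snps)
for q in qtls:
    print(q["peak"].lg, q["peak"].pos, len(q["members"]))
```

Other conventions, such as merging overlapping windows or using linkage disequilibrium decay to set the window size, would give different QTL counts, so the window definition should be reported alongside the results.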
Collectively, the comparison of GWAS models indicates that an integrated approach combining single-locus and multi-locus models can improve the capacity and power of GWAS in sesame. This will help to detect more, and novel, marker-trait associations and candidate genes, particularly for quantitative traits. It is also important to note that regions detected simultaneously by several GWAS models are more likely to be strongly associated with the trait under investigation than regions detected by a single model only. Hence, developing diagnostic markers for the co-detected regions could speed up sesame molecular breeding programmes.
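The idea of prioritising associations co-detected by several models can be illustrated with a short Python sketch that tabulates, for each SNP, which models detected it, in the manner of the Venn diagrams in Fig. 5g-h. The helper names `model_support` and `venn_counts` and the toy SNP sets are hypothetical; the real input would be the sets of significant SNPs produced by each model.

```python
from collections import Counter


def model_support(hits_by_model):
    """Map each SNP to the sorted tuple of models that detected it."""
    support = {}
    for model, hits in hits_by_model.items():
        for snp in hits:
            support.setdefault(snp, []).append(model)
    return {snp: tuple(sorted(models)) for snp, models in support.items()}


def venn_counts(hits_by_model):
    """Count SNPs in each exclusive region of the Venn diagram."""
    return Counter(model_support(hits_by_model).values())


# Toy sets of significant SNPs per model (hypothetical).
toy_hits = {
    "FASTmrEMMA": {"s1", "s2", "s3"},
    "mrMLM": {"s1", "s2", "s4"},
    "EMMAX": {"s1"},
}
support = model_support(toy_hits)
counts = venn_counts(toy_hits)

# SNPs supported by at least two models, as candidates for diagnostic markers.
robust = sorted(snp for snp, models in support.items() if len(models) >= 2)
print(counts)
print(robust)
```

Requiring support from at least two models is a design choice made for this sketch; a stricter or looser cut-off could be used depending on how many candidate regions can be followed up.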

Conclusions
Over the last five years, GWAS has been successfully implemented in sesame and is illuminating the genetic basis of many important agronomic traits. Although QTLs (~300) and candidate genes (~250) have already been identified for qualitative and quantitative traits, further traits, including chlorophyll-yield, metabolite-GWAS, waterlogging and heat tolerance, are still under investigation. We envision that all these results will lead to the development of allele-specific diagnostic markers to be used as routine molecular tools in sesame breeding programmes. Although a high-quality sesame reference genome sequence has been developed, it is often difficult to find any candidate gene around the peak SNPs from GWAS. To overcome this limitation, the recently developed sesame pan-genome [94] should be used in future GWAS implementations. The diversity of currently available sesame GWAS panels should be improved by integrating more accessions and wild species from different agro-ecological origins, mainly from Africa. This requires international collaboration between sesame researchers. Furthermore, collaboration to generate comprehensive germplasm characterization data, using precise phenotyping platforms and contrasting environments, will permit more accurate dissection of the genetic architecture of complex traits in sesame. Efforts towards sharing genetic materials between research institutes are crucial for accelerating gene discovery. For example, the re-sequencing data of the fully sequenced 705-accession GWAS panel generated by OCRI are publicly available; if the germplasm could be shared with partners, at least in part, more GWAS projects could be implemented in sesame, particularly on traits that are highly affected by the environment. Similarly, developing a SNP chip could be an alternative for quick, low-cost and easy genotyping of novel sesame collections for future GWAS projects.
The application of new multi-locus GWAS models and the integration of single- and multi-locus models will provide more efficiency and power in future GWAS implementations in sesame. To date, very few studies have validated the numerous GWAS findings in sesame. Therefore, follow-up studies are needed to validate the favorable alleles identified by GWAS in independent populations and with other approaches (classical bi-parental QTL mapping, QTL-seq, etc.). Validation of GWAS findings using transgenic approaches has also been instrumental in several plant species. In sesame, genetic transformation protocols using tissue culture techniques have been reported [151]. More studies on this topic are needed in order to develop a more effective genetic transformation protocol in sesame, for example using the flower dip technique [152]. Hairy root genetic transformation is also a flexible and rapid technique widely adopted in several recalcitrant plants to study gene functions [153]. We propose to develop a hairy root genetic transformation protocol in sesame, combined with new genome editing technologies, to confirm some important GWAS findings. Finally, projects aiming at developing diagnostic molecular markers based on GWAS peak SNPs and their favorable alleles should be initiated. This will considerably accelerate sesame molecular breeding.
Table 7 Summary of significant SNPs associated with root length and seed coat color within the linkage groups (LG) identified by each model during GWAS in sesame
Total 67 491 143
Additional file 1: Table S1. Summary list of total QTLs and candidate genes identified in GWAS for root length and seed coat color along the linkage groups in sesame by multi-locus and single-locus models. Table S2. Summary of QTLs and candidate genes detected by each GWAS model. Table S3. Candidate genes detected in each LG for each model.