Multi-locus genome-wide association studies for five yield-related traits in rice

Background Improving the overall production of rice with high quality is a major target of breeders. Mining potential yield-related loci have been geared towards developing efficient rice breeding strategies. In this study, one single-locus genome-wide association studies (SL-GWAS) method (MLM) in conjunction with five multi-locus genome-wide association studies (ML-GWAS) approaches (mrMLM, FASTmrMLM, pLARmEB, pKWmEB, and ISIS EM-BLASSO) were conducted in a panel consisting of 529 rice core varieties with 607,201 SNPs. Results A total of 152, 106, 12, 111, and 64 SNPs were detected by the MLM model associated with the five yield-related traits, namely grain length (GL), grain width (GW), grain thickness (GT), thousand-grain weight (TGW), and yield per plant (YPP), respectively. Furthermore, 74 significant quantitative trait nucleotides (QTNs) were presented across at least two ML-GWAS methods to be associated with the above five traits successively. Finally, 20 common QTNs were simultaneously discovered by both SL-GWAS and ML-GWAS methods. Based on genome annotation, gene expression analysis, and previous studies, two candidate key genes (LOC_Os09g02830 and LOC_Os07g31450) were characterized to affect GW and TGW, separately. Conclusions These outcomes will provide an indication for breeding high-yielding rice varieties in the immediate future. Supplementary Information The online version contains supplementary material available at 10.1186/s12870-021-03146-8.


Background
Rice (Oryza sativa L.) is one of the three major food crops supporting more than 50% of the whole population worldwide [1]. In 2018, the total rice output accounted for 32.24% of the total grain production in China followed by maize (http://www.stats.gov.cn/). While in the world, the average production of rice from 1994 to 2019 is 654.78 million tonnes per year, accounting for 27.28% of total cereals output (http://www.fao.org/faostat/en/#data/QC/ visualize). The growing global population and the deteriorating environment issue new challenges to the breeding of high-yielding crops [2]. Rice yield is a complex quantitative agronomic trait multiplicatively governed by three major components as the number of grains per panicle, thousand-grain weight, and the number of panicles per plant [3]. Besides, grain size (including grain length, width, and thickness) are also closely related to rice productivity [4]. A previous study reported that rice yield was a representative quantitative trait regulated by several minor genes, and a more efficient tool was needed to develop for exploiting these minor QTLs [5]. Many genes have been reported controlling grain size, grain number, and yield. For example, the GS3 [6] is a major gene controlling rice grain length (GL). A mutation in the second exon changes a cysteine codon (TGC) to a termination codon (TGA) at the protein level, resulting in a diversity of rice GL. The GW5 [7] is an IQ calmodulin-binding motif family protein, which regulates rice grain width and weight. Loss-offunction gw5 showed wider grain compared to the wild type. The WTG1/ OsOTUB1 [8] encodes an otubain-like protease with deubiquitination activity, which expresses in developing grains and panicles. The overexpression of WTG1 showed narrow, thin, and long rice grains as a result of slim cells. The OsSPL13 is a SQUAMOSA promoter-binding-like protein [9,10], which was reported controlling rice grain length, grain number, grain size, and yield.
The genome-wide association study (GWAS) has become a powerful tool for mining QTL associated with complex traits [11,12]. Single-locus GWAS (SL-GWAS) methods such as mixed linear model (MLM) [13], efficient mixed-model association eXpedited (EMMAX) [14], and factored spectrally transformed linear mixed models (FaST-LMM) [15] have been widely used to investigate tremendous genetic variants for agronomic traits. However, these SL-GWAS methods are limited in detecting marginal effects QTNs influenced by the polygenic background and stringent Bonferroni correction [16].
To address the shortcomings of SL-GWAS, multi-locus GWAS (ML-GWAS) has been developed as a multidimensional genome scan method in which the effects of all markers are estimated at the same time [17]. In particular, to solve the problem associated with co-factor selection in the ML-GWAS model when there are many markers, the mrMLM package was proposed, which containing the following six ML-GWAS methodologies: mrMLM (multilocus random-SNP-effect MLM) [16], FASTmrMLM (fast mrMLM) [18], ISIS EM-BLASSO (iterative modified-sure independence screening expectation-maximization-Bayesian least absolute shrinkage and selection operator) [19], pKWmEB (integration of Kruskal-Wallis test with empirical Bayes) [20], FASTmrEMMA (fast multi-locus random-SNP-effect efficient mixed model analysis) [21], and pLARmEB (polygenic-background-control-based least angle regression plus empirical Bayes) [22]. ML-GWAS also has a lower false-positive rate and has been applied successfully to identify significant QTNs with subtle contributions for several agronomic [23][24][25][26]. But no studies have focused on ML-GWAS for yield-related traits in rice as yet. In general, the QTL (Quantitative Trait Loci) refers to the signal identified by single-locus methods, such as GLM, MLM, etc. In such QTL, it mostly contained numerous associated SNPs (Single nucleotide polymorphisms). While in multilocus GWAS methods, when all the potentially associated markers were identified in the first step. These markers were submitted into a model in further analysis and true QTNs (Quantitative Trait Nucleotides) were further confirmed by the likelihood ratio test [16].
In the current study, a large-scale natural population of 529 rice accessions with five yield-related traits and 607,201 SNPs was conducted by a hybrid method of one SL-GWAS (MLM) algorithm and five ML-GWAS (mrMLM, FASTmrEMMA, pLARmEB, pKWmEB, and ISIS EM-BLASSO) models. We aim to investigate common QTNs via multiple methodologies and then deduce potential candidate genes to accelerate molecular marker-assisted breeding and boost rice production.

Phenotypic variation
Five yield-related traits (including GL, GW, GT, TGW, and YPP) were selected to examine whether significant phenotypic variances exist in the yield among the 529 rice varieties. The results manifested that the parameters were varied for accessions to their corresponding traits (Table S1). For instance, the GL ranged from 6.13 to 10.97 mm, with a mean of 8.57 mm. The YPP had a great variation ranging from 4.41 to 92.66 g, whereas the GT possessed the smallest range from 1.61 to 2.45 mm with a CV of 6.86%. The CV of GL, GW, TGW, and YPP were 10.25, 12.36, 16.09, and 44.90%, respectively. Also, the frequency distributions of all five traits obeyed approximately the normal distributions.
Furthermore, the Pearson correlation coefficients (PCC) among the five traits were also estimated ( Fig. 1). All paired traits showed statistically significant differences at the p-value< 0.001 except the relationship between TGW and YPP (p-value< 0.05). GL and GW had a negative relation with PPC = − 0.40, which was corresponding with a previous study [27]. GL was also associated negatively with GT (PCC = -0.21) while positively with TGW (PCC = 0.41) and YPP (PCC = 0.17), respectively. In addition, TGW was observed to positively correlate to GT (PCC = 0.54), GL (PCC = 0.41), and GW (PCC = 0.36), indicating grain size might make a major contribution to grain weight. These results exhibited that there was a close relationship among the five rice traits, which played an important role in regulating the rice grain shape and productivity.

Population structure and linkage disequilibrium analysis
To understand the population structure of the panel, PCA analysis was performed using 607,201 SNPs, which was mentioned in the Methods section. Five conceivable subpopulations were respectively distinguished via PC1, PC2, and PC3 ( Fig. 2a and b). Next, a maximum likelihood phylogenetic tree was analysed by their genetic distances, which was derived from the SNP differences in these genotypes. The population could be divided into six distinct subgroups, 95 indica I (IndI), 74 indica II (IndII), 43 tropical japonica (TrJ), 93 temperate japonica (TeJ), 46 aus, and 178 admixture of the others (Adm), respectively (Fig. 2c). Based on the results from both the phylogenetic tree and PCA, the panel was separated into six groups.
The LD decay distance was further estimated using the identified SNPs. As delineated in Fig. 2d, the genome-wide LD decay rate of all populations was approximately 43 kb, where the r 2 dropped to half of the maximum value. Due to that, the theoretical average marker density was one SNP per 9 kb. In fact, the actual value of SNP density in the genome already reached 0.6 (kb/SNP) as the distribution of these SNPs within the whole genome was summarized in Fig. S1. Therefore, we concluded that these markers were sufficiently dense for detecting the associated QTNs.

SL-GWAS and ML-GWAS analyses
All five yield-related traits were analyzed using one SL-GWAS (MLM) to identify QTL and five ML-GWAS (mrMLM, FASTmrMLM, pLARmEB, pKWmEB, and ISIS EM-BLASSO) methods to identify QTNs (Fig. S2). As for MLM, 152, 106, 12, 111, and 64 SNPs corresponding to 3,19,8,53, and 56 QTLs were found to be tightly associated with GL, GW, GT, TGW, and YPP under the cut-off criterion of p-value =1.65 × 10 − 6 , respectively (Table S2). As figured in QQ plots ( Fig. S2f  and 2i), the curves of GL and TGW were consistent with optimal trends, implying that the false-positive errors were controlled well and the results of the MLM model were reliable.
In addition, we compared the results of ML-GWAS and SL-GWAS and a total of 20 common QTNs were identified (Table S4). Four, four, seven, and five QTNs were discovered associated with GL, GW, TGW, and YPP, respectively. While no common QTN for GT was identified in this study.

Prediction of potential candidate genes
Only the QTNs simultaneously detected by both ML-GWAS and SL-GWAS were further analysed. The qTGW-7-1, which was located at 18,639,992 bp on chromosome 7, was identified associated with TGW using both SL-GWAS and ML-GWAS methods. This QTN was detected tightly related to TGW with LAR-mEB, pKWmEB, and ISIS EM-BLASSO methods with the LOD ranged from 3.57 to 11.64 (Table 1). In the MLM method, this SNP was also significantly (p-value = 3.62× 10 − 9 ) associated with TGW with an R 2 of 7.79% (Table S2). Then, a local LD block (18,623,910-18,644,333 bp) was defined (Fig. 4a) with the step we mentioned in the Method section. Based on genomewide annotation information, LOC_Os07g31440 and LOC_Os07g31450 were extracted from this region. Among them, LOC_Os07g31440 encodes an expressed   (Fig. 4b). These findings revealed that the grain weights of the accessions with favorable haplotype variations were predominantly improved compared to those with unfavourable variations. A previous study reported that CHR729 expresses ubiquitously, such as in stems, leaves, leaf sheaths, young panicles, and flower organs [28]. To further explore the expression pattern of LOC_Os07g31450 in different tissues, we utilized the CREP database to analyse and found LOC_Os07g31450 had the highest expression level in the young panicles (< 1 mm, 3-5 mm, and 10-15 mm) (Fig. 4c). The qGW-9-2 was another QTN simultaneously detected by both SL-GWAS and ML-GWAS. This QTN was identified significantly associated with GW using pLARmEB and pKWmEB methods with the LOD value OSMADS78 has been confirmed to be an important regulator of early seed developmental transition and impacts both rice seed size and quality [29]. Nevertheless, the biological function of OsMADS78 is far from being understood.
Here, our study speculated that OsMADS78 might contribute to rice grain width regulation.

Comparison of SL-GWAS and ML-GWAS results
The conventional single-locus methods like the general linear model (GLM) and MLM have been widely implemented to identify genetic variants in many cereals [30][31][32]. However, these models have certain shortcomings as they neglect the overall effects of multiple loci and suffer from the problem of multiple test corrections for critical values. For example, the stringent threshold leads to missing many robust QTLs, particularly small-effect QTLs in MLM [16]. ML-GWAS methodologies therewith have been developed, such as Fig. 3 Comparison of location of candidate QTNs and cloned genes on chromosomes. The common QTNs mapped in this study are labeled on the right side of chromosomes, and different colors display different traits: black, GL; red, TGW; green, YPP; brown, GW; olive, GT. Location of known genes is marked on chromosomes, and pink highlights genes covered by QTNs mrMLM, FASTmrMLM, LASSO [33], and FarmCPU (Fixed and random model Circulating Probability Unification) [34]. After comparing the statistical power of ML-GWAS with SL-GWAS methods, several studies demonstrated that multi-locus methods have lower false-positive error and higher statistical power than single-locus ones [17,35,36]. According to each ML-GWAS algorithm has its own characteristics and different QTL detection, investigators generally combine the merits of several ML-GWAS methods to mine target QTL for complex agronomic traits [26,37,38].
Likewise, 161, 136, 160, 189, and 171 significant QTNs were identified using ML-GWAS methods linked with the above five traits successively. We noted that the number of QTLs mapped by MLM was less than the QTNs identification of five ML-GWAS algorithms, especially of those about GT and YPP. The previous study observed similar findings in GWAS analysis of soybean seed size, suggesting that the recognition results of the five ML-GWAS methods outperformed those of the two SL-GWAS programs. In addition, the QTNs distribution detected by ML-GWAS approaches were more dispersed compared with MLM. For instance, as described in Fig.  S2a, the significant loci identified by MLM were concentrated near GS3 on chromosome 3. Whereas, many loci on other chromosomes failed to meet the threshold, indicating that it was difficult to find new loci from other chromosomes when applied traditional MLM. Afterward, a lot of significant QTNs presented across ML-GWAS models were not only situated on chromosome 3, but also widely distributed on other chromosomes, among which QTNs examined by at least three methods are worthy of further research. These data explicated that the ML-GWAS algorithms are considered more effective, powerful, and robust when applying to investigate the small-effect QTNs for yield-related traits.

Comparison of QTLs or QTNs detected in our study and previous studies
Over the past decade, numerous rice yield-related genes such as GS3, GW2, and GW5 have been identified and their functional roles were deeply elucidated [39,40]. Among them, GS3, the first molecular characterized QTL for grain size, controls grain weight and length, with minor impacts on grain width and thickness [41]. GW2 (Grain width 2), negatively regulates grain width [42] and GW2 homologs in common wheat plays a critical role in the genetic control of grain weight and protein content traits [43]. GW5 influences grain width and weight acting in the brassinosteroid pathway and overexpression of GSE5/GW5 resulted in narrow grains [44,45].
In the current study, we characterized 74 QTNs for five yield-related traits that were simultaneously identified using two or more ML-GWAS methods. Compared with the mapped loci of the previous studies, 23 QTNs and its ±300 kb genomic ranges overlapped the known genes. On chromosome 3, the QTN hotspot (qGL-3-3, qGL-3-4, and qGL-3-5) was located near the region of GS3 for grain length. The QTN cluster on chromosome 5 (qGW-5-1, qGW-5-2, and qGL-5-1) was mapped nearby GW5. Moreover, there were 51 novel QTNs excluded in the genomic regions of the past reports. Therefore, these identified makers may be the potential QTNs controlling rice grain and productivity.

Dissecting two candidate genes of yield-related traits
Using the efficient mixed-model association, two robust QTLs (qTGW-7-1 and qGW-9-2) were validated with major effects associated with yield-related traits using both SL-and ML-GWAS approaches. In the candidate qTGW-7-1 related to grain weight, LOC_Os07g31450/ CHR729 is a kind of CHD protein, which encoded protein contains 2259 amino acids, belonging to the CHD3 subfamily [28]. CHR729 has been reported to play an essential role in multiple aspects of rice root and seed development. As an example, CHR729 can control seedling development through the gibberellin pathway [28] and affect crown root formation through the auxin signalling pathway [46]. In this study, GWAS results inferred that CHR729 might be related to grain weight, and transcriptional expression analysis reflected that CHR729 was highly expressed in young panicles. The role played by CHR729 in regulating rice grain and even productivity is worthwhile for further study and confirmation.
In the candidate qGW-9-2 involved in grain width, gene LOC_Os09g02830 (OsMADS78) belongs to the MADS transcription factor family. MADS family is a large family with conserved MADS-box domains, whose members widely take part in the key regulatory pathways of plant growth and reproduction (including flower formation) [47]. The OsMADS family in rice participates in controlling flowering time, development of root and seed, especially of flower organs [48,49]. For instance, OsMADS16/SPW1, which is homologous to APETALA3 in Arabidopsis, belongs to the Class B in the ABC model of flower organ development, determining the properties of slates and stamens in rice flower organs [50]. Moreover, OsMADS13 controls ovule identity [51]. OsMADS26 negatively regulates resistance to rice blast and drought tolerance [52]. And OsMADS23, OsMADS25, OsMADS27, OsMADS57, and OsMADS61 determine root development [48]. Although there are 75 members in the OsMADS family [53], nearly half of the members' functions are still unknown. In this study, our findings give a clue that OsMADS78 may be related to grain width in rice. Recently, Paul et al. reported that OSMADS78 modulates early seed developmental transition and impacts rice grain length, grain width, 1000 grain weight, and grain quality [29]. This conclusion showed the high reliability of our results, corresponding with this published research which verified the biological function of OSMADS78 by overexpression experiment [29].

Conclusions
In this study, a total of 74 QTN hotspots were simultaneously detected for five yield-related traits by two or more ML-GWAS methods. Among them, qTGW-7-1 and qGW-9-2, closely associated with TGW and GW separately, were presented across both SL-GWAS and ML-GWAS analyses. Besides, two key annotated genes (LOC_Os07g31450 and LOC_Os09g02830) underlying the above two target genomic ranges were mined. In summary, many robust QTLs and two candidate genes were supposed to potentially modulate grain shape and productivity in rice. This research made a beneficial attempt by a combinatory approach of ML-GWAS methods and will facilitate the detection of yield-related QTNs.

Phenotyping data and statistical analyses
The complete phenotypic records of 529 rice core accessions were downloaded from RiceVarMap v2.0 [54], an integrated dataset of rice genomic variations denoted by Huazhong Agricultural University. This set of germplasm contains diverse rice cultivars. Thereinto, the materials were classified into 299 indica (95 indica I, 74  indica II, 13  The yield-related agronomic characters include grain length (GL), grain width (GW), grain thickness (GT), thousand-grain weight (TGW), and yield per plant (YPP) were downloaded from RiceVarMap2 website (http:// ricevarmap.ncpgr.cn/phenos/). The detailed information, including experimental design, years, replicate, could be found in a study [55]. Meanwhile, the minimum, maximum, mean, standard deviation, range, and coefficient of variation (CV) for each trait were calculated in Table S1. Pearson correlation analysis for phenotypic data was conducted using the SAS 9.4 software (http://www.sas.com/).

Genotyping data processing
High-quality re-sequencing raw data of 529 germplasms were derived from RiceVarMap (http://ricevarmap. ncpgr.cn/v1/) [54]. The detailed information of data processing and BCF files could be found on the RiceVarMap webpage. Raw single nucleotide variants (SNV) were processed by PLINK 1.9 software with parameter --maf 0.05 --geno 0.05 --snps-only. Then the genotypic data was imputed by Beagle 5.0 and a total of 607,201 SNPs were left for further analysis [56].
Clustering analysis, population structure, and LD analysis The matrix of pairwise genetic distance derived from 607,201 SNPs was implemented to construct phylogenetic trees by SNPhylo with the default parameter [57]. Principal component analysis (PCA) and kinship matrix were performed by the Tassel 5.2 program to estimate the population structure [58]. Linkage disequilibrium (LD) between SNPs was estimated as the squared correlation coefficient (r 2 ) of alleles, meanwhile the whole population and sub-populations were implemented by software PopLDdecay [59].

Genome-wide association study
In our case, GWAS was performed in 529 rice varieties with 607,201 high-quality SNPs. A mixed linear model (MLM) was carried out for single-locus method to evaluate the trait-SNP association analysis for agriculture traits using the Tassel software. The first three principal components (PCs) and kinship matrix were used as covariates to correct population structure for decreasing false-positive rate in MLM. The genome-wide significance threshold (p-value = 1.65 × 10 − 6 ) was calculated by negative log(1/n, n is the number of SNPs).
Five ML-GWAS methods within the mrMLM R package (https://cran.r-project.org/web/packages/mrMLM/ index.html) were used to map candidate QTNs, including mrMLM, FASTmrMLM, FASTmrEMMA, pLAR-mEB, and ISIS EM-BLASSO. All parameters were set at default values, and the critical LOD score was set to 3 for robust QTNs at the last stage. All these five methods used the PCA and kinship matrices in our study. The Manhattan and QQ plots for GWAS were displayed using the R package CMplot (https://github.com/ YinLiLin/R-CMplot). QTNs were named as Q + the initial letter of traits name abbreviations + chromosome number + occurrence sequence [60].

Identification of putative genes
The QTNs identified by at least two different ML-GWAS methods were regarded as the putative candidate loci. Local LD blocks containing at least two SNPs were calculated with all imputed SNP using the PLINK 1.9 software [61]. The local LD blocks of each significant QTN were determined via confidence intervals described by Gabriel [62]. The LD heatmap was visualized using Haploview software [63]. All the genes located in the LD block of QTNs were extracted for further analysis. By comprehensive analysis of gene annotation (MSU 6.1, http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_ Projects/o_sativa/annotation_dbs/pseudomolecules/ version_6.1/), protein domain function in previous reports, and transcriptome information (data were deposited in the CREP database, Collections of Rice Expression Profiling, http://crep.ncpgr.cn/), the candidate genes for each trait were further mined.

Phenotypic difference of candidate genes
The haplotypes of the candidate gene were determined based on the missense SNP, then the Wilcox-test was used to test the phenotypic difference among each haplotype. The characters of each trait were visualized with box plots using the R 3.6.1 language [64].