Genome-wide association study of important agronomic traits within a core collection of rice (Oryza sativa L.)

Background Cultivated rice (Oryza sativa L.) is one of the staple food for over half of the world’s population. Thus, improvement of cultivated rice is important for the development of the world. It has been shown that abundant elite genes exist in rice landraces in previous studies. Results A genome-wide association study (GWAS) performed with EMMAX for 12 agronomic traits measured in both Guangzhou and Hangzhou was carried out using 150 accessions of Ting’s core collection selected based on 48 phenotypic traits from 2262 accessions of Ting’s collection, the GWAS included more than 3.8 million SNPs. Within Ting’s core collection, which has a simple population structure, low relatedness, and rapid linkage disequilibrium (LD) decay, we found 32 peaks located closely to previously cloned genes such as Hd1, SD1, Ghd7, GW8, and GL7 or mapped QTL, and these loci might be natural variations in the cloned genes or QTL which influence potentially agronomic traits. Furthermore, we also detected 32 regions where new genes might be located, and some peaks of these new candidate genes such as the signal on chromosome 11 for heading days were even higher than that of Hd1. Detailed annotation of these significant loci were shown in this study. Moreover, according to the estimated LD decay distance of 100 to 350 kb on the 12 chromosomes in this study, we found 13 identical significant regions in the two locations. Conclusions This research provided important information for further mining these elite genes within Ting’s core collection and using them for rice breeding. Electronic supplementary material The online version of this article (10.1186/s12870-019-1842-7) contains supplementary material, which is available to authorized users.


Background
Cultivated rice (Oryza sativa L.) is one of the staple foods for over half of the world's population. Uncovering the genetic basis of natural variations in important agronomic traits in rice landraces is indispensable for ensuring the world's food supply.
In general, linkage mapping is a conventional method for gene mining in rice. However, association mapping based on linkage disequilibrium (LD) has been widely used in rice studies since it was firstly reported in maize [1,2]. Association mapping could overcome the limitations (i.e., limited alleles, high cost and poor mapping resolution) of linkage mapping [3] and enable researchers to use modern genetic technologies for exploiting natural genetic diversity and identifying elite genes in the genome [4]. Furthermore, many candidate genes or loci have been identified in rice through genome-wide association study (GWASs) of agronomic traits [5][6][7][8][9][10], abiotic stress tolerance [11][12][13] and metabolites [14,15].
A population with diverse landraces or cultivars which could be used in crops GWASs is supposed to be a permanent resource and be rephenotyped for many traits [2]. Sampling populations (e.g., core collections and mini core collections) created from rice landraces might be a suitable choice for rice GWASs [16]. Rice landraces are easier to be utilized for breeding than wild rice because they have greater genetic diversity than elite cultivars and represent an intermediate stage of domestication history between wild rice and cultivars [17]. As early as 1920-1964, Ying Ting collected more than 7128 rice landraces from all over China and from some of the other main rice-cultivating countries. This collection is one of the earliest collections of rice germplasm resources in China and was named Ting's collection [18]. Moreover, a rice core collection called Ting's core collection and consisting of 150 accessions selected based on 48 phenotypic traits has been constructed from 2262 accessions of Ting's collection [18]. In Ting's core collection, the average polymorphism information content (PIC) is 0.48, and the average genetic diversity is 0.54 [19]. Furthermore, Ting's core collection has been used in association mapping of 12 agronomic traits [20] and aluminum tolerance [21] with 274 SSR markers. However, no association mapping with higher resolution has been performed for agronomic traits within Ting's core collection.
In the present study, a GWAS of 12 rice agronomic traits was carried out using Ting's core collection of rice landraces with more than 3.8 million high-quality 3.8 million SNPs by whole-genome re-sequencing. Regions identified by the GWAS were compared with those identified as QTL and candidate genes in previous studies. This information will be very useful for rice breeders to improve elite cultivars.

Comparison between Ting's core collection and other populations used in GWASs
Ting's core collection consists of 150 rice landraces that were collected from 20 different provinces of China and from North Korea, Japan, the Philippines, Brazil, Sulawesi, Java, Oceania, and Vietnam (Additional file 2: Table S1). The number of varieties in Ting's core collection is lower than that in a population of Chinese rice landraces [5], a global collection [9] and a mini core collection of japonica rice [8], however, the phenotypic diversity in several agronomic traits in Ting's core collection are comparable to those in above mentioned collections or even higher for some agronomic traits (Fig. 1).

Genome re-sequencing and SNP identification
Whole-genome re-sequencing of Ting's core collection was performed, resulting in a total of 522.4 Gb of clean data with an average sequencing depth of 7.3× and an average coverage of 82.9% of the reference genome (Additional file 2: Table S2). The distribution of SNP positions along each chromosome are shown in Additional file 1: Figure S1. A total of 3,808,730 SNPs and 391,756 InDels with a minor allele frequency > 0.05 were generated, and 386,562 SNPs were found in the CDS region (Additional file 2: Table S3).

Phenotypic variation
A wide range of phenotypic variation in the 12 agronomic traits was revealed in Ting's core collection both in Guangzhou and Hangzhou (Fig. 1). Plant height, grain length, grain width, grain length/width, 100 grains weight, flag leaf length, flag leaf width and flag leaf length/width showed similar distributions in the two locations, while heading days, seed set rate, panicle length and panicle number per plant had different distributions in the two locations. The broad-sense heritability ranged from 56.2% (Heading days) to 96.5% (Grain length) for these traits (Fig. 1).
Population structure and LD estimation in Ting's core collection We performed PCA to identify the population structure of Ting's core collection with all SNPs data, and we observed two subpopulations in Ting's core collection (Fig. 2). The discrimination obtained via a NJ tree based on the SNP data was not identical to that based on Cheng's index method (Additional file 2: Table S1) [19] and showed fairly consistent results with that from the PCA (Fig. 3). Moreover, the LD dropped to the half of its maximum value at a distance of 100~350 kb on the 12 chromosomes, which is agreement with previous measurements [5,9,22,23] (Additional file 1: Figure S2) .
Relative kinship among varieties in Ting's core collection and the effect of controlling type I error using EMMAX In Ting's core collection, most kinship estimates between varieties were zero, and none of the kinship values were larger than 0.5, indicating that these varieties were unrelated (Additional file 1: Figure S3).
Observed versus expected P values for each signal were graphed for estimating the effect of controlling for type I errors. As deviations from expected values demonstrate that the statistical analysis may cause spurious associations [24]. Our result indicated that the false positives were unlikely for all traits except grain length/ width for the EMMAX method used in this study (Additional file 1: Figure S4).

GWAS of 12 agronomic traits
A total of 3,808,730 SNPs were included in a GWAS of 12 agronomic traits using the EMMAX method. Only one association signal's -log 10 (P) value was higher than 6.58 (this value was the significant threshold in this study, please see methods section)-a signal for heading days (Fig. 4a). Thus, we used -log 10 (mBF) = 4.97 as the significance threshold for different traits in our study. A total of 1308 and 4272 significant loci were identified for the 12 agronomic traits in Guangzhou and Hangzhou, respectively ( Table 1). The top-ranking candidate genebased association signals for each trait are shown in Additional file 3: Table S4.
Furthermore, Si et al. (2016) indicated that they considered analyzing the 11 predicted genes within the 260-kb interval centered on the index SNP from the GWAS given the estimated LD decay rate of about 100 to 200 kb [25]. Thus, we analyzed whether some of the significant detections for each trait were identical in the two locations according to the estimated distance of LD decay of 100 to 350 kb on the 12 chromosomes (Additional file 1: Figure  S2). Three significant regions (located on chromosomes 5, 6 and 7) for seed set rate were detected both in Guangzhou and Hangzhou. Moreover, two significant regions for flag leaf length/width were detected (located on chromosomes 10 and 12) in both locations (Figs. 4b, d, 5a, b, c, d and Table 1). Moreover, we chose the top 16 most significant signals (P value < 1 × 10 − 6 ) for in-depth analysis (Tables 2 and 3). The significant association signals with smaller P values and higher consecutive peaks for each trait are summarized in Table 3, Figs. 4 and 5, these signals might be located in candidate genes/regions. In addition, a detailed distribution of these new genebased association signals is included in Additional file 4: Table S5 To confirm the effect of different alleles at the   top 16 significant SNPs in the present study, we performed allelic analysis to these SNPs. Accessions in Ting's core collection carrying different alleles for most of the 16 SNPs showed distinct discrepancies of phenotypes ( Fig. 6).

Discussion
The abundant genetic variation in Ting's core collection makes it an important reservoir of genetic diversity and potential source of beneficial alleles for rice breeding (Fig. 1). It is very difficult to mine and utilize the exotic genes in all the rice accessions (i.e., 775,000) in the world [54] by either linkage mapping or association mapping. The maximum population size used for GWAS was 1495 rice accessions in a previous study [10]. One of the methods of utilizing a large set of germplasm in a GWAS is to construct a core collection [16]. A rice core collection consisting of 150 accessions selected based on 48 morphological traits from 2262 accessions of Ting's collection has been constructed and used in rice association mapping with low resolution [19,20]. Therefore, we performed a GWAS by wholegenome re-sequencing for getting higher resolution within Ting's core collection.
Although the population size of Ting's core collection is smaller than that of three other populations [5,8,9], the phenotypic diversity of several agronomic traits was comparable to that of these populations or even higher for some agronomic traits. Moreover, more than 3.8   [8]. Furthermore, a simpler population structure (Figs. 2 and 3), more rapid LD decay (Additional file 1: Figure S2) and more distant relatedness (Additional file 1: Figure S3) among accessions were found in Ting's core collection than in other collections. The above mentioned information illuminates and supports the fact that Ting's core collection is suitable for GWASs. Population structure in the present study was not identical to that in our previous study [55,56]. This discrepancy might be due to molecular markers density used in two studies. In our previous study, 274 SSR markers were included to detect the population structure while about 3.8 million SNPs were used in present study.
A total of 3,808,730 SNPs from 150 varieties were used for the GWAS (Additional file 2: Table S3). A mixed model was performed using EMMAX software [55,56]. EMMAX not only can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution. But also can reduce computational time [55,56]. The value obtained from a rough Bonferroni correction of P = 1/n, where n is the total number of markers used in the GWAS, is widely applied as the threshold P value for significance [5][6][7][8]10]. The threshold P value for significance in our study was P ≤ 2.63 × 10 − 7 , corresponding to -log 10 (P) = 6.58. However, only one peak, i.e., one on chromosome 4 for heading days was higher than this threshold value in Fig. 4a. Hence, we chose a lower -log 10 (mBF) value as the significance threshold for different traits in our study (Table 1) because there will be no significant locus according to the theoretical threshold P value. We speculated that this result might due to population size in our study. However, Ting's core collection is suitable for GWASs because the peaks located in well-known genes such as SD1, GS2, GS3, GS5, GL7, GW8 and TGW6 were also much lower than the theoretical threshold value (Figs. 4 and 5).
In our study, some significant association signals were identified through a GWAS of Ting's core collection. First, loci significantly associated with agronomic traits were uncovered close to cloned genes such as Hd1, SD1, Ghd7, GW8, and GL7 (Figs. 4, 5 and Table 4) that were reported in previous studies. Moreover, some of these loci were located by coincidence in these genes, and they might be natural variations of these genes, which could be functional ( Table 2 and Additional file 3: Table S4). Second, Si et al. [25] indicated that some significant loci within the distance of LD decay might be identical to each other. However, there were no identical significant loci in the two locations overall (Table 1), but some identical significant regions were discovered in the two locations when the estimated distance of LD decay of 100 to 350 kb was considered in Ting's core collection (Table 1, Figs. 4 and 5). Third, some new significant association signals that might be candidate genes were detected in our study (Figs. 4, 5 and Additional file 4: Table S5). Some peaks of these candidate genes such as the peak on chromosome 4 for heading days (Fig. 4a) were even higher than the threshold value. Further, the  Note: a R 2 represents the genetic variants explained by the significant SNPs. b Gene ID of MSU rice genome annotation project (http://rice.plantbiology.msu.edu/) Fig. 6 The box plots showing phenotypic distribution for Ting's core collection carrying the different alleles at the top 16 significant SNPs in Table 3. The middle line indicates the median, the box indicates the range of the 25th to 75th percentiles of the total data, the whiskers indicate the inter-quartile range and the outer dots are outliers peak on chromosome 11 for heading days (Fig. 4a) was higher than that of some famous genes such as Hd1. It would be valuable to test the functions of these candidate genes because some loci or regions were also detected by previous studies. For instance, the region on chromosome 8 for plant height, the region at position 23,300,000 on chromosome 1 for heading days and the region at position 21,650,000 on chromosome 2 were found to be significantly associated with related traits in the study of Zhao et al. [9].

Conclusions
In this study, Ting's core collection showed abundant genetic variation for agronomic traits and was proved to be a suitable natural population that could be comparable to other populations used in previous GWASs. Moreover, according to this study, core collections constructed from large natural populations of other plants might be good choices for GWASs. Furthermore, some natural variations in cloned genes were founded in this study, and these variations could be used for functional analysis of these genes. In addition, new candidate genes identified in this study could be very useful for rice improvement. In sum, this study provided important information for further mining these elite genes within Ting's core collection and using them for rice breeding.

Plant material
Ting's core collection with 150 accessions of rice landraces [18], was used in this study. The information for these accessions is shown in Additional file 2: Table S1.

Phenotyping
In total, 12 agronomic traits of Ting's core collection were measured in two locations. The methods of measuring these 12 agronomic traits were identical to those described in detail in our previous study [20]. A randomized complete block design with three replications was used in two locations. First, Ting's core collection was cultivated at the farm of South China Agricultural University, Guangzhou (23°16′' N, 113°8′ E), during the late season (July-November) in 2009. The design and methods of this research in Guangzhou were described in detail in our previous study [20]. Second, Ting's core collection was cultivated at the farm of China National Rice Research Institute, Hangzhou (30°3′ N, 120°2′ E), during the late season (May-October) in 2016. A randomized complete block design with three replications, as in Guangzhou, was used during this season in Hangzhou. The space between rows and between plants was set to 26 and 20 cm, respectively. Twenty-four plants of each variety were grown in four rows with 6 plants per row. For each block, the five plants in the middle position of the second and third row of each variety were selected to prevent edge effects. The broad-sense heritability (H 2 ) was calculated as H 2 ¼ σ 2 g =ðσ 2 g þ σ 2 e Þ, where σ 2 g is the genetic variance, σ 2 e is the environmental variance.

DNA isolation and genome sequencing
Total genomic DNA was extracted using a modified SDS method. Then, each landrace's DNA was sheared randomly into~500-bp fragments by Covaris, and the DNA fragments were loaded on 2% agarose gels. Fragments of~500 bp were recovered and purified, and adapters were then added to each fragment. After making libraries for the clusters, they were loaded into an Illumina HiSeq™ 4000 for 2× 150-bp paired-end sequencing at 6~7-fold genome coverage. The 150-bp paired-end reads were mapped onto the rice reference genome (IRGSP 1.0) using bwamem with the -M option in BWA software [57]. The mapped reads were realigned by using RealignerTargetCreator and IndelRealigner in GATK [58]. UnifiedGenotyper in GATK was used with the −glm BOTH option to label SNPs and indels. After removing nucleotide variants with a missing rate ≥ 0.25 and a minor allele frequency > 0.05, a total of 3,808,730 SNPs and 391,756 indels were generated.

Population genetic analyses
Principal component analysis (PCA), construction of a neighbor-joining (NJ) tree, determination of LD decay level and kinship analysis among landraces were performed based on SNPs. The population structure of the 150 varieties was estimated with PCA by using the software EIGENSTRAT [59]. PHYLIP version 3.695 software (http://evolution.genetics.washington.edu/ phylip/getme-new1.html) was used to construct the NJ tree on the basis of similarity measures. The software MEGA V5.2 was used to observe the NJ tree [60]. The LD in Ting's core collection was evaluated using squared Pearson's correlation coefficients (r 2 ) calculated with the −r2 command in the software PLINK [61]. A Q matrix was obtained from the membership probability of each variety using ADMIXTURE Version 1.22 software [62]. The Q matrix was used for further association mapping. The Loiselle algorithm was chosen to construct a kinship matrix (K) with the software SPAGeDi [63]. Moreover, all negative kinship values were set to zero.

GWAS
A total of 3,808,730 SNPs from 150 varieties were used for GWAS. A mixed model was performed using EMMAX software [56]. P ≤ 2.63 × 10 − 7 (P = 1/n, n = total number of markers used [7], which is a rough Bonferroni correction, corresponding to -log 10 (P) = 6.58). However, no significant loci were detected based on this threshold, hence, we calculated another significance threshold, i.e., a minimum Bayes factor (mBF), based on the P value threshold for significance. The mBF was calculated using the following formula: mBF = −e * P * ln(P) [64]. Thus, the significance threshold in this study was -log 10 (P) = 4.97.

Additional files
Additional file 1: Figure S1. SNP distribution along position in each chromosome. Figure S2. Genome-wide average LD decay estimated in Ting's core collection on 12 chromosomes. Figure S3. Distribution of pairwise relative 1 kinship values based on 3.8 million SNPs in Ting's core collection. The height of blue bar represents the percentage of varieties in different range of kinships. Figure S4. Plots of observed versus expected P-values using EMMAX for 12 agronomic traits. Red symbol represents expected P-values, and Blue symbol represents observed P-values. (DOCX 1084 kb) Additional file 2: Table S1. Accessions, variety names, origin and germplasm types of 150 rice varieties in Ting's core collection. Table S2. Resequencing average read depth and coverage in Ting's core collection. Table S3. Summary of categorized SNPs and InDels. (DOC 246 kb) Additional file 3: Table S4. List of all P-value ranked genes in the genebased association analysis of heading days/plant height/seed set rate/ panicle length/grain length/100 grains weight/flag leaf length/flag leaf width/panicle number per plant. (XLSX 169 kb) Additional file 4: