Transcriptomic resources for prairie grass (Bromus catharticus): expressed transcripts, tissue-specific genes, and identification and validation of EST-SSR markers

Background Prairie grass (Bromus catharticus) is a typical cool-season forage crop with high biomass production and fast growth rate during winter and spring. However, its genetic research and breeding has remained stagnant due to limited available genomic resources. The aim of this study was to generate large-scale genomic data using high-throughput transcriptome sequencing, and perform a preliminary validation of EST-SSR markers of B. catharticus. Results Eleven tissue samples including seeds, leaves, and stems were collected from a new high-yield strain of prairie grass BCS1103. A total of 257,773 unigenes were obtained, of which 193,082 (74.90%) were annotated. Comparison analysis between tissues identified 1803, 3030, and 1570 genes specifically and highly expressed in seed, leaf, and stem, respectively. A total of 37,288 EST-SSRs were identified from unigene sequences, and more than 80,000 primer pairs were designed. We synthesized 420 primer pairs and selected 52 ones with high polymorphisms to estimate genetic diversity and population structure in 24 B. catharticus accessions worldwide. Despite low diversity indicated by an average genetic distance of 0.364, the accessions from South America and Asia and wild accessions showed higher genetic diversity. Moreover, South American accessions showed a pure ancestry, while Asian accessions demonstrated mixed internal relationships, which indicated a different probability of gene flow. Phylogenetic analysis clustered the studied accessions into four clades, being consistent with phenotypic clustering results. Finally, Mantel analysis suggested the total phenotypic variation was mostly contributed by genetic component. Stem diameter, plant height, leaf width, and biomass yield were significantly correlated with genetic data (r > 0.6, P < 0.001), and might be used in the future selection and breeding. Conclusion A genomic resource was generated that could benefit genetic and taxonomic studies, as well as molecular breeding for B. catharticus and its relatives in the future. Supplementary Information The online version contains supplementary material available at 10.1186/s12870-021-03037-y.


Background
Bromus L. is a genus distributed widely in the temperate world, which contains approximately 150 C3 grass species, many of which are not taxonomically identified [1,2]. Bromus catharticus is one of the major agricultural grass species, which belongs to the Ceratochloa section (hexaploid, 2n = 6x = 42), and has become naturalized in many areas, with facultative cleistogamous reproductive behavior and annual or short-lived perennial growth forms [1]. Due to high biomass yield, fast growth rate during winter and spring, strong adaptability, and the ability to remain green after seed maturation, B. catharticus has become more and more popular as a cool-season forage grass in the mountainous and hilly areas of Southwest China [3,4]. However, as cultivated varieties of B. catharticus were mostly developed through hybrid selection and domestication from wild germplasms, its genomic resources are considerably limited, resulting in slow progress on molecular breeding and taxonomy research.
Next generation sequencing technologies provide cutting-edge approaches for high-throughput sequence generation, allowing rapid and comprehensive analyses of genomes and transcripts. Extensive transcriptomic studies have already been applied in various plant biological contexts. Where for instance, comparative transcriptome analysis on different tissues and stages was used to reveal the regulatory modules of tissues development [5][6][7], or the metabolism of specific biochemical components [8][9][10], and tissue-specific expression profiling has proved effective in uncovering biological pathways and regulatory networks. Despite general applications of genomic approaches, to date, these have not yet been applied to research on B. catharticus. Therefore, it is still necessary to provide an original reference transcriptome profile for B. catharticus and its relatives.
Transcriptome sequencing has also been used to identify molecular markers, especially simple sequence repeat (SSR) markers, in many forage grass species, such as Medicago sativa [11], Elymus sibiricus [12,13], and Lolium multiflorum [14] due to its cost-efficiency. Molecular markers have a great potential in the modern plant breeding process, and have been widely applied to determine genetic diversity, genotype identification, and in marker-assisted selection (MAS) [13,15]. However, the lack of marker information for molecular phylogeny and genetic structure limited brome (Bromus L.) species collection, conservation and utilization, and there are as of yet few available EST sequences of Bromus L. in the GenBank database (https:// www. ncbi. nlm. nih. gov/). Previously, amplified fragment length polymorphisms (AFLPs) were estimated in 32 South American and one North American Bromus genus accessions [2]. RAPD and AFLP analyses on B. catharticus showed a narrow genetic basis of varieties from France and Singapore, which provided a reference for the use of molecular marker techniques to select parent genotypes to broaden the genetic basis of these varieties [16]. SSRs and AFLPs were used to compare the correlation between molecular markers and significant genetic variation in Bromus tectorum, and SSRs were suggested to be good surrogates for phenotypic traits in population genetic studies of strongly inbred species [17]. Due to a lack of effective molecular markers from genetic data of its own species or genus, SSRs and EST-SSRs (expressed sequence tag-SSRs) from wheat (Triticum aestivum) were employed on Bromus inermis to evaluate marker transferability, and showed advantages such as co-dominance, stability, and high reproducibility [18]. EST-SSRs have been confirmed to have excellent transferability between different species of the same genus, and even between different genera. Furthermore, transcriptome sequencing (RNAseq) is useful in the identification and development of a great number of SSR markers, and is faster and more cost effective compared to the traditional SSR development processes [13].
The present study reports a comprehensive transcriptome sequencing of a B. catharticus genotype. The major objectives were (a) to generate ESTs from various tissues at different developmental stages via transcriptomic sequencing of B. catharticus, and to annotate a de novo transcriptome assembly; (b) to compare transcript expression between different tissues and identify tissue-specific genes; (c) to carry out large-scale in silico identification of SSRs from transcriptomic data; and (d) to perform genetic diversity and population structure analyses on a set of germplasm accessions combined with phenotypic data. A genomic resource was generated that showed preliminary potential for application in molecular-assisted selection, which may provide a basis for further investigation of the genetics and taxonomy of Bromus and its relatives.

Illumina sequencing, de novo assembly, and functional annotation
Illumina sequencing of eleven libraries for seed, leaf, and stem tissues generated an average of 48,945,309 raw reads and 46,614,613 clean reads, and high-quality clean reads were used for subsequent analyses (Table 1). Transcripts amounting to 450,361 were assembled with an N50 length of 1346 bp and an N90 length of 281 bp, ranging from 201 to14,257 bp with a mean of 766 bp. The transcripts were then clustered into a total of 257,773 unigenes with 1129 bp, 1629 bp, and 531 bp for mean length, N50, and N90, respectively (  (Fig. 1a). Unigenes amounting to 25,739 (9.98%) were annotated in all seven databases, and 64,691 unigenes (25.1%) did not have any matches, and are potentially novel genes that arose during the evolution of the B. catharticus genome. Based on Nr database annotation, compared to more than 700 species, unigenes of B. catharticus were shown with the top match of Hordeum vulgare (24.6%), followed by Aegilops tauschii (19.9%), Triticum urartu (12.0%), Brachypodium distachyon (11.7%), and Triticum aestivum (5.9%; Fig. 1b). GO enrichment analysis of unigenes showed that cellular processes, metabolic processes, and single-organism processes were highly enriched, especially "binding" (66,849) and "catalytic activity" (53,923) in the molecular function category, and "cell" (38,314) and "cell part" (38,293) among cellular components (Fig. S3a). Among the five categories of KEGG pathways, metabolism was the largest, with 11 subgroups and 28,653 genes, including the most significant subgroup of transcription with 7,828 genes (Fig. S3b).

Analysis of DEGs in tissues of B. catharticus
In order to identify tissue-specific genes, we placed 10 transcriptomic samples into three groups, i.e., group S for seed with S1, S2, F1, and F2, group L for leaf with L1, L2, and L3, and group T for stem with T1, T2 and T3. We conducted a comparative analysis among S, L, and T for DEGs (P < 0.05), and determined tissue-specific highly expressed genes in one tissue, but not in the other tissues. We plotted Venn diagrams for the upregulated DEGs in the three pair-wise comparisons (i.e., S vs L, S vs T, and L vs T). A total of 1803 genes were found to be specifically and highly expressed in developing and germinating seeds (overlapping DEGs in "S vs L" and "S vs T, " Fig. 2a), but not in the other two tissues of leaf and stem. And 3,030 and 1,570 genes were specifically and highly expressed in leaves (overlapping DEGs in "L vs S" and "L vs T, " Fig. 2b) and stems (overlapping DEGs in "T vs S" and "T vs L, " Fig. 2c), respectively. We found genes specifically and highly expressed in developing and germinating seeds were significantly enriched in the KEGG pathway of "ribosome" (Fig. 2d). Leaf-specific highly expressed genes were significantly enriched in genes involved in "carbon metabolism", "photosynthesis", "glyoxylate and dicarboxylate metabolism", and "carbon fixation in photosynthetic organisms" (Fig. 2e). Stem-specific highly expressed genes were significantly enriched in genes involved in "phenylpropanoid biosynthesis", "plant-pathogen interaction", "amino acid biosynthesis", and "MAPK signaling  Tissue-specific genes and their KEGG annotation in seeds (termed S, including S1, S2, F1 and F2), leaf (L, including L1, L2, and L3) and stem (T, including T1, T2, and T3). Venn diagrams showed the number of genes specifically expressed in seeds (a), leaves (b) and stems (c). KEGG pathway enrichment diagrams of seed-(d), leaf-(e) and stem-(f) specific genes are shown pathway" (Fig. 2f ). To validate DEGs obtained from RNA-seq, we performed qRT-PCR analysis for 11 selected genes in three tissue samples (F2, L2 and T2), and the result was in good agreement with RNA-seq results (r 2 = 0.814, P < 0.001, Fig. 3).

Identification of EST-SSR markers and genetic diversity
A total of 37,288 SSRs were identified in 32,370 (14.47%) unigenes, in which more than one SSR was present in 4,238 transcripts. Tri-nucleotide SSR repeats were the most abundant (15,208), followed by mono-nucleotides (12,892) and di-nucleotides (8,174). A or T repeats (10,513) were four or five times more common than C or G repeats (2,379) among mono-nucleotide repeats. CCG/ CGG (4,818), AGG/CCT (2,673) and AGC/CTG (2,375) comprised 64.87% of all the tri-nucleotide repeat motifs among SSRs (Fig. 4, Table S1). AAAG/CTTT, ATCCG/ ATCGG, and ACC TCC /AGG TGG repeats dominated the tetra-, penta-, and hexa-nucleotide repeats, respectively, although they were rare. Of SSRs, 27,540 were successfully used to design > 80,000 primers; a total of 420 primer pairs were randomly selected to determine polymorphisms and performance, and finally 350 primer pairs were successfully used to amplify DNA fragments across 24 B. catharticus accessions. To further validate the sequences of the SSR loci, ten PCR products from nine accessions for two SSR markers were sequenced. In all of the cases, the sequenced alleles from the selected accessions were homologous to the original locus from which the primers were designed, which effectively guarantees acceptable downstream analysis (Fig. S4).
We further selected 52 effective primer pairs that were designed based on the di-, tri-, and tetra-nucleotide SSRs with 5-10 repeats (Table S2). The mean PP, PIC, Rp, and MI values were 91.60%, 0.254, 1.14, and 0.67, respectively. Among the three repeat types, tetra-nucleotide SSRs showed relatively higher PIC (0.290), Rp (1.29), and MI (0.73) values, indicating a high discriminatory power. Primers from five-repeat SSRs showed higher PIC values (0.285), while seven-repeat SSRs showed higher Rp (1.29) and MI (0.86) values (Table 3). A total of 152 bands amplified by 52 selected primer pairs were used to estimate Nei's genetic distance of the 24 accessions. The genetic distance values ranged from 0.008 to 0.790, with an average distance of 0.364. The maximum value (0.790) was observed between accession PI442077 and PI309955, and the minimum coefficient value (0.008) was observed between PI187000 and PI595114. To study the genetic relationship and population structure of B. catharticus, UPGMA clustering analysis and STRU CTU RE analysis were performed (Fig. 5). Based on maximum likelihood and delta K values (K = 6), the accessions in this study were divided into five STRU CTU RE subgroups (represented by five colors: blue-green, purple, orange, green, and light yellow). The result of four UPGMA clusters (Cluster I ~ IV) was consistent with STRU CTU RE analysis. Cluster II contained two STRU CTU RE subgroups (purple and orange). Both Clusters I and II consisted of 10 accessions, and only two accessions were found in Clusters III and IV, respectively. In addition, 14 accessions (> 50%) had high membership coefficients (Q-value > 80%), including 7, 2, 2, and 3 accessions  (Table 4). Additionally, accessions collected from the wild had relatively high genetic diversity (0.259), followed by cultivar materials (0.251), and then materials of uncertain breeding status (0.179).

Phenotypic variation and Mantel analysis
The basis for evaluation of plant germplasm resources was the phenotype. In this study, 11 quantitative traits in 24 B. cartharticus accessions were investigated to estimate breeding potential and phenotypic variation. The descriptive statistics showed variation coefficients ranged from 2.21% to 24.87%, with an average of 11.61%, indicating a high level of phenotypic variance (Table S3).
Analysis of UPGMA clusters and PCA showed that all 24 accessions were divided into four groups (A, B, C, and D), which showed a weak relationship with geographical distribution (Fig. 6). Group A contained only one accession (PI 442,077) from Asia, with low biological yield, PH, SD, TN, and wide flag leaf. Group B consisted of four accessions from South America and two from Europe and Asia, and showed relatively high PH, SD, leaf width, and low TN and LFI. Group C contained almost all the accessions from North America (5) and Europe (3), and other accessions from Asia (4) and South America (3), and showed high TN and LFI, and low SD and leaf width. There were two accessions in Group D, PI308506 from South America and PI595118 from North America, which both had a particularly high TN, and low SD with a dwarf phenotype (Fig. 6a, Table S3). Moreover, PI308506 showed narrow leaves and late heading, and PI595118 showed broad leaves and early heading. The PCA analysis  also indicated that the first two components accounted for 91.97% of the variance (76.45% and 15.53%, respectively), and that there were four groups (Fig. 6b). Mantel analysis was performed to understand the relationship between genetic distance and 11 traits based phenotypic distance, which showed a high correlation (r = 0.612, P < 0.01) (Fig. 7). While Mantel analysis was also used to analyze the correlation between the Euclidean distance of 11 phenotypic traits and genetic distance of 24 accessions respectively ( Table 5). The results showed that the coefficients ranged from 0.340 to 0.666, and four traits (PH, FLW, SD, and FMY) had high significant correlation (r > 0.6, P < 0.001). In addition, phenotypic groups were consistent with phylogenetic clusters. . All accessions were grouped into four clusters. The accession codes with red, blue, yellow, and black indicated that these accessions were collected from South America, North America, Asia, and Europe, respectively. All accessions colored based on the sample origins were divided into four UPGMA clusters and five STRU CTU RE subgroups (represented by five colors: blue-green, purple, orange, green, and light yellow). The numbers beside each branch represent bootstrap values, and only the numbers higher than 40% are presented  The genetic data could mainly explain the phenotypic difference. For example, the low biological yield, PH, SD, TN, and wide flag leaf of PI442077 in Group A might be due to particular genetic information in Cluster IV (Figs. 5, 6 and Table S3). All ten accessions in Cluster I were found in Group C, and all six accessions in Group B were in Cluster II. Both PI308506 and PI595118 in Group D showed dwarf phenotype, with more tillers and thinner stems (Fig. 6, Table S3), but there were significant differences in heading date and leaf width between PI308506 in Cluster IV and PI595118 in Cluster III, which might be explained by a split in genetic clustering.

The transcriptome is a key genetic resource for Bromus species
Efforts toward molecular phylogeny and molecular breeding of Bromus L. have stagnated due to limited genomic resources. In the present study, our complete transcriptomic analysis in B. catharticus covered 11 tissue samples, and likely assembled almost all expressed genes, with a total of 257,773 unigenes, 193,082 (74.9%) of which were annotated. While our data represents a large collection of expressed genes that can be used in future genetic studies and improvement programs in B. catharticus and even other Bromus species, these numbers are superior to those in some other forage species previously reported, for example Dactylis glomerata [19], Elymus sibiricus [12] and Pennisetum purpureum [20]. It is known that the third-generation sequencing of PacBio and Nanopore could produce up to 1 Mb reads, but the full length of transcripts always prefers to do the pooling of various tissues, to lower the sequencing cost; in addition, their sequence errors are higher than those of Illumina sequencing platforms [21,22]. Our data may be used as a gene reference in subsequent transcriptomic analyses in B. catharticus in the future.

Tissue-specific gene expression
Grass seeds are the basis of natural grassland restoration and artificial grassland establishment, while stems and leaves are important factors for forage yield and nutritive value [23,24]. Few tissue-specific transcriptome works on forage grasses have been done to identify candidate genes involved in tissue development and forage quality improvement. Research on elephant grass (Pennisetum purpureum) stem lignocellulose identified 3,852 DEGs, and screened 43 candidate genes involved in lignocellulose biosynthesis, which might promote the development of high-quality elephant grass varieties [7]. We found 1,570 stem-specific genes, among which those significantly enriched in the phenylpropanoid biosynthesis pathway might affect the feed value of B. catharticus, as phenylpropanoid metabolism is a major pathway of lignocellulose biosynthesis [7,25]. While we identified 3,030 leaf-specific genes mainly related to the photosynthetic pathways, which might be of benefit in the improvement of leaf photosynthetic efficiency and biomass parameters. Analysis of leaf transcriptome of alfalfa identified 5,133 DEGs and senescence-associated pathways, including ribosome and phenylpropanoid biosynthesis, and starch and sucrose metabolism pathways, and suggested a novel interpretation of the molecular mechanisms of leaf senescence [26]. In addition, our comparative analysis showed that the ribosome pathway was the most significantly enriched pathway among 1,803 seed-specific genes. It is known that ribosomes are needed to synthesize storage and other important proteins, such as late embryogenesis abundant (LEA) proteins with key roles in seed maturation and drying [27], and they are also required in seed germination, where new ribosomes are produced after several hours of imbibition to synthesize proteins related to DNA repair, RNA synthesis, organelle assembly, and energy supply [28,29]. Simply put, this work provides a preliminary understanding of tissue-specific expression profiling of B. catharticus, which may be valuable genomic data for the improvement of the yield and quality of B. catharticus.

Abundant EST-SSRs in B. catharticus
EST-SSRs are powerful genetic markers, which have been popularly used to determine genetic diversity, population structure, cultivar identification, genetic maps, and marker-assisted selective breeding due to its advantages in identifying co-dominance, abundant polymorphisms, high transferability, and rapid and convenient detection methods [13,30]. To date, few EST-SSRs have been applied to B. catharticus, and most of the diversity estimation was based on the phenotypic characteristics. Within our transcriptomes from different development and growth stages, with a total of 37,288 SSRs, the occurrence frequency of EST-SSRs was one per 7.8 kb. This frequency was higher than that of alfalfa (1/12.06 kb) [11] and Leymus chinensis (1/10.78 kb) [31], but lower than that of E. sibiricus (1/6.95 kb and 1/6.2 kb in two previous reports [12,13]). In addition, the total number of SSRs was far more than that of E. sibiricus (8,769 and 8,871 in two reports [12,13]), Lolium multiflorum (11,254) [14], and alfalfa (1,649) [11]. In contrast, the proportion of EST-SSR markers with numbers of high polymorphisms was lower than that of alfalfa (27/100) [11] or E. sibiricus (112/500) [12], which may be caused by primer design and selection. The differences in the amount and frequency of SSRs also strongly depends on the species, size of the database, SSR search criteria, and mining tools used. In this study, tri-nucleotide SSR repeats were most abundant, followed by mono-and di-nucleotide repeats among various classes of SSRs, and CCG/CGG was the most in tri-nucleotide repeats, which was identical to those found in E. sibiricus and L. multiflorum, but different from the GAA/CTT repeats most abundant in Sesamum indicum and Lupinus luteus. The abundance of SSRs seems to be species-specific [12,13,32,33]. In brief, the present study identified a large number of molecular markers conducive to breaking the bottleneck problem of genetic research in B. catharticus.

EST-SSRs are valuable tools for germplasm characterization and molecular breeding of B. catharticus
Understanding of genetic diversity and population structures is very important for the effective management and utilization of germplasm resources [34]. Evaluation of the genetic variation of forage resources can provide a basis for selection of resources, and thus speed up the progress of high-quality forage breeding. The evaluation of B. catharticus germplasms has been relatively lagging, and diversity estimation by molecular markers showing significantly higher amounts and resolution than those of morphological and cellular markers was especially rare [35]. A previous study on the genetic diversity of B. catharticus showed that the genetic basis of commercial varieties was very narrow, with an average genetic similarity coefficient of 0.96 [16]. In contrast, the present study contained 24 accessions from four continents, including wild types and uncertain materials, and found an even lower level of genetic similarity (averaged 0.636) than those of commercial varieties. Most accessions from Asia were grouped into one cluster, and 66.67% accessions showed mixed relationships, while 87.50% South America accessions showed a pure origin (Q-value > 80%), which suggested the probability of gene flow among the two regions were different. Based on breeding states, the results suggested higher diversity in wild materials than cultivar and cultivated materials, which might be due to contribution of long-term artificial domestication [36]. The tested wild resources showed higher genetic diversity, which also indicated the possible benefits of breeding comprehensive varieties with wild resources in the future. Over the long history of evolution, a number of germplasms were formed under genetic and environmental influences, leading to abundant morphological diversity. Phylogenetic analysis of all the accessions were consistent with phenotypic groups. There is a relatively high correlation (r = 0.612, P < 0.001) between genetic variation and phenotypic variation, and the phenotypic clusters showed a weak relationship with geographical distribution, suggesting that the phenotypic diversity of B. catharticus was mainly due to the contribution of genetic factors. This could be supported by the previous works on Ceratochloa, which suggested a high correlation (r = 0.70, p = 0.001) between morphological variation and DNA polymorphism using AFLP markers [2], and a low correlation (r = − 0.243) between geographical distance and genetic similarities among hexaploid Bromus accessions [3]. Phenotypic diversity analysis in Argentinian populations suggested total phenotypic variation of B. catharticus was mostly due to the environmental component [37]. The inconsistency between this study and others might be caused by different sources of tested germplasms and collection of phenotypic traits. In this study, we also suggested FLW, PH, SD, and FMY might be more reliable and effective traits in a selective breeding project; in particular, SD as an indicator of major biomass traits showed relatively high variation coefficients and high correlation with molecular data, which could be highly effective and stable in marker-trait association or QTL mapping. The identified EST-SSR markers may potentially be used for further genetic selection and marker-assisted breeding.

Conclusions
In this study, in B. catharticus, we conducted RNA-seq analysis to identify 1803, 3030, and 1570 genes specifically and highly expressed in seed, leaf, and stem, respectively. We also generated 257,773 unigenes based on a total of 11 RNA-seq datasets, which were used to develop 37,288 EST-SSRs. And 52 polymorphic EST-SSR markers were selected to analyze genetic diversity among 24 B. catharticus accessions. A low genetic diversity was found in all the accessions, albeit with higher ones in South America and Asia and wild accessions. The phylogenetic analysis clustered all the accessions into four clusters, which were consistent with the phenotypic clustering results. Furthermore, Mantel analysis suggested the total phenotypic variation was mostly contributed by genetic component, and four phenotypic parameters, including stem diameter, plant height, leaf width, and biomass yield, were significantly correlated with genetic variations, suggesting a great potential in genetic selection and breeding. In general, a comprehensive genomic resource was generated that could benefit genetic and taxonomic studies, as well as molecular breeding for B. catharticus and its relatives in the future.

RNA isolation and transcriptome sequencing
BCS1103, a new line of B. catharticus with an excellent yield, was originally selected via wild germplasm collected in the southwest region of Sichuan Province, China. The single plant was grown to maturity in a pot with recommended cultural practices at Sichuan Agricultural University, Yaan, Sichuan. Eleven tissue samples of this accession were collected from at least ten healthy individuals, i.e., seeds with emerged radicle length of 1-2 mm (S1) and with shoot length of 1 cm (S2), aboveground seedlings at the three-leaf stage (S3), leaves at tillering stage (L1), flag leaves (L2), and penultimate leaves (L3) at 8 days after fertilization (DAF); stems at tillering stage (T1) at 8 DAF (T2) and 20 DAF (T3); and seeds at 8 DAF (F1) and 20 DAF (F2) (Fig. 8). All tissues were placed directly into liquid nitrogen before being stored at − 80 °C. Total RNA was isolated using the method described by Ghawana et al. [38]. A total amount of 1.5 μg RNA per sample was used for library construction. Sequencing libraries were generated using NEBNext ® Ultra ™ RNA Library Prep Kit for Illumina ® (NEB, USA) following the manufacturer's recommendations, and index codes were added to attribute sequences to each sample. The libraries were sequenced on an Illumina Hiseq 2500 platform.

De novo assembly and gene annotation
In this study, clean reads were obtained by removing those reads that contained adapters or poly-N (i.e., ambiguous sequences), and low-quality reads that had more than 50% of bases with Q-value ≤ 20. The unigenes were binned by Corset [39] from the assembled contigs using Trinity [40]. All unigenes were annotated against seven databases: the NCBI non-redundant protein database (Nr), the NCBI nucleotide database (Nt), Pfam, Swiss-Prot, Cluster of Orthologous Group, Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG). NCBI Nt and Nr alignments were considered superior annotations if there were inconsistent results among these databases.

Gene expression and DEGs identification
RSEM software was used for mapping clean reads onto the unigenes [41]. Gene expression levels were estimated based on FPKM (Fragments Per Kilobase of exon model per Million mapped fragments) values. The expression patterns of each sample reflected by the density distribution of FPKM values were plotted based on a log 10 (FPKM) scale as fold change. Moreover, tissue-specific genes were identified by overlapping Fig. 8 Tissue samples used in this study. A total of eleven samples were collected for RNA-seq, including seeds emerging with a radicle length of 1-2 mm (S1) and shoot length of 1 cm (S2), aboveground parts of seedlings at the three-leaf stage (S3), leaves at the tillering stage (L1), flag leaves (L2) and penultimate leaves (L3) at 8 DAF, stems at tillering stage (T1), 8 DAF (T2), and 20 DAF (T3), and seeds at 8 DAF (F1) and 20 DAF (F2). All samples were collected and mixed from at least ten individual plants upregulated DEGs in the two comparisons using DEGseq with a threshold set of adjusted p-value < 0.05 and |log 2 fold change|> 1 [42].

Quantitative real-time PCR (qRT-PCR) analysis
To experimentally validate our results, eleven genes were selected for qRT-PCR analysis on the three samples (F2 for seed, L2 for leaf and T2 for stem), with three biological replicates. Details of the eleven genes and their primer pairs were shown in Table S4. The first-strand cDNA was synthesized according to Revert Aid Premium Reverse Transcriptase Kit (Thermo Scientific ™ ). The experiment was performed in a CFX96 Real-Time System (Bio-Rad, USA) with GAPDH (glyceraldehyde-3-phosphate dehydrogenase) gene as internal control. Each PCR reaction contained 10 μl of the SYBR-Green Master Mix and 4 pmol of each primer, and the amplification was set under the following condition: an initial step of 95 °C for 3 min and a three-step cycle of 95 °C for 10 s, 57 °C for 10 s and 72 °C for 15 s, repeated 40 times. Gene quantification was determined using the ΔΔCT method. The relative expression values normalized using log 2 (fold change) were used to calculate the correlation coefficient between RNA-seq and qRT-PCR results.

EST-SSR detection and primer design
MISA software was used to detect potential SSR markers in the unigenes [43]. Default parameters were set to include mono-nucleotide repeats of more than 10X, di-nucleotide repeats of more than 6X, and tri-, tetra-, penta-, and hexa-nucleotide repeats of more than 5X. Primer 3 was used to design PCR primers for each SSR with the default parameters [44]. PCR primers were synthesized in TsingKe Biological Technology Company (Beijing, China).

SSRs validation and genetic variability
A total of 24 accessions were collected from ten countries across four continents (North America, South America, Europe, and Asia), including wild materials, cultivars, and accessions of uncertain breeding status ( Table 6). Seeds of BCS1103 and Jiangxia were provided by Sichuan Agricultural University, and the others were obtained from the National Plant Germplasm System (USA). Seeds were planted in basins until three-leaf stage, in a randomized complete block design with three replicates at Sichuan Agricultural University (Ya'an, Sichuan, China) during the crop season in 2016 and 2017. In each plot, intra-row spacing of 0.5 m and inter-row spacing of 0.5 m allowed Polymorphic information content (PIC), marker index (MI), and resolving power (Rp) were calculated to assess the amplification efficiency of SSRs. PIC was calculated for each primer pair according to the formula: where PIC is the polymorphic information content of marker i, f i is the frequency of the fragments, which were present, and "1 − f i " is the frequency of the fragments, which were absent. PIC was averaged over the fragments for each primer pair. MI was calculated following the formula from Powell et al. [45]: MI = PIC × EMR, where EMR (effective multiple ratio, EMR = PP × NPF) is defined as the product of the percent of polymorphic loci (PP) and the number of polymorphic fragments (NPF). The Rp of each primer pair was calculated according to the formula of Prevost and Wilkinson [46]: Rp = ∑Ib, where Ib is the fragment informativeness, calculated as: Ib = 1-[2 ×|0.5 − p|], where p is the proportion of the accessions containing the fragment. Genetic diversity was measured by the number of different alleles (Na), number of effective alleles (Ne), Shannon's information index (I), and expected heterozygosity (He), which were calculated using GENALEX version 6.5 [47].
Pair-wise (Dice) genetic distance was calculated, and the matrix was used to construct a cluster using the Unweighted Pair Group Method with Arithmetic (UPGMA) in Free Tree software with 1000 bootstrap repetitions [48]. Moreover, a Bayesian model-based clustering method was used to cluster accessions into subpopulations using STRU CTU RE software (version 2.3.4) [49], and K-means clustering was used to identify the most likely number of genetic groups (K) by calculating ΔK [50]. Structure simulations were carried out using an admixed model, and based on initials trails, the membership of each genotype was tested for the range of genetic clusters from K = 2 to K = 10 (each with 10 independent runs). For each value of K, the burn-in period was set to 50,000 iterations, and estimations were performed for 100,000 Markov Chain Monte Carlo replicates in each run. CLUMPP [51] was used with the Greedy Algorithm method, and the graphical display (barplot) of population structure was generated using DISTRUCT [52].

Phenotypic and Mantel analysis
Phenotypic data was collected at the full-flowering stage in April, 2017, except for biomass yield and heading date. Nine individual plants were selected randomly from three replicate plots (three per plot) for observations of 11 morphological traits: plant height (PH, cm), stem diameter (SD, mm), flag leaf length (FLL, mm), flag leaf width (FLW, mm), penultimate leaf length (PLL, cm), penultimate leaf width (PLW, mm), length of first internode (LFI, cm), tiller number (TN), days from seeding to heading (DSH, days), dry matter yield (DMY, g/plant), and fresh matter yield (FMY, g/plant) [37] (see measuring methods in Table S5). Data analysis was performed using NTSYS (version 2.2) and PAST (version 3.02) software [53,54], including descriptive statistics, PCA, and UPGMA clustering. Finally, a Euclidean distance matrix of morphological traits was calculated using NTSYS, and the genetic distance and phenotypic distance matrix were used for Mantel analysis through GENALEX version 6.5 [47].