Illumina paired-end sequencing and de novo assembly, functional annotation
In total, we obtained 707,382,370 raw reads, from which a total of 690,806,162 clean reads were produced after quality control (Additional file 2: Table S1). All clean reads were deposited at NCBI and can be accessed via SRP349608. All of the high-quality reads from these fifteen libraries were mixed for transcriptome assembly using the Trinity software. All clean reads were de novo assembled into 284,782 transcripts with a mean length of 441 bp and N50 of 1,308 bp. These reads would be assembled into 113,582 unigenes, with a mean length of 396 bp and N50 of 934 bp (Additional file 3: Table S2). Based on homologous searches, 68,217 (60.06%) unigenes had achieved blast by DIAMOND program hits in at least one of the six databases (Additional file 4: Table S3). Further analysis of homologies had the highest homology with sequences from Vitis vinfera (19.25%), followed by Oryza sativa (5.71%), Juglans regia (4.51%) (Additional file 1: Fig. S1).
Functional categories of the unigenes
GO classification and KEGG pathway-based analysis were performed to gain insight into the assembled unigenes' functional categorization.. A total of 38,657 unigenes were assigned to GO terms, which were classified into 50 functional groups under three principal categories, i.e., biological process, Molecular function, and Cellular components (Fig. 2A, Additional file 5: Table S4). In the Biological process category, the largest subgroups were “biological process”, “regulation of transcription, DNA templated”, and “transcription, DNA templated”; In the cellular components category, “nucleus”, “cytoplasm”, “plasma membrane” were the most highly represented ones. As for the Molecular function category, the most abundant genes were associated with “molecular function”, “protein binding”, and “ATP binding”.
Besides, a total of 20,652 unigenes were matched in the KEGG database and assigned to 138 pathways in five KEGG biochemical pathways, including “metabolism”, “Genetic information processing”, “Organismal systems”, “Cellular processes”, and “Environmental information processing” (Fig. 2B, Additional file 5: Table S4). The three most represented pathways were “translation”, “carbohydrate metabolism”, and “folding, sorting, and degradation”, followed by “environment adaptation” and “overviews”, whereas “membrane transport” and “replication and repair” pathways represented the smallest categories. The pathways are associated with seed germination, including “plant hormone signal transduction”, “starch, and sucrose metabolism”, and “biosynthesis of amino acids”.
All the assembled unigenes were subjected to a search against the evolutionary genealogy of genes to evaluate the annotation's effectiveness and the transcriptome library's completeness: Non-supervised Orthologous Groups (eggNOG) database. Unigenes were assigned into 23 categories (Fig. 2C, Additional file 6: Table S5). The top three categories were “function unknown”, “posttranslational modification, protein turnover, chaperones”, and “signal transduction mechanisms”, followed by “transcription”, and “replication, recombination and repair”; The smallest category was “cell motility” and “nuclear structure”.
Identification and functional classification of DEGs
The fifteen samples were separated into five groups in the principal component analysis (PCA) plot, and three replicates of each sample were grouped. Three seed samples at different developmental stages were closer to each other than to other samples (Additional file 1: Fig. S2). DEGs during seed germination were identified by pairwise comparisons at different time points to find germination-induced genes in seeds. The number of DEGs in CAM2 vs. CAM4S was minimum, and there were much more DEGs in CAM2 vs. CAM6S and CAM6S vs. CAM26S. Thus, most gene expression changes occurred after the water uptake stage (CAM4S) (Fig. 3A, Additional file 7: Table S6). A Venn analysis showed the distribution of common and unique DEGs between different adjacent stages and pairs of CAM0 vs. CAM2, CAM0 vs. CAM4S, CAM0 vs. CAM6S, and CAM0 vs. CAM26S. There were 12 common DEGs in adjacent stages (Fig. 3B, Additional file 8: Table S7). Comparatively speaking, there were more common DEGs (297) in pairs of CAM0 vs. CAM2, CAM0 vs. CAM4S, CAM0 vs. CAM6S, and CAM0 vs. CAM26S (Fig. 3C, Additional file 8: Table S7), indicating that the gene expression was substantially altered throughout different seed germination stages of C.oleifera. These phase-related gene expression changes during seed germination may have important functional implications for the growth of C.oleifera seed.
To gain insights into the functional categorization and metabolic pathways of DEGs involved in seed germination, 6820 of the 11,391 DEGs in cotyledons during C.oleifera seed germination were subjected to function annotation and function enrichment analysis. Abundant GO terms like “nucleus”, “transcription, DNA-templated”, “molecular function”, “zinc ion binding”, “plasma membrane, and “protein serine/threonine kinase activity” were enriched in adjacent stages, and pair of CAM0 vs. CAM2, CAM0 vs. CAM4S, CAM0 vs. CAM6S, and CAM0 vs. CAM26S (Fig. S3, Additional file 9: Table S8). In addition, the results of the 3810 DEGs mapped to the KEGG database showed that a large number of genes were involved in the pathways related to “plant-pathogen interaction”, “carbon metabolism”, “endocytosis”, “ribosome”, “RNA transport”, “Protein processing in endoplasmic reticulum”, “spliceosome”, “starch and sucrose metabolism”, “phenylpropanoid biosynthesis”, and “plant hormone signal transduction” (Fig. S4, Additional file 9: Table S8). This suggests a crucial role of hormones regulating C.oleifera seed germination.
Besides, the GO terms of 12 common DEGs among adjacent stages mainly included plasma membrane, dioxygenase activity, and nucleosome. The GO terms of 297 common DEGs with pairs of CAM0 vs. CAM2, CAM0 vs. CAM4S, CAM0 vs. CAM6S, and CAM0 vs. CAM26S, mainly included “nucleosome, “molecular function”, “secondary metabolic biosynthesis pathway”, “signal transduction”, and “vacuolar membrane”. The genes involved in the KEGG pathways were related to the mRNA surveillance pathway, including “cyanoamino acid metabolism”, “plant hormone signal transmission”, “endocytosis”, “plant pathogen interaction”, “protein processing in endoplasmic reticulum”, and “2-oxocarboxylic acid metabolism” (Additional file 8: Table S7).
Identification of genes associated with hormones signaling in seed germination
To gain insights into the functional roles of hormones during the seed germination of C.oleifera, we mapped the DEGs to hormone signaling and transduction pathways and analyzed their expression in different stages (Fig. 4). A total of 86 genes were identified to be associated with the biosynthesis, metabolism, and signaling of ten hormones, including Abscisic acid (ABA), Auxin (IAA), Ethylene (ETH), Karrikins (KARs), Brassinosteroid (BR), Cytokinin (CTK), Gibberellin acid (GAs), Strigolactone (STR), Salicylic acid (SA) and Jasmonic acid (JA). The genes associated with ABA biosynthesis and metabolism were chosen as the largest group with 17 members, followed by GAs with 13 members, KARs and Auxin with 12 members, and CTK with 10 members. In contrast, the genes with JA signaling belonged to the smallest group with only 1 member. These hormone-signaling and transduction-related genes exhibited differential expression across the different examined stages. A significant proportion of DEGs related to GA, Auxin, and ETH had high expression levels in the CAM26S stage, indicative of their potential roles in seedling growth. Besides, most of the genes that respond to KARs and BR are relatively highly expressed at CAM0 and/or CAM2 stage, suggesting that they may function in the early stage of seed germination.
Identification of transcription factors in C.oleifera
Transcription factors (TFs) play critical roles in various plant development processes. To provide insights into the regulatory network underlying seed germination, we examined the expression of TFs, especially their dynamic differential expression. 310 TFs belonging to 36 different families were found to be differentially expressed in the stages examined. The TFs displayed expression patterns across the five stages. A majority of TFs showed relatively broad expression patterns in all stages (Additional file 1: Fig. S5), while some exhibited distinctive stage-specific patterns (Fig. 5). Interestingly, multiple transcription factor members of the WRKY and ERF family, which respond to GA signals [16], were significantly highly expressed in the CAM2 stage (soaking seeds). In addition, the meristem regulator SHOOTMERISTEMLESS (STM), KNOTTED- like from Arabidopsis thaliana (KNAT1), PLETHORA 2 (PLT2), and WOX11, as the root-specific transcription factor, was also observed to be highly expressed in this stage [17, 18].
Co-expression network analysis with WGCNA
To identify the clusters of highly interconnected genes that were specific to tissues, co-expression networks were constructed on the basis of pairwise correlations between genes in their common expression trends across all sampled tissues. As observed in the dendrogram, 20 unique modules of eigengenes have been identified (Fig. 6A, Additional file 10: Table S9). Notably, 4 out of 20 co-expression modules consist of genes that are significantly relevant to hormones and growth during seed germination (r > 0.7, P < 10–2) (Fig. 6B).
Here, two modules (MEyellow and MEgrey60) were listed for further analysis. The MEyellow module, including 935 genes (Additional file 10: Table S9), was highly correlated with the content of ABA (Fig. 6B, C). According to the GO analysis, all genes in the MEyellow module were highly enriched in molecular function, zinc ion binding, quercetin 7–0-glucosyltransferase activity, transcription factor activity, UDP-glycosyltransferase activity, and sugar: proton symporter activity, etc. (Additional file 7: Table S6). Those enriched in the KEGG pathways were associated with plant-pathogen interaction, plant hormone signal transduction, circadian rhythm-plant, amino sugar, and nucleotide sugar metabolism, etc. (Additional file 10: Table S9).
WGCNA can also be employed to construct gene networks, in which each node represents a gene and the connecting lines (edges) between genes represent co-expression correlations. Hub genes refer to those that show most connections in the network. In the MEyellow module network, there were 27 genes that encode transcription factors and 4 hormone-related genes. Most of these transcription factor genes were highly expressed during seed soaking and may regulate or participate in activities operating in seed germination (Additional file 10: Table S9).
The 50 most highly connected hub genes in the MEyellow module were used for analyzing the gene expression network. Gene expression showed that the expression level in the stage of seed soaking was higher than that in other stages (Fig. 6B, 7A Additional file 11: Table S10). Co-expression network showed core hub genes, namely, 25556_c1_g1 and 26030_c3_g4. The function of the 25556_c1_g1 gene remains unknown. The 26030_c3_g4(AT3G29970) gene belongs to molecular function. Other highly connected hub genes include signal transduction CoPH1(25268_c3_g2), sugar:proton symporter activity CoPMT3(26833_c0_g1), vacuole CoAT3G62550(27851_c0_g1), and CoDJC66(29838_c0_g1) (Fig. 7B). Interestingly, CoPMT3 (26833_c0_g1) is annotated into the protein O-mannosyl transferase gene family (PMT), which is an important sugar proton symporter activity, and produces small ubiquitin-like modifiers (SUMOs). PMTs have differential tissue-specific functions in phosphatidylcholine (PC) biosynthesis and plant growth. As primary enzymes for phosphocholine (PCho) biosynthesis, PMT3 are involved in PtdCho biosynthesis and vascular development in Arabidopsis seedlings [19, 20].
The 110 genes in the MEgrey60 module were highly correlated with the growth of seedlings (Fig. 6B, 6D). These genes were highly enriched in GO terms, including molecular function, protein serine/threonine kinase activity, transferase activity, sugar: proton symporter activity and transcription factor, secondary metabolite biosynthetic process, signal transduction, transcription, and transmembrane transport (Additional file 10: Table S9). Besides, KEGG analysis showed that those genes were enriched in pathways associated with plant-pathogen interaction, plant hormone signal transduction, phenylpropanoid biosythesis, cyanoamino acid metabolism, limonene and pinene degradation, and brassinosteroid biosynthesis (Additional file 10: Table S9). The hub genes with the maximum number of edges (81) is CoVIT_13s0047g00260 (28624_c1_g1), a member of the shiu: CAMK1 protein kinase family, which is associated with protein autophosphorylation, transporter activity [CoAT1G24430 (29788_c0_g1), CoNPF3.1 (40839_c0_g1), CoK4D641(gene43291_c0_g3)], xyloglucan: xyloglucosyl transferase activity [CoXTH7 (30876_c0_g5)], protein binding [CoP02879(41442_c2_g1)], peroxidase activity [CoPOD (42101_c0_g1)], and xylem development [CoPRX52 (29405_c1_g1)] (Fig. 7C, D, Additional file 10: Table S9). CoXTH7 (30876_c0_g5) can catalyze xyloglucan endohydrolysis (XEH) and/or endotransglycosylation (XET). It may be an essential constituent of the primary cell wall that participates in cell wall biogenesis. CoNPF3.1 (40839_c0_g1) is annotated into the nitrate transporter1/peptide transporter (NPF) family, and it has been described as transmembrane transporter, gibberellic acid homeostasis, and nitrate assimilation in Arabidopsis thaliana [21]. It may improve the development of buds and the growth of roots [22].
Validation of the expression of DEGs by qRT-PCR
To validate the reliability expression profiling obtained by RNA-seq, 12 DEGs with different expression patterns in cotyledons present in the seed were selected for qRT-PCR analysis. The genes chosen for qRT-PCR analysis included 9 TFs and 3 signaling-related hormones genes. For all these genes, the results of qRT-PCR exhibited almost similar expression patterns as compared to those in the RNA-seq data (Fig. 8A), confirming a high correlation between the RN-seq and qRT-PCR data (Fig. 8B).