Transcriptome sequencing and annotation
To identify as many transcripts as possible, equal amounts of total RNA from the roots, stems and leaves of T. cuspidata plants were pooled together and reverse transcribed into cDNA. To minimize bias that favors sequencing shorter transcripts, unfiltered and > 4 kb cDNA fragments were equally mixed and used to construct sequencing libraries. Using the PacBio Sequel platform, a total of 5,678,524 subreads with an average length of 2047 bp were generated. According to the bioinformatics procedure shown in Figs. 2, 568,432 CCSs were obtained from the subreads by removing adapters and artifacts. After read clustering and consensus calling, 181,230 sequences were retained as final consensus transcripts for subsequent analysis, which included evidenced-based gene model construction and candidate gene discovery. These consensus transcripts were further corrected with Illumina short reads to improve the accuracy (Proovread), and redundant transcripts were then removed with CD-HIT software, generating 148,038 nonredundant transcripts. Finally, Cogent software was used to reconstruct gene models and visualize the AS. Among all nonredundant transcripts, 103,072 transcripts were reconstructed into 13,636 UniTransModels, and 44,966 transcripts were left as singletons, which had only one splicing type. In total, 58,602 unigenes (UniTransModels + singletons) were obtained, each of which was bioinformatically predicted to be one genomic locus. The completeness of the transcriptome was assessed by the BUSCO method with the Embryophyta (ODB10) core gene dataset. Approximately 91.1% of the 1440 expected embryophyte genes were identified as complete, indicating the high integrity of the T. cuspidata transcriptome.
To compare the integrality of transcripts from Iso-Seq and RNA-Seq, RNA-Seq data from the same samples were assembled with Trinity and SOAP, producing 63,872 and 53,781 unigenes, respectively. As shown in Fig. 3a, the Iso-Seq unigenes had much longer CDSs than those from RNA-Seq. For unigenes produced by Trinity and SOAP, CDSs less than 1000 bp accounted for 84.1 and 88.1% of the total, respectively, numbers obviously higher than those generated by Cogent software (56.5%). For unigenes from Cogent, the CDSs with more than 2000 bp accounted for 14.2% of the total, compared with Trinity (3.4%) and SOAP (2.3%). Next, we further investigated the transcript integrality by aligning the unigenes from the different sequencing platforms to those of a well-curated full-length protein database, UniProt Swiss-Prot (UniProtKB). Compared to the RNA-Seq unigenes, a significantly increased percentage of the Iso-Seq unigenes contained full-length ORFs (covering 100% of curated full-length proteins) or nearly full-length ORFs (covering 90% of curated full-length proteins) (Fig. 3b). We compared our 58,602 unigenes with 18 genes encoding enzymes determined to be involved in Taxol biosynthesis and found that most (14 out of 18) were nearly fully represented by long-read sequences (95% coverage). These results suggested that Iso-Seq confers a substantial advantage in the production of full-length transcripts over RNA-Seq.
To capture the most informative and complete annotation information, we used a basic local alignment search tool (BLAST) to annotate all the unigenes based on sequence similarity searches against public databases, including the Swiss-Prot, KEGG, GO, KOG, NCBI Nr, and NCBI Nt databases. In addition, annotation was performed with hmmScan based on a domain similarity search against the Pfam database. In total, 42,920 unigenes were successfully matched to known sequences or domains in at least one of the seven databases, and 10,388 unigenes were annotated in all the databases (Additional file 4: Figure S1). In addition, a total of 9523 transcripts were predicted to be high-confidence lncRNAs by CPC, CNCI and Pfam protein structure domain analysis (Additional file 5: Figure S2). LncRNAs are important regulators of the secondary metabolism pathway, regulating gene expression on multiple levels via a number of complex mechanisms [46, 47].
To functionally classify the T. cuspidata unigenes, GO terms were assigned to each unigene using BLAST2GO based on the best BLASTx hit from the NR database. In total, 23,966 unigenes were assigned GO terms, which were classified into three major categories (biological process, cellular component and molecular function) (Additional file 6: Figure S3). The major subgroups of biological processes were “cellular process” (GO: 0009987) and “metabolic process” (GO: 0008152). In the cellular component category, unigenes involved in the “cell part” (5120, 22.2% of the total) and “cell” (5120, 22.2%) were highly represented. For the molecular function classification, the major categories were “binding” (GO: 0005488) and “catalytic activity” (GO: 0003824). Additionally, to explore the biological functions and interactions of unigenes in T. cuspidata, 58,602 unigenes were searched against the KEGG database. A total of 38,547 unigenes had significant matches in the database and were assigned to 363 KEGG pathways, which were categorized into five subcategories as follows: organismal systems, genetic information processing, cellular process, environmental information processing, and metabolism (Additional file 7: Figure S4). The most heavily enriched KEGG pathways were related to metabolic pathways. Among these pathways, the “biosynthesis of secondary metabolites” pathway included 8155 unigenes, providing a valuable resource for further gene function research. The pathways with the highest unigene representations were those related to carbon metabolism (ko01200; 496 unigenes), biosynthesis of amino acids (ko01230; 449 unigenes), purine metabolism (ko00230; 283 unigenes), oxidative phosphorylation (ko00190; 235 unigenes), and amino sugar and nucleotide sugar metabolism (ko00520, 224 unigenes).
Gene expression analysis
To identify gene expression differences in different Taxus tissues, we analyzed the expression patterns of 58,602 unigenes and found 3627 differentially expressed genes (DEGs) (Fig. 4a). The largest differences were observed between leaves and roots, in which 2611 DEGs were detected, including 1437 upregulated unigenes and 1174 downregulated unigenes. The smallest difference existed between stems and roots, in which 1115 DEGs were detected, including 199 upregulated unigenes and 916 downregulated unigenes. Between stems and leaves, 1482 DEGs were detected, including 649 upregulated unigenes and 832 downregulated unigenes, and 50 unigenes were differentially expressed in any two of the three tissues (Fig. 4b). In other words, the expression differences were more remarkable in the leaf versus root comparison than in the leaf versus stem and stem versus root comparisons. The functional categories of 402 total unigenes upregulated in roots were further analyzed using the KEGG database, and the significant DEGs were represented in the five main KEGG categories. Among them, 227 unigenes were associated with secondary metabolism (Additional file 8: Figure S5). The biosynthesis of secondary metabolites and plant hormone signal transduction were each significantly enriched in the unigenes specifically expressed in roots. In addition, 161 DEGs belong to TFs, 104 of which were upregulated in roots.
As shown in Fig. 1, the biosynthesis of Taxol is a complex process because skeleton modifications require the participation of several hydroxylases, ACTs and mutases. To investigate whether the genes involved in Taxol biosynthesis are coexpressed, we analyzed the expression patterns of 12 characterized genes in this pathway using real-time PCR. As shown in Fig. 4c and Additional file 3: Table S3 qRT-PCR results are, in most cases, consistent with the expression levels calculated from RNA-Seq data. The gene encoding the signature enzyme TS was obviously expressed at a higher level in roots than in stems and leaves, while its expression in leaves was slightly higher than that in stems (Fig. 4c). The PAM gene, involved in side chain synthesis, exhibited expression patterns different from those of TS, as PAM was expressed at the lowest level in the root, followed by the stem, and exhibited the highest expression in the leaf. Among ten genes encoding enzymes modifying the Taxol skeleton, six had expression patterns similar to those of TS. In the left four genes, TAT showed high expression in all three tissues, while T2αH and T7βH showed moderate expression in all tissues; T13αH exhibited an expression trend opposite that of TS, showing the lowest expression in the root and the highest expression in the leaf. According to our coexpression analysis, we hypothesized that the enzymes involved in the stem biosynthetic pathway are TS, T5αH, TAT, T10βH, TBT, DBAT, BAPT and DBTNPT. In addition, several genes in the first part of the biosynthetic pathway were expressed at substantially higher levels than those in the final steps. In particular, the gene encoding the last enzyme, DBTNPT, was expressed at the lowest level in all plant tissues, which is one of the reasons underlying the extremely low Taxol content in T. cuspidata. Therefore, we hypothesized that dramatically improving DBTNPT expression will substantially contribute to Taxol production. We also investigated the coexpression of Taxol-specific genes with several upstream genes and found that GGPPS and two genes from the MEP pathway, DXS and DXR, were consistently coexpressed with most of the downstream genes, such as TS, T5αH, T10βH, TBT, DBAT, BAPT and DBTNPT, while HMGS and HMGR from the MVA pathway had expression patterns different from those downstream genes, suggesting that the isoprene precursors for Taxol biosynthesis may primarily derive from the MEP pathway, which is consistent with previous reports [48].
Alternative splicing analysis
AS is a regulated process that increases the diversity of an organism’s transcriptome and proteome, mediating plant biological processes ranging from plant development to stress responses [21, 22]. In our results, transcript isoforms were identified by aligning individual long-read consensus transcripts back to the reconstructed full-length unigenes. We identified 13,636 unigenes undergoing AS events, of which 8627 were assigned GO terms, which were classified into three major categories (biological process, cellular component and molecular function) (Fig. 5a). GO enrichment analysis showed that these AS genes are highly enriched in binding, catalytic activity, metabolic processes and cellular processes.
Six unigenes were used to validate the authenticity of the AS events using the RT-PCR method, including two genes encoding CYP450s, one encoding a TF and three encoding splicing factors (Fig. 5b). Specific primers were designed to amplify the fragments of the predicted transcripts. The PCR fragments matching the predicted sizes were subsequently sequenced with the Sanger sequencing method. All the AS events were confirmed, suggesting the reliability of our analytical procedure even in the condition without an available reference genome. Surprisingly, the AS of some genes exhibits a tissue-preferential pattern. For example, only isoform 3 of TcuCYP709K2 (Tcuc14855), which has a correct ORF and can be translated into a functional protein, was found in the leaf, while three isoforms existed in the stem. For unigene TcuCYP866A22 (Tcuc27867), the leaves produced mainly isoform 2, while the stem produced isoform 3, and the root produced isoform 1 and isoform 2. Isoform 2 had a correct ORF and the ability to synthesize active protein. The molecular mechanism underlying the tissue-preferential pattern of AS and its role in regulating gene expression and encoded protein function require further study.
Putative CYP450 genes involved in Taxol biosynthesis
Plant CYP450s are heme-containing enzymes that play roles in a wide variety of both primary and secondary metabolism reactions [49,50,51]. To date, five CYP450s have been shown to be involved in Taxol biosynthesis, and at least three missing steps in the pathway are thought to be catalyzed by enzymes from the CYP450 superfamily. Based on the annotation results, a total of 139 full-length or near full-length CYP450s (458 ~ 673 amino acids in length) with intact Pfam CYP450 domains were classified by alignment with the CYP450 database using standard sequence similarity cutoffs, specifically 40, 55 and 97% for family, subfamily and allelic variants, respectively. Thus, the 139 TcuCYP450s were classified into 36 families and 63 subfamilies (Additional file 9: Figure S6). Among 139 members, 79 TcuCYP450s were identified for the first time in T. cuspidata, and 66 TcuCYP450s were gymnosperm-specific CYP450s from 6 families (CYP750, CYP867, CYP725, CYP947, CYP864 and CYP866) and 3 subfamilies (CYP76AA, CYP720B and CYP716B).
Four previously characterized CYP450 genes encoding T2αH [52], taxadiene 5α-hydroxylase [53], taxoid 10β-hydroxylase [54] and taxoid 14β-hydroxylase [55] were found in our transcriptomes. To date, all of the characterized CYP450s involved in Taxol biosynthesis belong to the CYP725A subfamily. We inferred that 10 novel CYP725A unigenes may be candidate CYP450s of the Taxol pathway (Fig. 6a). Moreover, Zhang et al. [56] identified two candidate genes with high similarity to Taxus CYP450s via analysis of high-throughput RNA sequencing data from G. biloba and found that G. biloba suspension cells exhibit taxoid 9α-hydroxylation activity. This CYP450 belongs to the CYP716B subfamily, suggesting that C9 hydroxylases in T. cuspidata may belong to the CYP716B subfamily (Fig. 6a). Therefore, we chose to further analyze the expression patterns of 13 TcuCYP450 unigenes in different tissues, including 10 unigenes encoding enzymes from the CYP725A subfamily and 3 TcuCYP450 unigenes from the CYP716B subfamily (Fig. 4c). The qRT-PCR profiles showed that 8 novel unigenes in the CYP725A subfamily and one unigene in the CYP716B subfamily had tissue-specific expression patterns similar to those of known genes in the Taxol biosynthetic pathway (Fig. 4c). In particular, the expression levels of TcuCYP725A21 and TcuCYP725A9 in roots were more than 47-fold and 19-fold higher than those in leaves and 96-fold and 60-fold higher than those in stems, respectively. Considering all the evidence, most genes in the Taxol pathway were highly expressed in the root, and we inferred that the above 9 unigenes are candidates for Taxol biosynthesis.
Putative BAHD ACTs in Taxol biosynthesis
Until now, five ACTs have been shown to participate in the Taxol biosynthetic pathway, including TAT, TBT, DNTBAT, DBAT and BAPT, and all belong to BAHD ACTs [11, 57,58,59,60]. These ACTs deliver an acyl group from a corresponding acyl-CoA thioester to a Taxol pathway intermediate, catalyzing either O- or N-acyl group transfer reactions. Some acetylases in the Taxol pathway remain unidentified, such as the enzyme that can catalyze taxa-4(20),11(12)-diene-5α-ol into the important precursor 2-debenzoyltaxane (Fig. 1). In total, 39 BAHD ACTs were found in the T. cuspidata transcriptome. In general, BAHD ACTs share 2 conserved regions, HXXXD and DFGWG motifs [12]. MEME analyses showed that among the 39 protein sequences, 8 sequences lacked an intact HXXXD motif or a DFGWG motif. The thirty-one BAHD ACTs with intact BAHD domains and some characterized BAHD ACTs from other plants were subjected to phylogenetic analyses, and all enzymes were divided into five distinct clades (Fig. 6b).
All BAHD ACTs from T. cuspidata were clustered into three clades, Clades I, IV and V. Clade V plays an important role in the acetylation of amino groups to form amides and transfer hydroxycinnamoyl- or benzoyl-CoAs. To date, all characterized BAHD ACTs involved in Taxol biosynthesis belong to Clade V. We found that 21 of 31 BAHD ACTs from T. cuspidata belonged to Clade V. Among them, TcuACT31 showed high similarity to the identified TAT (98%), which catalyzes taxa-4(20),11(12)-diene-5α-ol into taxa-4(20),11(12)-diene-5α-yl acetate in the Taxol pathway. However, we did not find the other four characterized BAHD ACTs in the T. cuspidata transcriptome. According to the real-time results, the four enzymes were expressed at substantially lower levels in T. cuspidata than TcuACT31 (Fig. 4c), which may lead to difficulty in their identification by sequencing analysis. The expression levels of twelve enzymes that were phylogenetically closest to the five characterized BAHD ACTs were analyzed using real-time PCR technology. Four members (TcuACT8, 13, 17, and 21) had the same expression patterns as TAT and TBT, as their expression quantities in root tissue were similar to that in leaf tissue but significantly higher than that in stem tissue (Fig. 4c). In addition, three members (TcuACT9, 10, 30) exhibited expression trends similar to those of BAPT, DBAT and DBTNBT, as their levels in roots were significantly higher than those in stems and leaves. We inferred that these seven BAHD ACTs are possible candidates for the Taxol pathway.
Putative transcription factors regulating Taxol biosynthesis
We herein identified 1940 unigenes representing putative TFs distributed across 61 families and including bZIPs, bHLHs, WRKYs, and MYBs. The number of TFs is comparable to that of another diterpenoid-producing plant, S. miltiorrhiza (1948 TFs), and to that of the model plant A. thaliana (2357 TFs). As shown in Fig. 7a, we found 355 putative C2H2-type zinc finger-containing proteins in T. cuspidata, an obviously higher number than those in the angiosperm A. thaliana and S. miltiorrhiza. Plant C2H2 zinc finger proteins are mainly involved in the growth and development of plants at various stages and the regulation of gene expression under environmental stress, including extreme temperatures, salinity, drought, oxidative stress and excessive light [52]. The phenomenon of C2H2-type zinc finger protein families being more abundant than other families might be the result of more gene duplication events in the zinc finger-containing protein genes in T. cuspidata genomes during evolution, resulting in the rapid expansion of these genes, and they may have special functions in T. cuspidata. Seven TFs have been identified to be involved in Taxol pathway regulation, including one WRKY, three bHLHs, two ERFs and one AP2 TF [16,17,18,19].
WRKY proteins constitute one of the largest classes of TFs in plants and contain at least one highly conserved WKRY domain, which is essential for binding to DNA cis-elements [53]. More than ten plant WRKY TFs have been shown to regulate secondary metabolism, including TcWRKY1, PqERKY1, and CrWRKY1 [16, 54, 55]. TcWRKY1 from T. chinensis is the sole WRKY TF that regulates Taxol biosynthesis. TcWRKY1 was reportedly able to specifically bind to W-box elements (TTGAC(C/T)) within the DBAT promoter and activate DBAT expression [16]. A total of 36 proteins with complete WRKY domains were categorized into 3 groups and 7 subgroups (Additional file 10: Figure S7) according to Eulgem’s method [61]. We did not find the TF corresponding to TcWRKY1 in the T. cuspidata transcriptome, although six TcuWRKYs were divided into the same subgroup, Group IIa, with TcWRKY1 (Fig. 7b). Among them, TcuWRKY45 shared 88% identity with TcWRKY1 at the protein level and was expressed significantly more in roots than in stems and leaves (Fig. 7e). In addition, TcuWRKY45 had the same WRKY domain as TcWRKY1 (Fig. 7b), suggesting that it may also bind to the DBAT promoter, although its actual roles in Taxol biosynthesis require further elucidation.
bHLH proteins are the second largest class of TFs found in plants, and they all contain a bHLH DNA binding domain [57]. Three methyl jasmonate (MJ)-inducible bHLH TFs (TcJamyc1, TcJamyc2 and TcJamyc4) have been identified in T. cuspidata and shown to have a negative effect on Taxol biosynthesis [17]. Sixty-four TcubHLH TFs with intact bHLH domains were grouped into 13 subgroups by phylogenetic analysis according to Heim’s method [62]. (Additional file 11: Figure S8). TcJamyc1, TcJamyc2 and TcJamyc4 all belonged to subgroup IIIe, and three bHLH proteins from this transcriptome were divided into the same subgroup, among which TcubHLH74 and TcubHLH70 corresponded to TcJamyc1 and TcJamyc4, respectively. We did not find TcJamyc2 in this transcriptome dataset. TcubHLH94 showed the highest similarity to TcJamyc2 and had the same DNA binding domain, suggesting its possible role in regulating the Taxol biosynthetic pathway (Fig. 7c, and Additional file 11: Figure S8). Interestingly, as shown in Fig. 7e, TcJamyc4 showed its highest expression in roots, while TcJamyc1 and TcubHLH94 showed their lowest expression in roots. Therefore, the true roles of these TFs in Taxol biosynthesis need further study.
The AP2/ERF proteins are a large class of plant-specific TFs that share well-conserved AP2/ERF-type DNA binding domains [58]. To date, two ERFs (TcERF12 and TcERF15) from T. chinensis and one AP2 from T. cuspidata have been identified to regulate Taxol biosynthesis [18, 19]. A total of 119 AP2/ERF superfamily members in T. cuspidata were found in this transcriptome dataset, including 8 AP2s, 49 DREBs, 55 ERFs and 6 RAVs [63, 64]. We did not find the identified AP2, which belonged to group II with one AP2 domain, while all 8 of the AP2 family members in this study belonged to group I with 2 AP2 domains. Based on phylogenetic analysis, forty-three TcuERFs with complete domains were categorized into 6 subgroups (Additional file 12: Figure S9). Two identified TcuERFs, TcERF12 and TcERF15, belong to subgroups B1 and B3, respectively. Ten subgroup B1 members and nineteen subgroup B3 members were found in the T. cuspidata transcriptome. Among those TcuERFs, TcuERF34 and TcuERF31 are the enzymes corresponding to TcERF12 and TcERF15 in T. cuspidata, respectively (Fig. 7d). In Group B1, TcuERF77, 81 and 107 were closer to TcERF12; in Group B3, TcuERF18, 25, 44 and 99 were closer to TcERF15 (Additional file 12: Figure S9). Among those TcuERFs, TcuERF18, 44, 77, 81, and 107 had a similar expression pattern to TcERF12 and TcERF15, with the highest expression in the roots (Fig. 7e). These five ERFs are regarded as lead candidates for regulating Taxol biosynthesis in T. cuspidata. Further work is needed to verify the predicted roles of these ERFs in the regulation of Taxol biosynthesis.