G. australe FL transcriptome sequencing and bioinformatics pipeline
To identify as many transcripts as possible, we extracted high-quality RNA from ten tissues of G. australe F. Mueller at different developmental stages or treatments and verified that the RNA quality met the requirements for reverse transcription (Additional file 1:Figure S1). Then, we sequenced barcoded single-molecule real-time (SMRT) Bell libraries on the PacBio Sequel platform using the latest P6–C4 chemistry. As a result, 1,001,928 circular consensus sequences (CCS) were obtained from five SMRT cells, and the expected distribution of transcript lengths ranged from 200 to 10,000 nucleotides (Additional file 2:Figure S2). All CCS reads were further classified into 656,418 full-length non-chimeric (FLnc) reads, which had the 5′ and 3′ barcoded primers and the poly (A) tail simultaneously, and 242,894 non-full-length (nFL) sequences lacking any one of the 5′ primer, 3′ primer or poly (A) tail. After using the Iso-Seq cluster’s clustering algorithm, ICE (iterative clustering for error correction), we finally obtained 320,357 FL consensus isoform sequences (Additional file 14:Table S1). Afterward, we employed ‘LoRDEC’ [17] to correct errors using RNA-Seq reads (Additional file 15:Table S2). Thus, about 22 million (3.11%) errors were corrected, with 0.77% insertions, 1.75% deletions and 0.60% SNPs, being consistent with the above polished consensus isoforms (Additional file 15:Table S2).
Reconstruction of G. australe transcript loci
To reconstruct FL transcript models for G. australe, we developed a computational pipeline using several publicly available tools (Fig. 1a). High homology is known to exist between genes from G. australe and those from other Gossypium species, and COding GENome reconstruction Tool (Cogent) has limitations when analysing highly similar genes from large families [18]. Therefore, the isoforms were first collapsed using Cupcake based on a G. raimondii reference [19], and then Cogent was employed to reconstruct the resulting isoforms into one or several FL unique transcript model(s) based on the k-mer clustering and De Bruijn graph methods. As a result, we mapped 307,829 (96.09%) isoforms to the G. raimondii reference and collapsed 276,459 (86.30%) isoforms into 26,627 transcript loci by Cupcake. The remaining uncollapsed isoforms were presumed from G. australe species-specific sequences. Cogent was then subsequently employed to reconstruct the 43,898 remaining isoforms and generate 5165 ‘reconstructed gene sequences’. By applying the reconstructed gene sequences as a reference, we obtained 5274 transcript loci from the remaining isoforms. Finally, we collapsed 320,357 FL isoforms into 92,728 non-redundant isoforms derived from 31,904 loci.
Rarefaction analysis using subset of the FL reads revealed that sequencing depth was almost saturated for transcript loci discovery, implying that more sequencing data would not facilitate to discover much more transcript loci (Additional file 3:Figure S3a). While investigating on the relationship between sequencing depth and transcript discovery, we found that saturation reached slowly at the isoform level (Additional file 3:Figure S3b). This result reflected the overabundant transcripts in G. australe transcriptome, raising the possibility that lowly abundant or rare transcript isoforms remain to be discovered.
To verify the PacBio isoforms and remove isoforms with false splice junctions, we analysed splice motif from 79,390 (85.62%) non-redundant multi-exon isoforms. The results showed that a total of 177,349 splice junctions were captured from these multi-exon isoforms. Among them, most of the splice junctions (84.65%) were the canonical GT-AG, followed by non-canonical sites (GC-AG with 1.52%, AT-AC with 0.40%, and others with 13.43%) (Additional file 16:Table S3). To further validate the reliability of the above splice junctions detected in our Iso-Seq, RNA-Seq reads from leaf and seed germinated for 5 h, 15 h or 30 h were employed to map using STAR. The result demonstrated that 61.10–73.37% of the GT-AG were validated, followed by GC-AG (25.69–33.17%) and AT-AC (3.57–4.56%). None of the other non-canonical sites were validated by RNA-Seq. Totally, 82.08, 44.54 and 5.55% of GT-AG, GC-AG and AT-AC splice junctions in Iso-Seq were verified using RNA-Seq, respectively (Additional file 16:Table S3). The remaining non-supported splice junctions were presumed to be false splice junctions or tissue-specific expression of genes/transcripts. For example, the gene PB.3866, a seed-specific expression exon, was detected in RNA-Seq from seed germinated for 15 h and 30 h (Additional file 4:Figure S4). The isoform of PB.3866 contains a GT-AG junction that was not supported by the G. australe leaf or seed RNA-Seq, inferring that this isoform might be tissue-specific expressed (Additional file 4:Figure S4). Thus, the splice junctions without RNA-seq-supported non-canonical motif were presumed as false splice junctions, and then the isoforms containing these false splice junctions were removed at last. Finally, 70,803 isoforms derived from 29,597 loci were for further analysed.
Functional annotation of FL gene transcript loci in G. australe
Candidate open reading frames (ORFs) within the transcript sequences were detected by TransDecoder [20] with a minimum length of 100 amino acids (aa), and a log-likelihood score was calculated for each of the three reading frames. To maximize sensitivity for capturing ORFs with functional significance, we scanned all ORFs for homology to the Pfam protein domain database or the protein sequences curated in the NCBI NR (non-redundant proteins), Ensembl Plants and Swiss-Prot UniProt databases. When multiple ORFs appeared in a single transcript, the first ORF in each transcript was considered the ‘representative’. The analyses indicated the presence of 61,805 non-redundant isoforms (87.29% of the total) that contained high-quality ORFs with lengths ranging from 300 bp to 9345 bp with an average of 1171 bp, and these ORFs pertained to 25,246 transcript loci (85.30% of the total). Only 4, 351 transcript loci had no ORFs, inferred as non-coding RNAs, and these loci are addressed in the next section. The isoforms that contained the longest ORF in each transcript locus were employed as the ‘locus representative’ for subsequent comprehensive functional annotation.
To estimate the completeness of our transcriptome sequencing, we used BUSCO [21] for gene diversity estimation. BUSCO checks for essential single copy orthologs should be present in a whole transcriptome dataset for any member of the given lineage. The Embryophyta lineages (1440 conserved proteins) were used to examine transcriptome completion, in which the FL consensus isoforms and gene loci were used to ensure that completeness tracked across the collapsed steps. In our results for BUSCO alignments, the gene loci showed a lower completeness level than the FL consensus isoforms, being 80.42 and 91.04%, respectively (or 85.28 and 94.44%, respectively, when fragmented BUSCOs were counted) (Additional file 5:Figure S5). The FL consensus isoforms had a higher level of duplication in the BUSCO alignment, suggesting that the collapsed strategy successfully reconstructed transcript models and reduced duplicate isoforms from FL dataset. Finally, the entire representative CDS from each gene locus was aligned to the G. raimondii genome, and 22,733 (90.05%) loci were uniquely mapped onto 13 homologous chromosomes (Additional file 6:Figure S6). The number of genes located on each chromosome ranged from 1086 to 2851 (Additional file 6:Figure S6), and the numbers were positively correlated with chromosome length. In pericentromeric regions, which are known to be gene-poor and transposon-rich, very few genes were mapped. Most genes were distributed on the proximal ends in gene-rich regions (Additional file 6:Figure S6).
Most of the transcript loci (24,742, 98.00%) exhibited homology with at least one protein database, and 19,661 (77.88%) were found in all three protein databases (Fig. 1b). In addition, 20,543 transcript loci matched Pfam protein domains, and all but 16 of these loci exhibited homology with at least one protein database as well (Fig. 1b). Among the transcript loci collapsed in our analysis, 504 did not return any matches that contained an ORF; hence, they were considered potentially novel genes in G. australe (Additional file 17: Table S4). PlantTFDB [22] was employed to predict transcription factors from those gene loci. The results showed 1575 putative transcription factors from 55 families (Additional file 18:Table S5), representing 6.24% of the protein-coding genes. Then, using Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) software, we further assigned the genes to different functions. Finally, 18,722 (74.16%) and 5691 (22.54%) of the reconstructed genes were assigned to different functional terms in three GO categories and KEGG pathways, respectively (Additional file 17:Table S4). Among these GO categories, the functional terms ‘cellular process’, ‘metabolic process’ and ‘response to stimulus’ were identified to be the top three most common annotations among the ‘biological process’ GO terms (Additional file 7:Figure S7). Importantly, the G. australe genes annotated in ‘response to stimulus’ are valuable natural resources, which could be used to improve the resistance of cultivated species in future breeding programmes. In the KEGG annotated gene sets, the majority were classified into ‘Metabolism’, of which ‘Carbohydrate metabolism’, ‘Amino acid metabolism’ and ‘Lipid metabolism’ were the three most abundant pathways (Additional file 8:Figure S8). In particular, 270 (1.07%) G. australe genes belonged to ‘Metabolism of terpenoids and polyketides’. 13,005 single copy orthologs were detected by comparing with the G. raimondii [19], G. arboreum [23], G. hirsutum [24] (At and Dt subgenome) using OrthoMCL [25]. In addition, 7082 genes were found to be unique in G. australe. Functional enrichment analysis showed that these genes were highly enriched in ‘Biosynthesis of secondary metabolites’ in KEGG and ‘response to stimulus’ in GO. These genes would be employed to address some species-specific biological issues of G. australe in further research.
Non-coding RNA analysis
To identify candidate non-coding RNAs from the G. australe PacBio data, 4351 (14.70%) non-ORF transcript loci were searched against the miRNA stem-loop sequences (28,645 records) curated in miRBase by BLASTN (E value <= 1e-5). In total, 86 transcript loci matched 65 pre-miRNA sequences (Fig. 1c). Among them, 28, 12 and 2 matched miRNAs from G. raimondii, G. hirsutum and G. herbaceum, respectively. The remaining 44 matched miRNAs from other species, suggesting that these are novel Gossypium spp. miRNAs (Additional file 19:Table S6). The remaining 4265 transcript loci out of 4351, which had lengths greater than 200 nucleotides, were subjected to lncRNA detection. Specifically, we used the longest isoform in each transcript locus as the ‘locus representative’ to validate its coding potential with CPC (Coding Potential Calculator) [26] and CNCI (Coding-Non-Coding Index) [27]. Finally, 1792 and 2415 of the loci were predicted as non-coding RNAs by CPC and CNCI, respectively (Additional file 9:Figure S9). Among them, 1468 common loci were predicted to yield non-coding RNAs by both programs, and these loci were considered lncRNAs (Fig. 1c and Additional file 20:Table S7). Only six of these 1468 lncRNAs were annotated with known non-coding RNAs (ncRNAs) by searching against ncRNAs in the NONCODE [28] database (Additional file 20:Table S7). To determine the functions of these lncRNAs, including the unannotated ones, further characterization must be performed experimentally.
AS detection in the G. australe transcriptome
AS of precursor mRNAs (pre-mRNAs) from multi-exon genes allows organisms to increase their coding potential and regulate gene expression through multiple mechanisms [29,30,31,32,33]. Approximately 60% of protein-coding genes in plants contain introns that are excised to generate mature transcripts [29,30,31]. Previous reports have shown that plants have much more complex transcriptomes and that their FL transcripts are more sensitive than RNA-Seq assemblies for AS discovery, especially for genes with isoform complexity [8, 12]. PacBio Iso-Seq has proven to be a powerful tool for detecting complex AS events at the whole-transcriptome scale [34, 35]. Based on the reconstructed transcript loci, we found that 12,414 (49.17%) loci generated a single isoform, and approximately 12,832 (50.83%) genes produced two or more isoforms, while 803 (3.18%) genes had ten or more splice isoforms detected (Fig. 2c). Then, AStalavista [36] was used to ascertain the frequencies of the five main types of AS (exon skipping, ES; alternative acceptor site, AA; alternative donor site, AD; mutually exclusive exons, MX; and intron retention, IR) in the G. australe transcriptome (Fig. 2a). A total of 31,448 AS events were identified from 9944 genes in the Iso-Seq dataset (Additional file 21:Table S8), among which IR was the most abundant, accounting for 68.85% of AS events (Fig. 2b). Our result is consistent with those of previous reports showing that IR was the most frequent type of AS event in plants [30, 31, 37,38,39,40,41,42]. IR, AA (15.64%) and AD (9.31%) together accounted for more than 90% of the detected AS events in the G. australe transcriptome (Fig. 2b). For example, the gene PB.15419 showed nine different isoforms by Iso-Seq, including nine splice junction sites. All of these isoforms were confirmed by RNA-Seq from leaf tissue, and eight of the junction sites were strongly supported by at least 20 RNA-Seq short reads (Fig. 2d). GO enrichment analysis showed that these AS genes were highly enriched in ‘phosphorus metabolic process’, ‘organelle organization’, ‘protein modification process’, ‘organonitrogen compound catabolic process’, ‘dephosphorylation’ and so on (Additional file 10:Figure S10). Moreover, genes with detected AS were highly enriched in the GO terms of several important metabolic processes associated with ‘spliceosome’ (Additional file 10: Figure S10), which plays a central role in catalysing pre-mRNA splicing.
Alternative polyadenylation in G. australe
Alternative polyadenylation of the 3′ end of most transcripts is an important post-transcriptional modification, regulating gene expression through multiple mechanisms in most eukaryotic [43, 44]. Alternative polyadenylation caused by alternative cleavage increases transcriptome complexity. As full-length transcripts detected by Iso-Seq contain a poly(A) tail, it has advantages in the detection of alternative polyadenylation sites. Of the 25,246 genes detected by Iso-Seq, 14,536 genes have at least one poly(A) site, and 1075 genes have at least five poly(A) sites (Additional file 11:Figure S11a and Additional file 22: Table S9). The average number of poly(A) sites per gene was 2.16. Nucleotide composition flanking all poly(A) sites was analysed. There was clear nucleotide bias around poly(A) sites in G. australe with an enrichment of uracil (U) (upstream) and adenine (A) (downstream) (Additional file 11:Figure S11b). This nucleotide bias near poly(A) sites was consistent with previous reported [12, 13], suggesting that the identified poly(A) sites are credible. In addition, we performed MEME [45] to identify potential motifs enriched upstream of cleavage site. One conserved motif (AAUAAA) was identified around 25 nts upstream of the poly(A) site (Additional file 11:Figure S11c and d). This conserved motif is a known cis-elements as poly(A) signals in both plants and animals [43, 46].
Differential expression analysis during germination
To better understand the transcript dynamic during seed germination, pairwise comparison was carried out and significance was confirmed using edgeR [47]. In total, 12,993 genes were assigned as differentially expressed genes (DEGs), with an FDR cutoff of 0.001 and |log2(fold change)| ≥ 1. The result indicated that DEGs between germinated seed (5 h vs 30 h) were more than that between germinated seed (5 h vs 15 h and 15 h vs30h) (Additional file 12:Figure S12a). Totally, 6956 DEGs were identified from pairwise comparison during germinating seeds; and 12,613 DEGs were detected between germinating seed and leaf (Additional file 12:FigureS12b). G. australe seed did not appear pigment glands at 5 h germination while glands appeared at 30 h. We then performed GO enrichment to analyze the function of 3643 DEGs overlapped between 5 h vs 30 h and 5 h vs 15 h (Additional file 12:Figure S12b). Our result showed that 10 GO-slim terms were identified as over-representations based on an FDR < 0.05, such as thylakoid, transmembrane transporter activity, carbohydrate metabolic process, extracellular region and etc. (Additional file 12:Figure S12c). To further illustrate the relationships among DEGs with highly similarities of temporal expression patterns, 12,993 DEGs were clustered into four main groups (designated K1–4) using K-mean clustering algorithm in TCseq package (Additional file 12:Figure S12d). The genes in each group enriched in particular functional categories. Four terms were significantly enriched in K1, they are ‘chromosome organization’, ‘organelle organization’, ‘cellular component organization’ and ‘cell wall organization or biogenesis’. Genes in K2 were identified responsible for cell cycle and cell division. K3 was suggested to be involving in ‘photosynthesis’, ‘lipid metabolic process’ and ‘response to stress’. K4 group contained the most overrepresented GO terms in four groups, in which genes involving in “macromolecule biosynthetic process”, “peptide metabolic process”, “cellular biosynthetic process” and “gene expression” were greatly enriched. However, the functional categories of secondary metabolic process and sesquiterpene biosynthetic process were not enrichment in these clusters.
AS dynamics during seed germination in G. australe
Mature G. australe seeds are spotlessly white and without pigment glands. The pigment glands appear only after the seeds have germinated for more than 24 h [48]. To profile AS during the seed germination of G. australe, RNA-Seq paired-end reads during seed germination were employed to confirm the four major AS types (IR, AA, AD and ES) based on MISO [49]. The results demonstrated that the number of expressed genes (FPKM > = 1.0) rose slightly during seed germination, from 18,601 at 5 h to 20,616 at 30 h (Fig. 3a). The expressed gene number in the leaf (18,755) was slightly less than that in the germinating seed (Fig. 3a). Meanwhile, we found that the number of identified AS events increased during seed germination. In particular, many more AS events appeared in the leaf (3996), although the total number of gene loci was lower than that in the germinating seeds, indicating that AS plays an important regulatory role in the leaf tissues. In total, 4777 AS events were detected from 3338 genes in the four tissues. Among the four major AS types, IR and ES events were the most and the least frequent in all tissues, respectively (Fig. 3b). AA events were more frequent than AD events (Fig. 3b). These results are consistent with our Iso-Seq results (Fig. 2b). However, the percentage of IR detected by RNA-Seq was higher than that by Iso-Seq, accounting for over 80% of the total events (Fig. 3b). To characterize differentially expressed events during seed germination, we performed a pairwise comparison using MISO. In total, 124, 156 and 89 differential events were identified from 111, 139 and 86 genes, respectively, in the pairwise comparisons (at 5 h, 15 h and 30 h during germination) (Additional file 23:Table S10). The most abundant splicing difference was found between 5 h and 30 h of germination (Fig. 3c). Meanwhile, more differential AS events were present between 5 h and 15 h than between 15 h and 30 h, and this result is consistent with the finding that the number of detected AS events increased between 5 h and 15 h (Fig. 3a). No differential ES events were detected between 15 h and 30 h. We also performed a pairwise comparison based on tissues in the leaf and germinating seed (Fig. 3d), which showed 433, 553 and 208 differential events from 379, 496 and 184 genes, respectively (Additional file 23:Table S10).
To validate the robustness of the AS events detected, we randomly selected ten annotated genes to confirm their gene expression in germinating seeds and young leaves by semi-quantitative RT-PCR. Our RT-PCR indicated that the gel banding pattern and the size of the fragments were consistent with the AS events identified from the Iso-Seq data (Fig. 4). We found that the expression level of each isoform exhibited a tissue- and/or development-biased pattern. For instance, a small transcript fragment from the gene PB.15419 (SNF1-related protein kinase regulatory subunit gamma-1) was preferentially expressed in 5 h germinating seeds (expected size, 290 bp). A fragment from the gene PB.12382 (Pre-mRNA-processing factor) was expressed at the highest level in 5 h germinating seeds, gradually decreased during seed germination, and was not expressed in leaf (expected size, 455 bp). Overall, the results indicated that Iso-Seq was a powerful tool for the exploration of tissue-specific expression AS events/isoforms in combination with RNA-Seq data sets in the transcriptome.
Re-characterization of genes involved in terpene biosynthesis in G. australe
The molecular mechanisms of delayed gland morphogenesis in Australian cotton species are still poorly understood. Australian wild cotton contains immature lysigenous glands but no terpenoid aldehydes, and its pigment glands appear only after seed germination; thus, there is little gossypol deposited in the dormant seeds of the species [4]. Gossypol, a natural terpenoid aldehyde that is toxic to humans and monogastric animals, reduces the value of cottonseed despite its richness in oil and proteins. The key genes involved in the gossypol biosynthetic pathways have already been identified [48, 50]. Accordingly, we manually investigated isogenes encoding the enzymes involved in both the methylerythritol phosphate (MEP)- and mevalonate (MVA)- dependent biosynthetic pathways, as well as a potential cotton delta-cadinene synthase (EC 4.2.3.13) in the G. australe FL transcriptome. We found that 38 genes were involved in gossypol synthesis in the MEP- and MVA-dependent biosynthetic pathways. Then, the expression of these genes were investigated in germinating seeds or leaf by RNA-Seq. Our results revealed that genes coding the main rate-determining enzyme of the MEP-pathway, 1-deoxy-D-xylulose 5-phosphate synthase (DXS), were detected with low expression level in germinating seed. Otherwise, genes coding the enzymes involving in MVA pathway were abundance (Additional file 13:Figure S13). The high expression levels of MVA pathway genes in 5 h and 15 h germinating seed indicated that gossypol biosynthesis occurred early even the seeds surface without pigment glands detected by naked eyes. In addition, most gene loci (30, 78.95%) had multiple isoforms in G. australe (Additional file 24:Table S11). Moreover, the intron retention level of PB.13680, which encodes MK (mevalonate kinase), gradually decreased during seed germination in G. australe (Fig. 5a). The IR event from PB.13680 (the left border) was differentially expressed between the 5 h germinating seed and leaf. To validate the accuracy of the differential expression results, we designed primers for semi-quantitative RT-PCR at the region shown by a triangle in Fig. 5a. This gene could transcribe three isoforms, two of which were detected. We found that the fragment produced by an IR event was preferentially expressed in 15 h and 30 h germinating seeds (Fig. 5b). This result may suggest that gossypol biosynthesis is not only associated with gene expression but may also be regulated by different transcript isoforms resulting from AS.