De novo assembly and Transcriptome characterization of an endemic species of Vietnam, Panax vietnamensis Ha et Grushv., including the development of EST-SSR markers for population genetics

Background Understanding the genetic diversity in endangered species that occur inforest remnants is necessary to establish efficient strategies for the species conservation, restoration and management. Panax vietnamensis Ha et Grushv. is medicinally important, endemic and endangered species of Vietnam. However, genetic diversity and structure of population are unknown due to lack of efficient molecular markers. Results In this study, we employed Illumina HiSeq™ 4000 sequencing to analyze the transcriptomes of P. vietnamensis (roots, leaves and stems). Raw reads total of 23,741,783 was obtained and then assembled, from which the generated unigenes were 89,271 (average length = 598.3191 nt). The 31,686 unigenes were annotated in different databases i.e. Gene Ontology, Kyoto Encyclopedia of Genes and Genomes, Nucleotide Collection (NR/NT) and Swiss-Prot for functional annotation. Further, 11,343 EST-SSRs were detected. From 7774 primer pairs, 101 were selected for polymorphism validation, in which; 20 primer pairs were successfully amplified to DNA fragments and significant amounts of polymorphism was observed within population. The nine polymorphic microsatellite loci were used for population structure and diversity analyses. The obtained results revealed high levels of genetic diversity in populations, the average observed and expected heterozygosity were HO = 0.422 and HE = 0.479, respectively. During the Bottleneck analysis using TPM and SMM models (p < 0.01) shows that targeted population is significantly heterozygote deficient. This suggests sign of the bottleneck in all populations. Genetic differentiation between populations was moderate (FST = 0.133) and indicating slightly high level of gene flow (Nm = 1.63). Analysis of molecular variance (AMOVA) showed 63.17% of variation within individuals and 12.45% among populations. Our results shows two genetic clusters related to geographical distances. Conclusion Our study will assist conservators in future conservation management, breeding, production and habitats restoration of the species.


Background
Panax species (Araliaceae) are medicinally important plants of North America and eastern Asia [1,2]. In 19 species of the Panax genus [3,4], three known species, Panax vietnamensis, P. stipuleanatus and P. bipinatifidus are related to the high mountains of Vietnam [5,6]. Panax species are characterized by the presence of ginsenosides, which refer to a series of dammarane [7]. P. vietnamensis was found for the first time in Ngoc Linh mountain of Kon Tum province [8]. P. vietnamensis (Vietnamese Ginseng) is an endemic species of Vietnam, rich in saponin compound [1,8,9]. It is a perennial plant and grows up to 1 m in height, with a diameter of 4-8 mm under humus forest canopy. It has oval-shaped leaves with a serrated margin. The flowers are an inflorescence and fruit turn red at maturity with 1-2 white color seeds. P. vietnamensis is routinely used for the treatment of many serious diseases and enhancement of body stamina during mountainous journeys by the Sedang ethnic group [1]. Excessive exploitation in the past few decades and slow growth rate, poor regeneration of P. vietnamensis, the natural population sharply declines and put the species in endangered [10]. Therefore, it is listed in in the Vietnam Red Data Book 2007 (EN A1a,c,d, B1 + 2b,c,e) [11]. It is currently in the protection list of both the central and local governments of Vietnam. It needs urgent conservation and restoration, but main hurdle is the unexplored structure and genetic diversity of the P. vietnamensis wild populations due to unavailability of informative and reliable molecular markers for P. vietnamensis.
In the current study, (i) we have produced global transcriptome from P. vietnamensis using the Illumina HiSeq™ 4000 and analyzed functions, classification and metabolic pathways of the resulting transcripts. (ii) Then we have developed a set of EST-SSRs for P. vietnamensis and (iii) confirmed the efficacy of these markers by studying the genetic structure and diversity of three wild populations of P. vietnamensis. (iv) At last, the influences of geographical distance on genes flow within wild population were tested.

Results
De novo assembly and Illumina sequencing of P. vietnamensis transcriptomes Transcriptome sequencing of P. vietnamensis, a total of 7,083,775,547 bases were generated and after a stringent quality check 23

Functional annotation of assembled unigenes
For functional annotation analyses the unigenes were blasted against the seven databases (COG, GO, KEGG, KOG, Pfam, Swissprot and NR), a total 31,686 matched sequences was found (  Table 2). For the species distribution BLASTx was used to search against Nr databases, the P. vietnamensis transcriptome shows highest similarities with Elaeis guineensis (25%) followed by Phoenix dactylifera (22%) and Musa acuminata (9%) (Fig. 2).
Based on Nr annotations, we used the GO system to categorize the possible functions of the unigenes. A total of 72,183 (80.86%) unigenes was successfully grouped into three classes (biological process, molecular function and cellular component) and 51 subclasses (Fig. 3). The biological process was the top  58%) unigenes were related to "molecular function" in which prominent subcategories are "binding" (7459; 45.77%) and "catalytic activity" (7130; 43.75%). It was also observed that the few genes are enriched in the terms of "nutrient reservoir activity", "molecular carrier activity", "protein tag" and "translation regulator activity". A total of 7647 unigenes were assigned to Clusters of Orthologous Groups (COG), to check the reliability of the transcriptome library and effectiveness of the annotation process, for functional prediction and classification (Fig. 4). COG-annotated putative proteins were functionally classified into 25 categories. The top groups were "general function prediction only" (9089), "translation, ribosomal structure and biogenesis" (3388) and "transcription" (977), respectively. However, only few unigenes were annotated as "extracellular structures" and "nuclear structure." The KEGG pathway analysis was used to explore the biological pathways in P. vietnamensis that might be active with an E value cutoff < 10 − 5 . The 5838 unigenes was significantly matched in the KEGG database and assigned to 118 KEGG functional pathways (Fig. 5). The specific pathways, including plant hormone signal transduction, purine metabolism, ribosome, RNA transport spliceosome and many more pathways. In addition, 45 unigenes were in the terpenoid backbone biosynthesis pathway.

Genetic structure and diversity of population
A total of 98 individuals from three P. vietnamensis populations produced 27 different alleles, ranging from 120 to 265 bp at the nine loci (Table 4). In the current study,  Table 5, including the alleles mean number (A = 2.6), the effective alleles numbers (A E = 2.2), the proportion of polymorphic loci (92.59%), the observed heterozygosity (H O = 0.422) and expected heterozygosity (H E = 0.479). The fixation index (F) was positive for all the populations (F = 0.13). Therefore, these results showed heterozygosity deficiency and significant inbreeding (p < 0.05). Seven loci of the nine loci had positive fixation and indicating high homozygosity and inbreeding. However, among the loci, five loci had significant inbreeding (p < 0.05). Two loci had negative values.
During the Bottleneck analysis using Stepwise mutation model (SMM) and two phase model (TPM) models (p < 0.01) shows that targeted population is significantly heterozygote deficient (Table 5). This suggests sign of the bottleneck in all population.    The analysis of molecular variance (AMOVA) showed that total variation was highly significant (p < 0.001) within individuals i.e. 63.17% ( Table 6). The F ST were significant (p < 0.05), values range was from 0.072 to 0.182 (average = 0.133) and with 1.63 gene flow. Low genetic differentiation value (F ST = 0.072) was found between DT and TN population, while high value (F ST = 0.182) was between DT and KT population ( Table 7).
The genetic relationship of P. vietnamensis populations are expressed in Fig. 8. The DT and TN populations were grouped together and firmed one cluster with the bootstrap value of 100%. In the STRUCTURE analysis, the highest ΔK value (2032.81) (Fig. S1) for 98 individuals revealed K = 2 to be the optimum number of genetic clusters and indicated that all the studied plants exhibited admixture from two clusters. The percentage of ancestry of each population and individuals in two genetic groups shows that one group (red) was predominant in two populations (DT and TN) and second group (green) was predominant in one population i.e. KT (Fig. 9).

Discussion
Transcriptome sequencing/analysis is very effective tool for gene identification [53][54][55][56] and to identify gene expression at different developmental stages or physiological conditions of a cell [47]. Illumina HiSeq™ 4000 technology is effective, timeless, affordable, trusty tool for transcriptome description and gene detection in non-model plants as well. Previous studies showed that the numbers of ESTs were generated from P. ginseng leaves [48,57], P. notoginseng roots [58] and American ginseng (P. quinquefolius L.) flowers, leaves and roots [44]. To date, many researchers have studied the molecular markers for the genetic analysis of Panax species i.e. P. ginseng [40-42, 46, 48, 49], P. notoginseng [45]. However, due to unavailability of reference genome for P. vietnamensis, using Illumina HiSeq™ 4000 the produced reads were assembled through the de novo assembler Trinity. For the first time, we have reported comprehensive transcriptional information for the EST-SSR markers development and then explored the diversity and genetic structure of existent natural P. vietnamensis populations.
Illumina paired-end sequencing technology generated 23,741,783 clean and high quality reads with 93.5% Q30 bases and GC content 51.43%. The current results is higher than the previous study [2] on P. vietnamensis var. fuscidiscus (43.25%), indicating better quality sequencing. In the sequence assembly 89,271 unigenes (average length = 598.32 bp, N50 = 942 bp) were recorded, which was shorter than the results of Cao et al. [47] in P. ginseng (average length = 690-698 bp, N50 = 1130-1161 bp) and Zhang et al. [2] in P. vietnamensis var. fuscidiscus (average length = 1304 bp, N50 = 2018 bp). We had used the same technology; this might be the depth of sequencing, method of assembly and natural features of the plants. The transcriptome sequencing data of P. vietnamensis was further explored for genetic diversity, population structure and marker development. The 72,183 (80.86%) unigenes were annotated into 51 GO categories. The "metabolic activities", "binding" and "cell part" was on the top among biological activities, molecular function and cellular component, respectively. The results are in line with GO functional categories of P. vietnamensis var. fuscidiscus [2] and P. ginseng [47]. The predicted unigenes 5838 (5.42%) through KEGG pathways were mapped into 118 biological pathways and majority of pathways were related to metabolism. The specific pathways, including ribosome, RNA transport spliceosome, purine metabolism and signal transduction of plant hormone etc. These data unveil the active metabolic processes as well as synthesis of multifarious metabolites in the species. In P. vietnamensis and other Panax species, leaves have high value of aldehydes, esters and terpenoids, these compound help in resistance against biological as well as environmental pressures, such as cold, drought and pests.
In the current study, we have recorded the unigenes for signal transduction of the plant hormones that reacts to plant environmental conditions. Microsatellites are spread in plant genomes and involved in the regulation of their expression and function [13,59]. The studies on distribution of SSRs in species, the mechanism of SSR variation and comparison are the first step towards elucidation of the function [59,60]. There are many types of SSR markers and extensively spread in plant genomes [43,61]. From 8927 unigenes, 11,343 EST-SSR molecular markers were identified by RNA sequencing, while 2763 unigenes have EST-SSR locus more than one. Zhang et al. [2] also identified 21, 320 SSRs in P. vietnamensis var. fuscidiscus with 2918 containing more than one SSR. In the previous studies of Um et al. [49] on P. ginseng and Zhang et al. [2] on P. vietnamensis var. fuscidiscus in EST-SSRs di-nucleotide repeats (60.1 and 52.25%, respectively) were the most abundant type. We have identified SSR markers (11,343) of P. vietnamensis, the mono-nucleotide (5004; 44.12%), di-nucleotide (4648, 40.98%) and tri-nucleotide (1563, 13.78%) were the top repeats. The leading was dinucleotide, tri-nucleotide, tetra-nucleotide repeat motif in P. vietnamensis with AG/TC (90.06%), CCG/GGC (22.65%), ACTG/TGAC (19.30%), respectively. Which confirmed the study of Um et al. [49] on P. ginseng, but tri-nucleotide repeat motif was different in P. vietnamensis than other plants, such as Myricarubra [27], Salix, Populus and Eucalyptus [29]. The CG/CG motif (0.17%) was irregularly detected in P. vietnamensis, as also observed by Wu et al. [44], which confirmed that the repeat motif CG/CG is not common in numerous dicotyledon plants [62][63][64][65].
Genetic diversity has important character in the germplasm improvement and generally used in various medicinal plants [65][66][67][68]. The genetic diversity degree in many plants can be linked with the numbers of loci and populations [31,69], the geographical range size [70] and genetic exchange [71]. In the current research, the nine SSR markers showed a high degree of genetic diversity in P. vietnamensis populations and expected heterozygosity (H O = 0.422 and H E = 0.479) compared to some Panax species, such as P. stipuleanatus [50], P. ginseng [72,73] However, our results are in line with studies of Reunova et al. [46] on P. ginseng, the natural species of Russia (H O = 0.453 and H E = 0.393), Liu et al. [74] on P. notoginseng (H E = 0.350) and Reunova et al. [75] on P. vietnamensis (H E = 0.55) using microsatellite markers. High levels of genetic diversity in three P. vietnamensis populations, TN, DT and KT indicated that this species is predominantly outcrossed. High gene flow (N m > 1) may be a consequence of high outcrossing rates in the three populations. Dispersal of pollen grains by insects might be considered as a major factor for this species. Positive fixation index values were detected in all P. vietnamensis populations and showed a deficit of heterozygosity due to inbreeding. This suggests small neighborhood size and mattings between siblings within populations. Our results also showed a sign of the bottleneck in all three studied populations (p < 0.005). Significant heterozygosity deficits were detected in three populations (TN, DT and KT) under TPM and SMM (p < 0.005) models. The models suggested reduction in population size of the targeted populations. F ST is trenchant approach for measurement of gene flow in populations and genetic variation [76]. The genetic variation among P. vietnamensis populations was moderate. The low F ST value between two populations (DT and TN) can facilitate strong gene flow within populations (N m = 3.2). However, the low level of genetic variation between two populations, TN and DT (F ST = 0.072) might be due to geographical distance. These two populations are located in the same province of Quang Nam. The results of AMOVA analysis also indicated that 63.17% of variation was distributed within individuals and 12.45% among individuals within populations. These results showed a moderate genetic structure of P. vietnamensis. Genetic variation among populations is highly affected by genetic drift, gene flow, mutations, selection and long-term evolution [77]. Long lived and outcrossing species maintain high degree of genetic variation in populations and low genetic differentiation in populations, reflecting maximum levels of gene flow. Previous studies reported low differentiation between populations, and reflecting strong gene flow in P. ginseng [46,72] and P. stipuleanatus [50]. The strong gene flow among populations might be due to high outcross rates within populations. Thus, pollen grains dispersion through insects can be considered as a major factor of the population structure. The Bayesian analysis and UPGMA tree showed two different groups of genetically mixed individuals of P. vietnamensis. In the current study we had isolated population among province through geographical distance. Two populations DT and TN closed together within the province (Quang Nam) and formed a genetic cluster while the KT population in Kon Tum province was separated and formed one cluster by the geographical distance, where gene exchange between the two groups was restricted. The P. vietnamensis faced serious threats in their survival. Based on our studied results, all the studied populations can be considered for both in-situ and ex-situ conservation strategies.

Conclusions
De-novo transcriptome sequencing of P. vietnamensis was performed by the Illumina sequencing platform. We produced a large number of ESTs and identified candidate genes that differentially expressed in P. vietnamensis. A total of 11,343 EST-SSRs was identified. It is obvious from the data that the natural populations of P. vietnamensis maintained high level of genetic diversity. Numerous SSR markers were identified and will contribute to marker-assisted breeding of P. vietnamensis. This study does not only provide ground for P. vietnamensis breeding but also a platform for its conservation, to maintain genetic diversity.

Plant material
We had collected samples (roots, leaves and stems) in liquid nitrogen of P. vietnamensis (Fig. 10a) from a wild population (Quang Nam province) for RNA extraction stored at − 80°C. P. vietnamensis (ten plants) wild population (Quang Nam province) was used for EST-SSRs development to test the amplification relevancy of the synthesized EST-SSR primers (Table 8). Three different wild populations (98 Plants) of P. vietnamensis were sampled to assess structure and genetic diversity (Fig. 10b, Table 8). The wild population of P. vietnamensis was sampled during spring and summer 2019, respectively. For DNA extraction fresh leaves were desiccated in silica gel. This species was identified by Dr.

RNA extraction
Total RNA was extracted from each sample (roots, leaves and stems) by the OmniPlant RNA Kit (DNase I) for Illumina sequencing. RNA quality and quantity were validated by Nanodrop and 1.2% agarose gel electrophoresis [18]. Total RNA (equal amount of each sample) was pooled together and sent to Breeding Biotechnologies Co., Ltd., for transcriptome sequencing using Illumina HiSeq™ 4000.

Transcriptome sequencing and De novo assembly
Cleaned mRNA was used for cDNA library construction extracted from 200 μg of total RNA using Oligo (dT). The cDNA 1st strand was prepared from random hexamers using mRNA as a template and the other strand was from buffer, dNTPs, RNase H and DNA polymerase I, and then cleaned with AMPure XP beads. The cleaned double-stranded cDNA was subjected to terminal repair, the sequencing linker was ligated and then the fragment size was selected with AMPure XP beads. The cDNA library was acquired by PCR enrichment. After library validation on a BioAnalyzer (Agilent 2100), Breeding Biotechnologies Co., Ltd. sequenced the cDNA libraries on a MiSeq (Illumina HiSeq™ 4000). The Trimmomatic v3.0 [78] was used for raw reads filtration. The reads showing adaptor contamination, length < 36 bp and low quality value (quality < 20) higher than 15% were eliminated. Trinity [79] with default parameters were used for de novo assembly of the cleaned reads. Then the TIGR Gene Indices clustering tool (TGICL) v2.1 [80] was used to cluster and eradicate redundant transcripts, and identified unigenes for further analysis.

Development, detection of EST-SSR markers and primer design
The assembled P. vietnamensis transcriptome was mined by MISA (Microsatellite identification tool) for markers. The candidate SSRs from 2 to 6 nucleotides range were defined as for dinucleotides, 6 repeats and for all higher order motifs, 5 repeats, according to Jurka and Pethiyagoda [90].  [18].

Population genetic analysis
To evaluate population structure and genetic diversity of P. vietnamensis precisely, 101 polymorphic SSR markers were carefully chosen and 20 primer pairs were successfully amplified for DNA fragments. Among the selected SSR markers, nine markers had clear and reproducible profiles, and were selected for study (Table 2). DNA isolation kit (Norgenbiotek, Canada) was used for genomic DNA extraction. The samples were crushed by Mixer mill MM 400 (Retsch) in liquid nitrogen. DNA quality and concentration were validated according to Vu et al. [18]. The concentration was then diluted to 10 ng/μl. PCR was executed in 25 μl volume including 2 μl of genomic DNA (total 10 ng), 12.5 μl Master Mix 2X, 1 μl of each 10 μM primer, and 9.5 μl H 2 O deionized. The reaction was amplified in the thermal cycler conditions: an initial denaturing at 94°C for 3 min, 40 cycles for 1 min at 94°C, 30s at 54-56°C annealing temperature for primer pair (each) and 1 min extension at 72°C and 10 min at 72°C for final cycle before holding the samples at 4°C till analysis. The Sequi-Gen®GT DNA electrophoresis system were used for amplification products separation with 8% polyacrylamide gels in 1 x TAE buffer and then visualized by GelRed™ Nucleic Acid Gel Stain. Alleles size was detected by Gel-Analyzer software of GenoSens 1850 (Clinx Science Instruments Co., Ltd) with a 20 bp DNA ladder (Invitrogen).
Genetic parameters were analyzed on the GenAlEx [92], with the proportion of polymorphic loci (P), effective alleles (A E ), the number of alleles per locus (A). Observed heterozygosities (H O ), expected heterozygosities (H E ), the fixation index (F), the differentiation index between pairwise populations (F ST ), the matrix of F ST between various populations and gene flow (Nm) was calculated by the formula Nm = [(1/Fst) -1]/4 [93] Polymorphism information content (PIC) value was calculated according to Botstein et al. [94] Tests for deviation from Hardy-Weinberg equilibrium at the locus (each) and disequilibrium in the linkage for each locus pairwise combination in each population were performed by Genepop v.4.6 [95]. Testing of recent bottleneck events for population (each) via the SSM and TPM were tested using Botteneck v.1.2 [96]. The significance of these tests was measured by the one-tailed Wilcoxon signed rank test. The proportion of the SSM was set to 70% under default settings. The genetic distances among populations were also calculated using GenAlEx. The significance of F ST values in each population pair across all loci was tested by applying the sequential Bonferroni correction.
The data were subjected to AMOVA using Arlequin 3.1 and significance test was applied on a basis of 10,000 permutations [97]. The UPGMA approach was used to determine the genetic association among population by Poptree2 [98]. STRUCTUREv.2.3.4 was used to explore population structure with Bayesian clustering approach [99]. The admixture model was set with correlated allele frequencies i.e. in the data set (K), in different groups, ten separate runs were employed for K within 1 and 15 at 500,000 Markov Chain Monte Carlo (MCMC) repetitions and at 100,000 burn-in the period. Structure Harvester [100] was used for the group (number) detection that best fits in the dataset based on the ΔK according to Evanno et al. [101].
Additional file 1: Figure S1. The Delta K distribution graph.
Abbreviations AMOVA: Analysis of molecular variance; COG: Clusters of orthologous groups; EST-SSRs: Expressed sequence tag-simple sequence repeat; GO: Gene ontology categories; KEGG: Kyoto encyclopedia of genes and genomes pathways; UPGMA: Unweighted pair group method of arithmetic average; SSM: Stepwise mutation model; TPM: two phase model