Comparative Chloroplast Genomics of Four Pilea Species (Urticaceae): High Levels of Sequences Divergence Provides New Insight into Interspecic Diversity in Pilea

Pilea is a genus of perennial herbs from the family Urticaceae, which are used for courtyard ornamental. For some species, they are used as medicinal plants in traditional Chinese medicine as well. The morphological characteristics of medicinal species from Pilea are similar, and it is dicult to accurately distinguish them based only on morphological characteristics. Besides, the species classication of Pilea are still controversial. The classication of many species are still in an unresolved state. At present, there is no information about the chloroplast genomes of Pilea, which limits our further understanding of this genus. Here, we rst reported 4 chloroplast genomes of Pilea taxa (P. mollis, P. glauca, P. peperomioides and P. serpyllacea), and performed comprehensive comparative analysis.


Results
The four chloroplast genomes have similar structural characteristics and gene order with other angiosperms. These genomes all have a typical quartile structure, which contains 113 unique genes, including 79 protein-coding genes, 4 rRNA genes, and 30 tRNA genes. Besides, we detected SSRs and repeat sequences, and analyzed the expansion/contraction of IR regions. In particular, the comparative analysis showed a rather level of sequence divergence in the non-coding regions, even in the proteincoding regions of the four genome sequences, suggesting a high level of genetic diversity in Pilea. Moreover, we identi ed eight hypervariable regions, including petN-psbM; psbZ-trnG-GCC; trnT-UGU-trnL-UAA; accD-psbI; ndhF-rpl32; rpl32-trnL-UAG; ndhA-intron and ycf1, are proposed for use as DNA barcode regions. Phylogenetic analysis showed that four Pilea species form a monophyletic cluster with a 100% bootstrap value.

Conclusion
The results obtained here could provide abundant information for the phylogenetic position of Pilea and further species identi cation. High levels of sequences divergence promote our understanding of the interspeci c diversity of this genus, also provide reference for the rational classi cation of unsolved species in the future.

Background
The Pilea plants are a perennial herb from the family Urticaceae, which mainly distributed in tropical and subtropical, some species distributed in warm temperate regions as well. Pilea is a species-rich genus, which is the largest one in the family Urticaceae, and also a larger genus among angiosperms [1]. The leaves of many species in Pilea have color spots, which can be used for garden cultivation with ornamental purposes (e.g. P. cadierei and P. mollis). They are often the main plant groups in the shady and humid environment of the garden landscape. On the other hand, in Chinese traditional pharmacopeia, several species are recorded as medicinal plants from which a variety of pharmacologically active substances can be extracted [2][3][4]. For example, P. peperomioides is recorded in "Dai medicine" for antiin ammatory and detoxifying, also used for erysipelas and bone setting. However, it's a group that gets little attention, and there are also little reports about Pilea. According to the pharmacopeia, different species have different pharmacological ingredients and medicinal values. Moreover, as many species are similar in morphology, species and population accurate identi cations are particularly important for the rational usage of these medicinal plants based on molecular markers.
The genus Pilea is also a controversial group in traditional taxonomy, previous studies suggesting that Sarcopilea also belongs to this genus [5]. With little attention from experts and scholars, many species being ignored or unresolved, it is di cult for the revision of this species-rich genus. Moreover, some new species have been reported in recent years [6,7]. It is necessary for us to further study the phylogenetic status of Pilea plants. However, relatively little research has been reported on this genus, especially in the eld of molecular biology and genomics. Though some researchers have used molecular methods to explore phylogenetic relationships within the genus Pilea [1] and its phylogenetic position in the family Urticaceae [5], the selected DNA fragments are one-sided and partial with low bootstrap supports values, which has certain limitations.
The chloroplast is a kind of organelle involved in photosynthesis [8] and energy transformation in plants and algae [9,10]. Chloroplast genome (referred to as cp genome in the following text) encodes many key proteins that play essential roles in photosynthesis and other metabolic properties [11]. The cp genome sequences have unique characteristics, such as monophyletic inheritance [12], conserved coding region sequences, and a typical quartile structure of genomes. Due to its conservative genome structure and contents, the cp genomes has become an ideal model to resolve plant phylogenies and evaluate biodiversity [13]. Although the cp genomes are relative conserved compared to the nuclear genomes and mitochondrial genomes [14], it also contains highly variable regions that are widely used as molecular markers [15][16][17][18]. Chloroplast genomes can provide many molecular markers for species identi cation in plants, such as chloroplast simple sequence repeats (SSRs), exhibiting great potential in species identi cation [19,20]. Unfortunately, no chloroplast genomes of Pilea plants have been reported so far.
Here, we rst sequenced, assembled and analyzed the cp genomes of four Pilea plants. They are common ornamental or medicinal plants (P. mollis, P. glauca, P. peperomioides and P. serpyllacea). In this study, our main works are as follows: (1) we rst sequenced and assembled cp genomes of Pilea plants; (2) we analyzed the structural characteristics and sequence divergence of cp genomes in Pilea; (3) we identi ed SSR loci and repeat sequences for further studies on population genetic structure; (4) we inferred the phylogenetic status of Pilea in Urticaceae based on the complete cp genome sequences; and (5) we identi ed the hypervariable regions which could be used as DNA barcodes for identi cation of this genus. Using Illumina Hiseq sequencing platforms, 5.38-5.89 G clean data from each Pilea species were  obtained, with the number of clean reads are ranged from 17,935,118 to 19,627,967 (Additional File 1:  Table S1). The chloroplast was then assembled based on these data. Genome annotation

General features of cp genomes
The cp genomes of four Pilea species all comprises 133 genes, among which, 113 are unique genes, including 79 protein-coding genes, 4 rRNA genes and 30 tRNA genes (Additional File 1: Table S2). The gene order and gene numbers of these four species are highly similar, showed conserved genomic structures. Figure 1 shows the schematic diagram of the cp genomes of Pilea. Introns play a signi cant role in selective gene splicing [21]. Among the 79 protein coding genes annotated, nine unique genes (rps16, rpoC1, atpF, petB, petD, rpl16, rpl2, ndhB, ndhA) contain one intron and two unique genes (ycf3, clpP) contain two introns. Six unique tRNA genes (trnK-UUU, trnG-UCC, trnL-UAA, trnV-UAC, trnI-GAU, trnA-UGC) contain one intron. There are seven protein-coding genes, four rRNA genes, and seven tRNA genes completely duplicated in the IR regions, so they have two copies. The gene rps12 is a trans-spliced gene, the 5' end is located in LSC region. However, the 3' end was found in the IRa and IRb region. These results are similar to other species in the nettle family [22].

Repeats analysis
Simple sequence repeats (SSRs), also referred to as the microsatellite sequences, provide a large amount of genetic information [23][24][25]. Because of its high genetic polymorphism, SSRs are often used for the development of molecular markers and play an important role in the identi cation of species [26,27]. In this study, we detected 68, 75, 71, 80 SSRs in the 4 analyzed species, respectively ( Fig. 2A, Additional File 1: Table S3). Most SSRs are mononucleotide homopolymers, particularly for A/T, which accounts for 70.75% of the total. Hexanucleotide repeats are detected only in P. mollis. These SSR showed high polymorphism, suggesting great potential in the identi cation of Pilea species.
In the cp genomes of Pilea species, we detected four types of interspersed repeats. Most of them are forward repeats and palindromic repeats (Fig. 2B). By contrast, there are only two reverse repeats and one complement repeats. The only complement repeats were found in P. peperomioides. The detailed sequences showed in Additional File 1: Table S4. Moreover, the length of these short interspersed repeats mainly ranged from 30 to 34 bp. We note that one forward repeats with a length of 102 bp, (detected in P. serpyllacea, R13). These longer interspersed repeats thought to be essential for promoting cp genome rearrangements [28,29]. Whether or not these repeats have caused the rearrangement of the cp genomes of Pilea species are interesting questions.

Contraction and expansion analysis of IR regions
The contraction and expansion of IR regions are considered to be an important reason for the length diversity in cp genomes [30]. Besides, with the expanded/contracted of the IR regions, genes near the border have an opportunity access to IR or single-copy regions [31]. We retrieved the published cp genomes of six species from Urticaceae, and compared them with the four Pilea species. We observed several genes span or near the boundary of IR and single copy regions. They mainly are rps19, rpl22, rpl2, ycf1, ndhF and trnH (Fig. 3). It's worth noting that, an abnormal expansion of IR regions was observed in Gonostegia hirta. The IR regions are over 30,000 bp in G. hirta and more genes access to the IR regions (e.g. rpl36 and rps19). However, the length of IR regions in the other nine species are about 25,000 bp, and rps19 gene span the LSC/IRb boundary except for D. iners and H. tenella. The former rps19 gene is in LSC region, while the latter is completely in IR regions. Besides, trnH gene completely accesses to IR regions in H. tenella, obtained two copies. It can be seen that the genomic structure, gene order and numbers of some species in Urticaceae have changed obviously.
Furthermore, gene ycf1 crosses the SSC/IRa boundary, most of which located in the SSC region. The length of ycf1 gene in the four Pilea species varies widely, indicating the possibility of sequence differences. Surprisingly, we annotated two copies of ycf1 in the four Pilea plants, they cross IRb/SSC boundary and are not annotated in other species. Sequence alignment found that the two copies of ycf1 are existed in other taxa, indicating that previous annotation are imperfect, although one of the two copies is a fragment of ycf1, and are generally considered to be a pseudogene. Interestingly, except for E. dissectum, a small fraction of the ndhF gene (less than 100 bp) crosses the IRb/SSC regions, which means that fragments of ycf1 have an overlap with ndhF in Pilea species. The overlapping areas are 108 bp. These results are also observed in Arabidopsis, the overlaps are about 30 bp [32]. Whether or not these overlaps affect the transcription or translation of these proteins is also an interesting subject.

Genome divergence
To evaluate genomic divergence, a sequence identity analysis based on mVISTA [33,34] was performed between 4 Pilea species, with the reference being the cp genome of P. peperomioides. We observed varying degrees of sequence divergence, especially in LSC and SSC regions. In contrast, the IR regions are more conservative. Most of these highly variable regions are observed in Conserved Non-Coding Sequences (CNS) (Fig. 4). However, the regions with the greatest sequence divergence are found in protein-coding regions, which is gene ycf1. The coding regions of ycf1 in the four Pilea species showed signi cant differences, and the similarity are even less than 50% in some fragments. Overall, the analyzed genomic sequences showed high levels of sequence divergence, suggesting a high level of genetic diversity in the genus Pilea.
To quantify the levels of DNA polymorphism, the 4 cp genomes were aligned and analyzed by using DnaSP v6.0 [35]. We detected 8 hypervariable regions with Pi value exceed 0.06 (

Nucleotide variations in protein-coding genes
The protein-coding regions are highly conserved in cp genomes [36]. We analyzed the protein-coding sequences of 79 identi ed unique orthologous genes in 4 Pilea taxa. Surprisingly, these protein-coding genes also showed high levels of variation (Fig. 6A, Additional File 1: Table S5). Of the 79 shared genes, 63 had a mutation rate of more than 2%, and 30 had a mutation rate of more than 4%. The gene with the highest mutation rates is ycf1 (16.62%), then followed by matK (10.54%), ccsA (8.74%) and rps15 (8.42%). Only two genes (psbJ and psbL) showed extreme conservation without any variable sites. Moreover, we observed a total of 11 genes (ycf1, ndhF, rps19, accD, rpoC2, rps16, rpoA, rpl20, ndhD, rpoC1 and ycf2) with InDels in nucleotide sequences by using DNASP 6.0 [35]. Among which, ycf1 had 35 InDels, then followed by ycf2 (9), accD (4) and rpoC2 (3). Considering that the protein-coding regions are highly conserved, protein-coding sequences with high nucleotide mutation rates are usually infrequent in the same genus, these results indicated that the Pilea plants had high genetic diversity and there are great differences among species.
In this study, synonymous (dS) and nonsynonymous (dN) substitution rates, along with dN/dS were estimated for the 79 shared genes in parallel by using PAML v4.9 [37]. Among the 79 genes, ycf1, matK, ccsA and rps15 had relatively higher dN values, and rps16, rpl32, ndhF and psaJ had higher dS values (Fig. 6B, Additional le 1: Table S6). Most genes exhibited considerably low values of dN/dS (less than 0.6), implying most of the protein-coding genes were under purifying selection during evolution. However, the dN/dS ratio of three genes (rpl36, clpP and accD) was between 0.6 and 1.0. Moreover, the dN/dS ratio was greater than 1.0 for petL, rps12, ycf1 and ycf2, indicating that they have been under positive selection during evolution. This results clearly indicate that cp genes in different plant species of Pilea may be subjected to diverse selection pressures.

Phylogenetic analysis
Compared to nuclear and mitochondrial genomes, the cp genomes are highly conserved, and it has been widely used in phylogenetic and evolutionary studies [38][39][40]. With the development of high throughput sequencing technology, the chloroplast genome sequence plays an important role in species identi cation as a super barcode [41,42]. In this study, we constructed the maximum likelihood (ML) trees by using the complete cp genome sequences as the data sets (detailed materials are shown in Additional File 1: Table S7). The phylogenetic tree has high bootstrap supports in all nodes, shows the reliability of the phylogeny recovered (Fig. 7).
Our phylogenetic tree displayed two clades clearly, and then further diversi ed into four subclades with 100% bootstrap supports (ML). These four subclades correspond to four subfamilies, they are Boehmerioideae, Cecropioideae, Lecanthoideae and Urticoideae, respectively. This is consistent with the traditional classi cation [5]. All the 4 Pilea species clustered together (all nodes have BS = 100 for ML), and form a monophyletic group, which is a sister group to Elatostema. They all are belonging to subfamily Lecanthoideae.

Discussion
Conserved genome structure and gene content We rst reported the cp genome sequences of four Pilea species. Our assembly results showed that the length of 4 cp genomes ranged from 150,398 bp to 152,327 bp. This result is similar to most Urticaceae plants [43,44]. In our study, the longest and shortest cp genome sequences in Urticaceae are 159,085 bp (Gonostegia Hirta) and 146,842 bp (Hesperocnide Tenella), respectively. This suggests that the cp genomes of Urticaceae may have undergone different evolutionary processes. Among our four Pilea taxa, the longest genome sequence is P. peperomioides (152,327 bp) and the shortest is P. serpyllacea (150,398 bp). Structurally, they are highly similar to most angiosperms, and we didn't detect gene gain/loss, suggesting that cp genomes are still relatively conserved.
In this study, we detected SSRs and repeat sequences in the four cp genomes. Of the total 294 SSRs, 215 are mononucleotide repeats, accounting for the majority of all SSRs (73.13%). These mononucleotide repeats are mainly A/T repeats, and they have a signi cant impact on the overall G/C content of the genomes [45,46]. These SSR sequences are often composed of simple repeating units such as polyadenine (PolyA) or polythymine (Poly T) repeats. With the length polymorphism in different species, they are often used as molecular markers. There are abundant SSR loci in cp genomes, which have been applied in species identi cation [47,48].
The variation of IR regions is a common phenomenon in angiosperms. Compared with the overall absence of one IR region [49][50][51], the expansion/contraction of IR regions are more common in angiosperms [52,53]. By comparative analysis, we found that G. hirta signi cantly expanded the IR regions, which also led to an increase in the overall length of the cp genomes.

High levels of sequences divergence reveals interspeci c diversity in Pilea
In our chloroplast comparative genomics analysis, we compared the whole cp genome sequences based on mVISTA. Speci cally, we also calculated the percentage of variable sites and estimated the ratios of dN/dS rates among 79 orthologous protein-coding genes. Similar to most angiosperms, the non-coding regions of Pilea plants showed higher polymorphism than coding regions. Surprisingly, we also found high levels of sequence differences in the coding regions of Pilea plants. Of the 79 orthologous genes identi ed, 63 had a mutation rate of more than 2%, and 30 had a mutation rate of more than 4%. This is rare in other genera, because usually only the ycf1 gene had a high mutation rate [54]. The mutation rate of ycf1 gene in the four Pilea plants is an astonishing 16.62%. Also, a total of 35 InDels were detected, including a large fragment insertion in P. peperomioides (177 bp, data not shown). These InDels caused an increase in the length of the ycf1 gene in P. peperomioides. In addition, unusually high nucleotide mutation rates also observed in matK, ccsA and other genes.
In general, dN changes are subject to bidirectional effects of varied mutation rates and selective constraints. The ratio of dN/dS greater than 1 is thought to be a sign that the gene has experienced selection pressures. In our study, the dN/dS ratio indicates that four genes (petL, rps12, ycf1 and ycf2) may have undergone positive selection. The rapid evolution of protein-coding genes is closely related to the adaptive evolution of species [55,56], indicating that Pilea plants may have experienced a rapid evolutionary process, result in species-rich of Pilea.  In particular, the gene ycf1 with a large number of InDels can be used as speci c molecular markers, which is of great signi cance for us to correctly identify and rationally utilize the medicinal taxa from this genus.
Phylogenetic status of Pilea based on cp genome sequences Moreover, the phylogenetic status of Pilea in Urticaceae was analyzed based on the complete cp genomic sequences. In a one-sided analysis based on chloroplast genome sequences, Pilea and Elatostema are sister groups of each other, both belonging to the subfamily Lecanthoideae. This is consistent with the results of traditional classi cation studies [5]. However, due to the matrilineal inheritance of the chloroplast genomes [59], the results are restricted. Accurate phylogenetic relationships need a comprehensive analysis of nuclear and organelle genes [60]. Furthermore, the relationships between Pilea and other plants of family Urticaceae need more cp genome sequencing in the future.

Conclusions
In this study, four cp genomes of Pilea plants were sequenced, assembled and annotated, which was the rst report in this genus. By comparison, we found the cp genomes of four species have similar structural characteristics, and a typical quartile structure similar to that of most angiosperms. Unusually, the genome sequences of four species, including the relatively conserved protein-coding regions, showed high levels of divergence. We glimpsed a rich genetic diversity in Pilea. The interspeci c diversity of Pilea may be more abundant than we previously thought. This is of reference signi cance for us to evaluate the number of species in this genus. In summary, we provide 4 high quality chloroplast reference genomes, and the results obtained here provide valuable information for the understanding of the genetic diversity and contribute to the resource utilization of Pilea plants in the future.

Plant material, DNA extraction and Sequencing
The fresh leaves of four Pilea plants were collected from Guangzhou, Kunming and Suqian, respectively. All the samples were saved at the Herbarium of Southwest University, Chongqing, China. The detailed information for the plant samples shown in Additional File 1: Table S8. The total genomic DNA was extracted by using CTAB method [61]. The DNA library with an insert size of 350 bp was constructed using the NEBNext® library building kit [62] and sequenced by using the Hiseq Xten PE150 sequencing platform. Sequencing produced a total of 5.4-5.9 G raw data per species. Clean data were obtained by removing low-quality sequences: sequences with a quality value of Q < 19 accounted for more than 50% of the total base, and sequences with more than 5% bases being "N".

Genome assembly and annotation
The de novo genome assembly from the clean data was accomplished utilizing the NOVOPlasty (v. 2.7.2) [63] with the k-mer length of 39 bp and a sequence fragment of rbcL gene from maize as the seed sequence. The correctness of the assembly was con rmed by using Bowtie2 (v2. 0.1) [64] to manually edit and map all raw reads to the assembled genome sequence under the default settings. The cp genome was annotated initially by using CPGAVAS2 [65] using the reference genome (Elatostema dissectum, GenBank: NC_047192.1). Geseq was then used to con rm the annotation results [66]. Furthermore, the annotations with problems were manually edited by using Apollo [67]. The genome maps were drawn by OGDRAW [68]. The genome sequence has been deposited in GenBank with accession numbers: MT726015 to MT726018.

Analysis of nucleotide substitution rate
The protein-coding sequences in the previous step were processed in parallel. We used the CODEML module in PAML v.4.9 [37] to estimate rates of nucleotide substitution, including dN (nonsynonymous), dS (synonymous), and the ratio of nonsynonymous to synonymous rates (dN/dS). The detailed parameters were: CodonFreq = 2 (F3 × 4 model); model = 0 (allowing a single dN/dS value to vary among branches); cleandata = 1 (remove sites with ambiguity data); other parameters in the CODEML control le were left at default settings. The phylogeny tree structure of each genes were generated by using the Maximum Likelihood (ML) method implemented in RaxML (v8.2.4) [76].

Declarations
Availability of data and materials The annotated chloroplast genome sequences of four Pilea plants were deposited in GeneBank (https://www.ncbi.nlm.nih.gov/) with Accession number: MT726015 to MT726018. All the samples are saved at the Herbarium of Southwest University, Chongqing, China. All other data and material generated in this manuscript are available from the corresponding author upon reasonable request.

Ethics approval and consent to participate
The four collected Pilea species are widely distributed in China as ornamental plants. Experimental researches do not include the genetic transformation, preserving the genetic background of the species used, and any other processes requiring ethics approval.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests   Comparison of the cp genomes in the 4 Pilea species by using mVISTA. The genes were represented as grey arrows on the top of the alignment. Different regions are labeled with different colors. The pink regions are "Conserved Non-Coding Sequences" (CNS), the dark blue regions are exons, and the light-blue regions are tRNA or rRNA. 50% and 100% refer to the similarity among sequences. Gray arrows above the aligned sequences represent genes and their orientation. Nucleotide diversity (Pi) of cp genomes among the 4 Pilea species. Each black dot represents the nucleotide diversity per 500 bp. There were seven intergenic regions (petN-psbM, 0.06067; psbZ-trnG-GCC, 0.07067; trnT-UGU-trnL-UAA, 0.06433; accD-psbI, 0.06003; ndhF-rpl32, 0.06100; rpl32-trnL-UAG, 0.06800; ndhA-intron, 0.06533) and one protein-coding region (ycf1, 0.07367-0.17067) had Pi values greater than 0.06.

Figure 6
Sequence polymorphism among 79 shared cp genes of 4 Pilea species. A. Percentages of variable sites in 79 shared protein-coding genes. We used MEGA 6.0 to calculate the percentages of variable sites.
Label the three genes with the highest mutation rates with *, they are ycf1 (16.62%), matK (10.54%) and ccsA (8.74%). B. The estimations of nonsynonymous (dN), synonymous (dS) substitution rates and dN/dS of 79 shared protein-coding genes. Label the four genes with the highest dN/dS values with *. Figure 7 Phylogenetic relationships of species from Urticaceae inferred using Maximum likelihood (ML) method.
The phylogenetic tree constructed using the complete cp genome sequences among the 25 cp genomes.
The number at the bottom of the scale, 0.01, means that the length of the branch represents the replacement frequency of bases at each site of the genome at 0.01. Bootstrap values were calculated from 1000 replicates. Two nodes, which have BS < 100 for ML were marked speci cally. The orange stars indicates the cluster of Pilea. Two taxa from Moraceae, namely, M. indica and F. carica were used as outgroups.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.