Comparative analyses of chloroplast genomes from 22 Lythraceae species: inferences for phylogenetic relationships and genome evolution within Myrtales

Background Lythraceae belongs to the order Myrtales, which is part of Archichlamydeae. The family has 31 genera containing approximately 620 species of herbs, shrubs and trees. Of these 31 genera, five large genera each possess 35 or more species. They are Lythrum, with 35; Rotala, with 45; Nesaea, with 50; Lagerstroemia, with 56; and Cuphea, with 275 species. Results We reported six newly sequenced chloroplast (cp) genomes (Duabanga grandiflora, Trapa natans, Lythrum salicaria, Lawsonia inermis, Woodfordia fruticosa and Rotala rotundifolia) and compared them with 16 other cp genomes of Lythraceae species. The cp genomes of the 22 Lythraceae species ranged in length from 152,049 bp to 160,769 bp. In each Lythraceae species, the cp genome contained 112 genes consisting of 78 protein coding genes, four ribosomal RNAs and 30 transfer RNAs. Furthermore, we detected 211–332 simple sequence repeats (SSRs) in six categories and 7–27 long repeats in four categories. We selected ten divergent hotspots (ndhF, matK, ycf1, rpl22, rpl32, trnK-rps16, trnR-atpA, rpl32-trnL, trnH-psbA and trnG-trnR) among the 22 Lythraceae species to be potential molecular markers. We constructed phylogenetic trees from 42 Myrtales plants with 8 Geraniales plants as out groups. The relationships among the Myrtales species were effectively distinguished by maximum likelihood (ML), maximum parsimony (MP) and Bayesian inference (BI) trees constructed using 66 protein coding genes. Generally, the 22 Lythraceae species gathered into one clade, which was resolved as sister to the three Onagraceae species. Compared with Melastomataceae and Myrtaceae, Lythraceae and Onagraceae differentiated later within Myrtales. Conclusions The study provided ten potential molecular markers as candidate DNA barcodes and contributed cp genome resources within Myrtales for further study. Electronic supplementary material The online version of this article (10.1186/s12870-019-1870-3) contains supplementary material, which is available to authorized users.

natans as edible food, Heimia myrtifolia as an important medicinal plant [9] and Lawsonia inermis as a natural dye. Overall, the species of Lythraceae have high economic and ornamental value and are widely used in horticulture [10,11].
Past studies of Lythraceae have concentrated on morphology [2,12], palynology [13,14] and anatomy [15]. However, these studies did not distinguish the intraspecific relationship within Lythraceae. More recently, to deepen our understanding of the relationship among Lythraceae species, the modern branch method was used to make a preliminary estimate of the phylogeny within Lythraceae species [16]. Based on rbcL genome data, the psaA-ycf3 spacer in the cp genome and the ITS sequence of the nuclear ribosomale DNA, the phylogenetic relationship within Lythraceae ware preliminarily inferred [17]. These two noncoding regions improved the resolution between species in an rbcL bifurcation diagram [17]. However, due to the use of certain DNA fragments, these studies lead to incomplete conclusions. Complete cp genomes will provide better solutions to relationship reconstruction within Lythraceae and allow exploration of its phylogenetic position within Myrtales.
The chloroplast is an essential organelle for land plants [18], and is mostly inherited maternally [19]. The cp genome usually consists of a two-stranded DNA molecule, and most cp genomes are 120-220 kb in length with 120-140 coding genes [20,21]. The cp genome usually has four parts: a large single copy (LSC) region, a small single copy (SSC) region, and two copies of the inverted repeat region (IRA and IRB). Because the cp genome is more conserved and shorter in length than the nuclear and mitochondrial genomes, some cp genome sequence have been used to distinguish species and conduct phylogenetic studies [22][23][24][25]. An increasing number of cp genomes have recently been reported because complete cp genome sequences provides better data to distinguish marginal taxa, especially below the species level.
In this study, we report six newly sequenced Lythraceae cp genomes and compare them with those of 16 other species within Lythraceae including nine published cp genomes (P. granatum, H. myrtifolia, Lagerstroemia fauriei, Lagerstroemia floribunda, Lagerstroemia guilinensis, Lagerstroemia indica, Lagerstroemia speciosa, Lagerstroemia subcostata and Lagerstroemia intermedia) downloaded from GenBank and seven unpublished Lagerstroemia cp genomes (Lagerstroemia excelsa, Lagerstroemia limii, Lagerstroemia villosa, Lagerstroemia siamica, Lagerstroemia tomentosa, Lagerstroemia venusta and Lagerstroemia calyculata). Our objectives were as follows: (1) To detect differences between the cp genomes of 22 Lythraceae species; (2) to select 10 highly variable regions to act as candidate barcodes for identifying related species of Lythraceae; (3) to reconstruct phylogenetic relationships to verify branch relationships within Lythraceae and explore its status in Myrtales.

Codon usage
A total of 79 coding genes were used to estimate the codon usage frequency. They were encoded by 25,068 (L. indica) to 27,111 (L. guilinensis) codons. The termination codons were UGA, UAG and UAA. For the 22 species, the GCU encoded alanine had the highest RSCU value and the UAC encoded tyrosine had the lowest at approximately 0.45. Among most of the 22 Lythraceae species, the AAA encoded lysine had the highest number of occurrences, at more than 1000. This result was also reported in the cp genomes of H. myrtifolia, Aquilaria sinensis, Epipremnum aureum and Papaver rhoeas [9,[27][28][29]. The RSCU results ( GC guanine-cytosine, LSC large single-copy region, SSC short single-copy region, IRs inverted repeats had a higher nucleotide frequency than G or C in the third codon position. It is often the case in terrestrial species that the third codon position prefers A/T over C/G, and the richness of A/ T in the IR regions may be the main reason [30,31].

Comparative genomic analysis within 22 Lythraceae species
Taking the annotation of L. excelsa as a reference, MVISTA was carried out with the cp genome sequences of 22 Lythraceae species. After the 22 cp genomes were pair wise compared, we found that the similarity between the sequences was rather high. From Fig. 2, it is apparent that the 14 Lagerstroemia species are separated from the eight other Lythraceae species. The divergence among the 14 Lagerstroemia species was low. The LSC and SSC regions had more variation than the IR regions, and the noncoding regions had greater differentiation than the coding regions. Some regions contained more variation, such as ndhF, ndhH, matK, ycf2, rpl22, accD, rpoB, rbcL, psbK among the coding genes and psbM-trnD, trnI-trnA, ndhF-rp132, rp132-trnL, ndhD-psaC, atpA-atpF, trnI-GAU intron, trnK-rps16, trnH-psbA among the intergenic regions (Fig. 2). Similar divergence levels were measured for these regions previously [32,33]. Compared to the LSC and SSC regions of the 22 cp genomes, the IR regions were most conserved in terms of the sequence and number of genes. However, large variations also existed in connections between the IR, LSC and SSC regions. Inversion and translocation were not detected in the compared genomes. IR amplification and contraction were the main reasons for the difference in the size of these 22 cp genomes.
Significant differences in evolutionary rates were present among the genes across the 22 Lythraceae species analyzed. Overall, the mean Ka/Ks were less than 0.5 for most genes (92.21%). 17 genes showed Ka/Ks higher than 1 for at least one species. Among the 17 genes, seven genes (rbcL, psbJ, rpl2, rpl20, rpl23, ccsA and ycf4) presented these high rates for at least 15 species. The results showed that the seven genes may be under positive selection. Seven genes associated with photosynthesis (psbN, psbI, psaC, atpH, petD, psbD and psbM) showed the lowest rates of evolution (mean Ka/Ks = 0 to 0.0057), and showed uniform rates in most species evaluated. The Ka/Ks of psbN, psbI, psaC and atpH were 0 because there were no non-synonymous substitutions (Additional file 3: Table S3).
In order to detect a possible evolutionary rate acceleration in particular phylogenetic branches, We analyzed three genes with most variable Ka/Ks, namely rpl23 (large subunit of ribosome), rbcL (large subunit of rubisco) and ycf4 (genes of unknown function). Since the Ka/Ks in comparison among 14 Lagerstroemia species were almost 0, we compared the Ka/Ks at rpl23, rbcL and ycf4 in comparison of 14 Lagerstroemia species and the remaining eight Lythraceae species. For the rpl23 gene, the Ka/Ks ranged from 0.891 to 1.8077 except for the comparison with D. grandiflora. There was no non-synonymous substitution between Lagerstroemia species and D. grandiflora in addition to L. excelsa. As seen in the phylogenetic tree, the relationship between the D. grandiflora and the 14 Lagerstroemia species was closer than the other seven Lythraceae species. For the rbcL gene, the Ka/Ks ranged from 0.1119 to 0.3849, which may be due to a low Ks value (0.0046-0.0177). For the ycf4 gene, in addition to the comparison with W. fruticosa (2.4259-2.8340), the ratio of Lagerstroemia species and other seven Lythraceae species ranged

Contraction and expansion of inverted repeats (IRs)
The genomic structure, including the number and sequence of genes, was highly conserved among the 22 Lythraceae species. However, there were structural changes in the IRA and IRB boundaries (Fig. 3). Although the IR region is more conserved than the other regions, the enlargement and contraction of IR boundaries played a major role in genome size [34][35][36].   gene. The distance between rps19 and the IRA-LSC boundary ranged from 3 bp to 279 bp. Except for the 14 Lagerstroemia species and W. fruticosa, the IRA-SSC boundary was embedded in the ndhF encoding gene and had a length of 7 bp (W. fruticosa) to 158 bp (L. guilinensis) in the IRA region. For the other 7 Lythraceae species, ndhF was located on the right side of the IRA-SSC at a distance of 28 bp to 55 bp from the boundary. For all species, the SSC-IRB boundary was located in the ycf1 gene with a length of 1062 bp to 2252 bp in the IRB region, causing a ycf1 pseudogene in the IRA region with a corresponding length. The trnH-GUG noncoding gene was located on the right side of the IRB-LSC boundary ranging from 69 bp to 75 bp at a distance of 0 to 33 bp from the IRB-LSC boundary.

Long repeat structure analysis
Twenty-two Lythraceae species had 383 long repeats of four types. Eighteen species had only forward and palindromic repeats, and only T. natans had all four kinds of repeats. L. indica had the largest number of repeats, including 22 forward and six palindromic repeats. W. fruticosa, P. granatum, L. salicaria and P. granatum had only seven long repeats. As a whole, H. myrtifolia and the 14 Lagerstroemia species had more long repeats than D. grandiflora, T. natans, L. salicaria, L. inermis, P. granatum, W. fruticosa and R. rotundifolia (Fig. 4a, Additional file 4: Table   S4). The copy length ranged from 30 bp to 81 bp. Repeat sequences of 30, 31 and 41 accounted for most of the total length (Fig. 4b).
Simple sequence repeat (SSR) analysis SSRs, also called short tandem repeats or microsatellites, are made up of nucleotide repeat units 1-6 bp in length [37]. SSRs play a significant role in plant taxonomy and are widely applied as molecular markers [38,39]. There were 211-332 SSRs in each Lythraceae species that ranged from 8 to 16 bp in length (Fig. 5, Additional file 5: Table S5). Six kinds of SSRs were discovered: mononucleotide, dinucleotide, tri-nucleotide, tetra-nucleotide, pentanucleotide and hexa-nucleotide. However, hexa-nucleotide repeats were detected in only the cp genomes of L. siamica, L. intermedia, T. natans and L. salicaria. Among each Lythraceae species, mononucleotide repeats were the most common, with numbers ranging from 123 to 212; followed by trinucleotide ranging from 56 to 68; dinucleotide ranging from 16 to 52; tetranucleotide ranging from 6 to 12; pentanucleotide ranging from 0 to 2 and hexa-nucleotide ranging from 0 to 1. (Fig. 5a). It was previously found that mono-nucleotide repeats were richest in Fritillaria, Lilium and Epimedium [22,40]. As a result, mononucleotide repeats may play a more important role in genetic variation than the other SSRs. In the 22 Lythraceae species, A/T mononucleotide repeats accounted for 45.30 and 50.00%, respectively. C/G mononucleotide repeats accounted for 1.40 and 3.30%, respectively. Most of the other SSRs were composed of A/T, which may have led to the high AT content covering 62.66% of the whole cp genomes within the 22 Lythraceae species (Fig. 5b). Similar biases were also reported in Quercus [41]. Moreover, the number of A/T mononucleotide repeats in D. grandiflora, T. natans, L. salicaria, L. intermis, P. granatum, W. fruticosa, R. rotundifolia and H. myrtifolia were more than 13 Lagerstroemia species, ranging from 71 to 92/71-103. Among the 14 Lagerstroemia species, the number of A mononucleotide repeats ranged from 54 to 58, with T mononucleotide repeats ranging from 65 to 71, except in L. villosa. These results show that the A/T mononucleotide repeats numbers in the same genus are similar. However, the number of A/T mononucleotide repeats of L. villosa was 88/117, which was much higher than those of the other 13 Lagerstroemia species. We can infer that the longer intergenic spacers are the main reason.

Divergence hotspots among 22 Lythraceae species
Divergent hotspots on cp genomes can be utilized to identify closely related species and provide information about phylogeny [42,43]. The nucleotide diversity (Pi) values of the coding regions and intergenic regions of the 22 cp genomes within Lythraceae were computed using the program DnaSP 5.1. It can be seen in Fig. 6 that the values for the intergenic regions were higher than those for the coding regions, indicating that intergenic regions were more differentiated. For the coding regions, the Pi values of the IR region ranged from 0.0029-0.0144, the Pi values of LSC ranged from 0.00261-0.04547 and the Pi values of SSC ranged from 0.01254-0.04532. For the intergenic regions, the Pi value of the IR region ranged from 0.00232-0.15964, the Pi values of the LSC ranged from 0 to 0.22362 and the Pi values of the SSC ranged from 0.03567-0.17653 (Fig. 6, Additional file 6: Table S6). A total of 10 hotspots with high divergence were selected as potential molecular markers to identify related species and examine phylogeny within Myrtales.
Combining the results of DnaSP and mVISTA, we assessed the ability of 10 regions to distinguish the 22 Lythraceae species using ML trees. In the coding regions, the four most variable genes were ndhF, matK, rbcL, and rpl22. For the intergenic regions, trnK-rps16, rpl32-trnL, trnM-atpE, psbM-trnD, trnH-psbA and ndhF-rpl32 were the most variable. The regions with the greatest divergence according to their Pi values were similar to the regions obtained from the mVISTA program. Among the 10 divergent hotspots, 7 hotspots were distributed in the LSC region, and the other 3 hotspots were located in the SSC region. The IR regions were so conserved that no highly divergent hotspots were detected. According to the ML trees, trnK-rps16, ndhF, and rpl32-trnL had the highest resolution. The trnK-rps16 gene clearly separated all the genera within Lythraceae, but the 14 Lagerstroemia species could only be divided into five large branches. The ndhF gene could also divide all the genera within Lythraceae with bootstrap values of 36-100%, and it separated all 14 Lagerstroemia species. Except for the node subtending L. venusta, L. intermedia and L. speciosa with the bootstrap value of 22%, the 14 Lagerstroemia species were separated with bootstrap values of 64-100%. The rpl32-trnL gene divided all the genera except for Lythrum and Heimia, and the 14 Lagerstroemia species could only be divided into five large branches. Compared with trnK-rps16 and rpl32-trnL, ndhF had the highest resolution and was the best candidate marker for barcoding.

Phylogenetic analysis of 22 Lythraceae species with related cp genomes within Myrtales
MP, ML and BI trees were constructed based on the 66 shared protein coding genes of 50 cp genomes (Additional file 7: Table S7). These cp genomes included those of 22 Lythraceae species, 12 Myrtaceae species, The topological structures of the ML trees, MP trees and BI trees were consistent, and the four families (Lythraceae, Onagraceae, Myrtaceae and Melastomataceae) were classified into four monophyletic clades. In addition, Melastomataceae was identified as the basal group in Myrtales. The five subfamilies of the Lythraceae gathered into one clade, demonstrating that P. granatum and T. natans, formerly considered to belong to Punicaceae, and Trapaceae belong to Lythraceae. The 14 Lagerstroemia species gathered into one clade. Only two nodes with bootstrap values under 90% in the ML tree. The remaining nodes had support values of more than 92%. The bootstrap values of all nodes reached 100% in the MP tree (Fig. 7). The results showed that the Melastomataceae family, which was sister to the other families within Myrtales, was the earliest differentiating group. The next family to diverge was the Myrtaceae family, followed by the Onagraceae and Lythraceae. The 22 Lythraceae species gathered into one clade, which was resolved as sister to three Onagraceae species (Ludwigia octovalvis, Oenothera argillicola and Oenothera biennis). As a whole, the phylogenetic tree showed clear internal relationships among Myrtales species.

Discussion
Each of the 22 Lythraceae cp genomes had four conjoined structures and contained 110-112 distinctive genes consisting of 76-78 coding genes, 29-30 tRNAs and 4 rRNAs. The genome length ranged from 152,049 to 16, 0769 bp with GC content between 36.41 and 37.72%. It was clear that the 22 cp genomes were highly conserved in genome size, structure and organization, which were also consistent with the cp genomes of Melastomataceae species reported previously [26]. The largest location of variation among the 22 Lythraceae cp genomes was in the intergenic areas, which is a common phenomenon in cp genomes [10,44,45].
The slow evolutionary rate and the low Ka/Ks detected in the analyzed Lythraceae species were within expectations, and Ka/Ks varied among groups of different functional genes. As a common evolutionary pattern for photosynthetic plants, photosynthesis genes (psbN, psbI, psaC, atpH, petD, psbD and psbM) had the lowest evolutionary rates. The genes rpl2, rpl20 and rpl23 involved in replication, rbcL and psbJ involved in photosynthesis, ycf4 of unknown functions and other genes including ccsA evolved more quickly and had high Ka/Ks (≥1). The seven genes evolved faster among 22 Lythraceae species analyzed were also found in Capsicum and Sesamum indicum species [23,46]. Some genes are species-specific in terms of the rates of evolution, such as clpP gene. Although it is highly conserved in most green plants, it is by far the fastest evolving plastidencoded gene in some angiosperms. The rates of evolution in the plastid Clp protease complex are extreme different Fig. 7 The phylogenetic tree is based on 66 shared protein-coding genes of 50 species. Numbers indicated the bootstrap values from the BI (left), ML (middle) analyses and MP (right) analyses [47]. The mean Ka/Ks of the clpP gene within Lythraceae species was 0.0395, which was different from the high ratio of Ka/Ks in some plants. Williams also found that clpP1 has undergone remarkably frequent bouts of accelerated sequence evolution, which may result from the intron loss in many lineages, such as Oenothera. However, the clpP gene contained two introns across 22 Lythraceae species, which may be the reason for its low Ka/Ks. The clpP experiencing negative (purifying) selection among Lythraceae species may result from conserved lengths (591 bp). Genes under positive selection typically have large insertions of more or less repeating amino acid sequence motifs [48]. Genes under positive selection may also be bound up with a recent increase in diversification rate after adapted to novel ecological conditions [49].
The boundaries between the four cp genomes regions are important in the evolution of some taxa [50]. For example, pseudogenes such as ψycf1 or ψrps19 were produced by contraction and expansion of the IR region. The ψycf1 pseudogene exists in all 22 Lythraceae species while the ψrps19 pseudogene was absent in 4 Lythraceae species. The rps19 gene was located in the LSC regions of H. myrtifolia, W. fruticosa and D. grandiflora. In the cp genome of L. villosa, the rps19 gene was fully duplicated in IRA, as has also been reported in some Malpighiales species [51].
In previous studies, comparative analysis based on complete cp genomes was scarce due to the limited number of published cp genomes of Lythraceae species, and the phylogenetic relationships within Lythraceae were not clear. P. granatum and T. natans were placed alone in the Punicaceae family and the Trapaceae family respectively. The relationship between T. natans and the other species within Myrtales could not be confirmed because of the large morphological variation in T. natans, so DNA data were necessary to confirm the location of T. natans in Myrtales. The rbcL gene, the pasA-ycf3 spacer, and the ITS sequences have been used to establish trees and infer phylogenetic relationships within Lythraceae, and these relationships were corroborated by our results. The sister relationship between Trapa and Sonneratia was strongly supported, while the sister relationship between Trapa and Lythrum was weakly supported. Overall, the position of T. natans in the family Lythraceae was confirmed in our phylogenetic analysis. Our results further show that P. granatum belong to the Lythraceae.

Conclusion
In this study, the newly sequenced cp genomes of D. grandiflora, T. natans, L. salicaria, L. inermis, W. fruticosa and R. rotundifolia were reported and combined with those of 16 other species to compare a total of 22 Lythraceae cp genomes. The cp genomes of the 22 Lythraceae species were similar in structure, composition and gene order, showing that they are highly conserved. Three phylogenetic trees showed that 42 Myrtales species were completely divided into four branches representing four families with high bootstrap values. From previously existing cp genomes, the evolutionary history of Myrtales had been preliminarily understood. The results of this study provide additional rich genetic resources for phylogenetic research and will play an important role in further study within Myrtales.

DNA extraction of plant materials and sequencing
The fresh leaves of six species of Lythraceae within Myrtales (D. grandiflora, T. natans, L. salicaria, L. inermis, W. fruticosa and R. rotundifolia) were obtained from the nursery of Zhejiang A&F University, and then immediately stored in silica gel. A CTAB method was used to extract the genomic DNA [52]. A NanoDrop 2000 Micro spectrophotometer and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA) were employed to evaluate the concentration and quality of the extracted DNA. Following the manufacturer's instructions, the purified DNA was used to build a sequencing library. The Illumina HiSeq 2000 sequencer (Illumina Biotechnology Company, San Diego, CA) was used to obtain paired-end (PE) reads of 150 bp [9].
Chloroplast genome assembly, annotation, and structure Trimmomatic v0.3 was used to trim and filter raw reads with a Phred quality score ≤ 20. The other parameters in Trimmomatic v0.3 were set as follows: the sliding window was set to 4:15, the trailing was set to 3, the leading was set to 3 and the minlen was set to 50 [53]. CLC version 9.11 (Qiagen Company, Hilden) with default parameters was used to perform de novo assembly. Four to eight different contigs were created for each species [54]. The BLAST algorithm was used with the L. fauriei cp genome as a reference to align all contigs. The ends of each contig could be overlapped by 50 to 80 bp and combined as one large cp genome. The Re-read mapping was also conducted to validate the genome. The coverage of each genome varied from 500x to 900x. DOGMA v1.2 was used to perform genome annotation [8][9][10]55]. OGDRAW (http://ogdraw. mpimp-golm.mpg.de/) was used to draw the circular cp genome map of the Lythraceae species and then manually edited [56].

Codon usage
The relative synonymous codon usage (RSCU) is the ratio of the frequency of the specific codon to the expected frequency [57]. An RSCU > 1.00 means that a codon is used more frequently than expected, while an RSCU < 1.00 denotes that a codon is used less frequently than expected. The RSCU was obtained using DAMBE5 [58].

Genome comparative analysis and molecular marker identification
A total of 22 Lythraceae species were compared. Taking the L. excelsa annotation as the reference, the mVISTA in LAGAN mode was used to make pairwise alignments among the 22 cp Lythraceae species genomes [59].
The 77 protein coding regions of 22 Lythraceae species were used to evaluate evolutionary rate variation. DnaSP 5.1 was to calculate the rates of nonsynonymous (Ka) and synonymous substitutions (Ks) [60]. A total of 13, 318 Ka/Ks were obtained; the value could not be calculated if Ks = 0.
MEGA 6 was used to align the cp genomes after manual adjustments in BioEdit software [61]. Then, DnaSP 5.1 was used to separately evaluate the Pi values of the coding and noncoding sequences. Pi values across the complete cp genomes, LSC, SSC, and IR regions were also calculated using DnaSP 5.1 [62].

Phylogenetic analysis
To reconstruct the phylogenetic relationships and examine the phylogenetic status of Lythraceae within Myrtales, the complete cp genomes of 42 Myrtales species were used for analysis. Clustal X 2.1 software with default parameter settings was used to align 66 protein coding gene sequences, with manual adjustments to the alignment ends when necessary [65]. The data matrix used in phylogenetic analysis is provided as supplementary data. Evolutionary relationships were analyzed using MEGA 6 for maximum likelihood (ML) and maximum parsimony (MP), MrBayes 3.1.2 for Bayesian inference (BI) trees [60,66]. If the bootstrap values of the nodes were equal to 100%, they were not marked on the tree. In all analyses, eight species were considered outgroups. The phylogenetic trees were plotted in FigTree [67].