Comparative analysis of chloroplast genome structure and molecular dating in Myrtales

Background Myrtales is a species rich branch of Rosidae, with many species having important economic, medicinal, and ornamental value. At present, although there are reports on the chloroplast structure of Myrtales, a comprehensive analysis of the chloroplast structure of Myrtales is lacking. Phylogenetic and divergence time estimates of Myrtales are mostly constructed by using chloroplast gene fragments, and the support for relationships is low. A more reliable method to reconstruct the species divergence time and phylogenetic relationships is by using whole chloroplast genomes. In this study, we comprehensively analyzed the structural characteristics of Myrtales chloroplasts, compared variation hotspots, and reconstructed the species differentiation time of Myrtales with four fossils and one secondary calibration point. Results A total of 92 chloroplast sequences of Myrtales, representing six families, 16 subfamilies and 78 genera, were obtained including nine newly sequenced chloroplasts by whole genome sequencing. Structural analyses showed that the chloroplasts range in size between 152,214–171,315 bp and exhibit a typical four part structure. The IR region is between 23,901–36,747 bp, with the large single copy region spanning 83,691–91,249 bp and the small single copy region spanning 11,150–19,703 bp. In total, 123–133 genes are present in the chloroplasts including 77–81 protein coding genes, four rRNA genes and 30–31 tRNA genes. The GC content was 36.9–38.9%, with the average GC content being 37%. The GC content in the LSC, SSC and IR regions was 34.7–37.3%, 30.6–36.8% and 39.7–43.5%, respectively. By analyzing nucleotide polymorphism of the chloroplast, we propose 21 hypervariable regions as potential DNA barcode regions for Myrtales. Phylogenetic analyses showed that Myrtales and its corresponding families are monophyletic, with Combretaceae and the clade of Onagraceae + Lythraceae (BS = 100%, PP = 1) being sister groups. The results of molecular dating showed that the crown of Myrtales was most likely to be 104.90 Ma (95% HPD = 87.88–114.18 Ma), and differentiated from the Geraniales around 111.59 Ma (95% HPD = 95.50–118.62 Ma). Conclusions The chloroplast genome structure of Myrtales is similar to other angiosperms and has a typical four part structure. Due to the expansion and contraction of the IR region, the chloroplast genome sizes in this group are slightly different. The variation of noncoding regions of the chloroplast genome is larger than those of coding regions. Phylogenetic analysis showed that Combretaceae and Onagraceae + Lythraceae were well supported as sister groups. Molecular dating indicates that the Myrtales crown most likely originated during the Albian age of the Lower Cretaceous. These chloroplast genomes contribute to the study of genetic diversity and species evolution of Myrtales, while providing useful information for taxonomic and phylogenetic studies of Myrtales. Supplementary Information The online version contains supplementary material available at 10.1186/s12870-021-02985-9.


Background
The Myrtales belong to the Rosidae, which is one of the most speciose groups in the Rosanae clade of angiosperms [1,2]. According to APG IV [3], Myrtales consists of nine families, 380 genera, and approximately 13, 000 species. The nine families in the order are Alzateaceae, Combretaceae, Crypteroniaceae, Lythraceae, Melastomataceae, Myrtaceae, Onagraceae, Penaeaceae and Vochysiaceae. The species richness of families is unbalanced with relatively few species found in Alzateaceae, Crypteroniaceae and Penaeaceae. Species are widely distributed in the tropics, with Vochysiaceae showing an amphi-Atlantic disjunct distribution [2]. Species in Combretaceae are mainly distributed in tropical and subtropical regions, especially in African savannahs [4]. The order is morphologically diverse with herbaceous herbs, lianas, trees, and mangroves, as well as a wide variety of fruit types (berry, capsule, samara and drupe) [1] (Fig. 1). There are two main wood anatomical characteristics of Myrtales: bilateral vascular bundles in the primary stem and vascular bundles in the marginal depressions of secondary xylem, which are not common in other flowering plants. The combination of these two anatomical characteristics is exceedingly rare [5][6][7]. Many of the species of Myrtales have important economic [8], ornamental [9] and medicinal value [10,11].
With the rapid development of second-generation sequencing technology, the cost of sequencing has made phylogenomic approaches feasible on large scales, ushering in a new exploration of plant identification and classification. Complete plastome sequences have become powerful tools to answer questions about plant evolution from inferred phylogenies [12][13][14][15][16][17][18]. The plastome is an essential organelle in photosynthetic cells, playing an important role in maintaining life [19] and is mainly maternally inherited in angiosperms. Most plastome DNA consist of double chains with a length of 120-220 kb [20] and a highly conserved typical four part genome structure. In recent years, researchers have been devoted to structural and phylogenetic analyses of chloroplasts in many groups, including Myrtales [21][22][23]. Structural characteristics of the chloroplasts have been useful for examining the genetic diversity and species evolution, and vital in developing policies for the protection of germplasm resources [24][25][26].
Reginato et al. [21] reported comparisons of chloroplast genomes in Melastomataceae for the first time. The structure, gene content and general characteristics of 16 chloroplast genomes of Melastomataceae and eight published chloroplast genomes of Myrtales were compared and analyzed. They found that the chloroplast genomes of Melastomataceae, like most angiosperms, have a typical tetrad structure with a large single copy region containing 84 protein coding genes (CDS), 37 tRNA and eight rRNA, for a total of 129 genes [21]. Gu et al. [22] reported the plastome of Heimia myrtifolia, an important medicinal plant with a variety of pharmacological alkaloids in the Lythraceae. Later, combined with 22 samples of other species in the Lythraceae, the chloroplast genome structure was comprehensively analyzed and compared with that of other species in Myrtales. The chloroplast genomes of 22 species of Lythraceae ranged from 152,049 bp to 160,769 bp, and included 10 variation hot spots that were selected as potential molecular markers [23]. In addition, other chloroplast genomes of Myrtales have been reported recently. Rodrigues et al. [27] compared the structure, gene number and genome size of six chloroplast genomes of Myrtales finding them to be similar to those of other Myrtales species. However, previous studies on chloroplast genomes of Myrtales have not been consistent, with some based on families, genera or species. Up to now, the comprehensive analysis of chloroplast genome structure of Myrtales is lacking.
In addition to studying the chloroplast genomes structure of Myrtales, researchers also explored the divergence time and phylogeny of Myrtales, but most studies were based on gene fragments. A strong phylogenetic framework is necessary to provide a basis for studying speciation. In previous molecular phylogenetic studies, a handful of chloroplast loci along with the internal transcribed spacer (ITS) and other ribosomal regions of nuclear DNA have been used for phylogenetic analysis of Myrtales [2,7,28]. Conti et al. [7] used 50 taxa (including 39 species and 11 outgroups) and the chloroplast gene rbcL to reconstruct the phylogeny of Myrtales. The results showed that Onagraceae and Lythraceae were closely related to Combretaceae [7]. Sytsma et al. [28] constructed the phylogenetic divergence time of Myrtales based on the chloroplast gene fragments rbcL and ndhF from 79 species of Myrtales and five fossil calibration points, indicating that Myrtales differentiated in the early Albian (111 Ma) with Combretaceae being the earliest branch of Myrtales with low support. Berger et al. [2] amplified and sequenced 6 gene fragments (rbcL, ndhF, matK, matR, 18S and 26S) from 102 taxa of Myrtales, and estimated the divergence time of Myrtales using 10 fossil calibration points. The results showed that the crown of Myrtales was most likely dated to 116 Ma (95% HPD = 113.7-118.8 Ma), while the phylogeny also showed that the Combretaceae is a sister group of all other families of Myrtales [2]. More recently, Li et al. [18] used 80 genes from 2881 plastomes and 62 fossil calibrations to reconstruct an angiosperm wide phylogeny showing that Myrtales and all of its families were monophyletic. The resulting phylogeny showed that the clade of Myrtales and Geraniales had a crown age of 112.26 Ma, as well as Combretaceae and Onagraceae + Lythraceae being sister groups with strong support. Most of the studies based on chloroplast gene fragments inferred relationships with low support, so using chloroplast genomes to explore the time of species differentiation and reconstruct phylogenetic relationship has credibility.
Currently there are few previous studies on the chloroplast genome structure of Myrtales. Although the phylogenetic position and relationships of Myrtales has been studied using molecular methods, the support for the placement of Myrtales is generally weak due to the lack of phylogenetic signal and sparse taxonomic sampling. Therefore, we set out to expand the sampling, reconstruct the phylogenetic relationship of Myrtales by using whole chloroplast genomes and comparatively analyze the plastome structure of Myrtales to provide the foundation for future research. In this study, we sequenced the chloroplast genomes of nine new species (including species of Myrtaceae, Melastomataceae and Combretaceae) and combined them with existing plastome data for Myrtales from NCBI to obtain a total of 95 chloroplast genomes, representing six families, 78 genera, and three outgroups. The main objectives of this study were to 1) analyze the chloroplast genome structure and elucidate the genetic diversity of Myrtales, 2) reconstruct the phylogenetic relationship of Myrtales to specifically determine the phylogenetic position of Combretaceae, and 3) infer the divergence time of Myrtales.

Characteristics of chloroplast genomes
Six families were represented with the 92 Myrtales chloroplast genomes used in this study: Melastomataceae (42 species in five subfamilies), Myrtaceae (including 19 species in five subfamilies), Vochysiaceae (seven species), Lythraceae (13 species in three subfamilies), Onagraceae (three species in two subfamilies), and Combretaceae (eight species in one subfamily). All chloroplast genomes have a typical four part structure: large single copy region (LSC), small single copy region (SSC) and two inverted repeat regions (IRs) (Fig. 2) (Table 1). In addition, a total of 123-133 genes are encoded, of which 106-116 are single copy with 17 genes duplicated in the IR regions. Of the unique genes 77-81 are protein coding genes, 29-31 are tRNA genes, and four are rRNA genes. The total GC content of the chloroplast genomes are highly similar (36.9-38.9%), with the average GC content across the entire chloroplast genomes being 37%, while the different regions had slightly variable GC content with the LSC, SSC and IR ranging from 34.7-37.3%, 30.6-36.8%, and 39.7-43.5%, respectively (Tables 1 and 2).

Boundaries between IR and SC regions
In total, we analyzed and compared the differences between boundary regions of the SC and IR in 24 chloroplast genomes (15 samples from NCBI and the nine newly sequenced chloroplast genomes covering 16 subfamilies/ families within Myrtales). We found that most chloroplast genomes have similar characteristics. The junction of the LSC/IRb region of 23 chloroplast genomes was located at the rps19 and rpl2 genes, while the junction of LSC/IRb region of Salpinga maranonensis (NC_031888.1) was unique with the boundary at the rpl2 gene. Except for Oenothera villaricae (NC_ 030532.1) the boundary of IRb/ SSC was ccsA -ndhD. The ndhF gene was detected at the boundary of IRb/SSC in all other species. The ndhF gene of 11 species crossed the boundary of IRb/SSC, while ndhF of 12 species was completely found in the SSC region, ranging between 3 and 235 bp from the boundary. The gene ycf1 is at the SSC/IRa boundary except in Vochysia acuminata (NC_043811.1) and Oenothera villaricae (NC_ 030532.1). In total there are 20 species for which ycf1 crosses the boundary between SSC/IRa, two species in which ycf1 is completely in the SSC ranging from 63 to 381 bp away from the boundary, and one species in which ycf1 is completely in the IRa 1063 bp away from the boundary. The genes rpl2 and trnH (rpl2 is located in IRa, 53-139 bp away from the boundary, trnH is located in LSC, 0-216 bp away from the boundary) were detected in the IRa/ LSC boundary for 20 species. The genes rps19 and trnH (rps19 is located in IRa, 0-3 bp away from the boundary, trnH is located in LSC, 1-41 bp away from the boundary) were detected in the IRa/LSC boundary for three species, and rpl23 and trnH were detected in the IRa/LSC boundary for Salpinga maranensis (NC_031888.1) (Fig. 3).

Phylogenetic results
Both ML and BI analyses of the complete chloroplast generated almost identical topologies with strong support at every node [ML bootstrap (BS) = 100%, Bayesian posterior probabilities (PP) = 1] (Fig. 6). Melastomataceae, Myrtaceae, Vochysiaceae, Onagraceae, Lythraceae, and Combretaceae were fully supported as monophyletic, with Combretaceae resolved as sister to Onagraceae + Lythraceae clade (BS/PP = 100/1; (Fig. 6). Melastomataceae was recovered as sister to Myrtaceae + Vochysiaceae (BS/PP = 100/1). A clade of Melastomataceae + Myrtaceae + Vochysiaceae was recovered as sister to the clade of Combretaceae + Onagraceae + Lythraceae with strong support (BS/PP = 100/1). In addition, the phylogenetic trees constructed using the coding regions (CR), noncoding regions (NCR), LSC, SSC and NO-IRa phylogenetic trees (ML / BI) have the same topological structure at the family level as the phylogeny inferred from the full chloroplast with strong support ( Figure S1, S2, S3. S4 and S5). Observed differences were found in the phylogenetic relationships constructed by the IRb region, in which Melastomataceae was resolved as sister to Myrtaceae + Vochysiaceae + Lythraceae + Combretaceae, and Lythraceae was resolved as a sister to Combretaceae albeit with low support (Figure S6). Additionally, we expanded the outgroups to construct the phylogenetic relationship of Malvids, and the phylogenetic relationship of Myrtales was also strongly supported ( Figure S7).

Divergence time estimation of Myrtales
The results of the BEAST analysis of species divergence time in Myrtales are shown in Fig. 7

Discussion
Plastome structure comparisons and sequence divergence hotspots Previous studies have shown that the size of chloroplast genomes in angiosperms are between 120 and 180 kb, and the size of IR region is 20-30 kb [29]. The largest plastome is in the Vochysiaceae, and the smallest plastome is in the Lythraceae. The difference of plastome length between different families mainly lies in the difference of IR region length. The change in the overall length of chloroplast genomes is generally related to the expansion and contraction of IR regions [30]. The presented results are similar to those found in Pelargonium hortorum, Cryptomeria fortunei, Geranium, Pisum sativum, Vicia faba, and Erodium in which the size of the IR is increased, decreased or even completely lost [31][32][33][34]. In angiosperms, high conservation of the IR region is common, and is  important for stabilizing plastome gene structure [35] though changes have been reported including in some early diverging eudicots [36,37]. The nucleotide content of chloroplasts is relatively stable and the gene structure is highly conserved, though mutation hotspots do exist. Genes with a relatively high mutation rate can be used as DNA barcodes to help distinguish between accessions within a given taxon [38,39] and varieties in germplasm resources [40,41]. In this study, we used mVISTA to compare the whole chloroplast of 24 species of Myrtales and used DnaSP to analyze the percentage of variable loci in 74 coding genes and 114 non-coding regions. Similar to previous results, the variation of noncoding regions is greater than that of coding regions [42,43]. As observed in members of Adoxaceae and Panax notoginseng, the variation of the IR region of Myrtales is smaller than that of the SC region [44,45]. Previous studies investigating the phylogeny of Myrtales using only rbcL failed to resolve the phylogenetic position of the order. Our analyses showed that the nucleotide diversity of rbcL is relatively low compared to other loci (PI < 0.05) (Fig. 5, Table S1), which helps explain the low support found in phylogenies inferred with this gene [7]. We detected nine hot spots in coding regions and 12 hot spots in noncoding regions, which can be used as candidate DNA barcodes for future studies. These variable regions may also be useful for assessing phylogenetic relationships and interspecific differences of Myrtales species.

Phylogenetic relationships of Myrtales
Compared with previous studies based on a few chloroplast genome fragments, our results based on the major lineages of Myrtales (six families with more species within Myrtales) showed a highly resolved phylogenetic relationship of Myrtales by using whole chloroplast genomes [2,6,28]. Six major clades representing the major families are fully resolved with strong support (Fig. 7). Previous studies of Myrtales have provided an improved understanding of phylogenetic relationships among families based on both morphological and molecular analyses, however, the placement of Combretaceae has not been fully established with high confidence [2,6,28]. The phylogenetic location of Combretaceae is critical since its placement directly affects the age of Myrtales, hypotheses of diffusion and variation scenarios, species diversification rates, and features of trait reconstructions [2]. Most recent phylogenetic studies use a limited number of taxa and gene regions as placeholders for Combretaceae [7,28,46,47]. Our plastome phylogenomic analysis of Myrtales provides strong support for the sister relationship between Combretaceae and a clade of Onagraceae + Lythraceae (BS = 100%, PP = 1; Fig. 7), which is in agreement with some previous molecular studies, and a clade of Combretaceae + Onagraceae + Lythraceae is sister to a clade of Melastomataceae + Myrtaceae + Vochysiaceae [18,48]. The sampling of our study is not comprehensive at the family level with the phylogenetic relationship reconstructed including six of the nine families (lack samples from Crypteroniaceae, Penaeaceae and Alzateaceae). However, according to previous studies, this does not affect our determination of the phylogenetic position of the Combretaceae. We used the whole chloroplast genome to construct the phylogenetic relationships, as well as using multiple chloroplast gene data sets (excluding the chloroplast genome of IRa region, coding genes, noncoding genes, LSC, SSC, IRb) to compare the phylogenetic relationship comprehensively. We also reconstructed the phylogenetic relationship by adding extra taxa (within the branch of Malvids), providing an additional degree of credibility for the obtained phylogenetic trees [49,50] and determining the phylogenetic position of the Combretaceae. Further research should include sampling more individuals from wild populations and obtaining more extensive nuclear data to determine whether our results are consistent with those from nuclear genes.    Fossil limitations, different methods, size of molecular data and taxonomic sampling cannot be perfectly compared across all studies, with changes leading to differences in age estimates. Our analysis estimated that the diversity of major lineages of Myrtales occurred about 60-90 Ma [2,18]. In this period the species within Myrtales may have begun to differentiate rapidly, which is consistent with the common hypothesis that many species experienced rapid diversification events after the Cretaceous-Paleogene (K-Pg) boundary due to mass extinction and opening of new habitats [55][56][57]. Our results show that the species diversity of the main stem lineages of Myrtales increased at the end of the Campanian and may have been affected by the continental breakup of Gondwana in the Cretaceous [2].

Conclusions
In this study, we analyzed and compared the structural characteristics of chloroplast genomes of Myrtales, and inferred the phylogenetic divergence time of Myrtales. The chloroplast genomes of Myrtales has a typical four part structure, including 77-81 protein coding genes, 29-31 tRNA genes and four rRNA genes, with a total length of 152,214-171,315 bp. We found 21 mutation hotspots, which can be used as potential DNA barcodes in the future phylogenetic study of Myrtales. Phylogenetic relationships (Ml / BI) based on whole chloroplast genome and multiple datasets showed that Myrtales and its families were monophyletic, as well as Combretaceae and Onagraceae + Lythraceae strongly supported as a clade, ( In the future, we will expand genomic sampling, including nuclear genomes, to comprehensively compare and discuss the phylogeny and evolution of Myrtales species.

Taxon sampling
Leaf material from nine species, representing seven genera and three families in Myrtales, was collected and  Table 1. We also downloaded 17 chloroplast genomes from NCBI, including six different orders to serve as outgroups to construct a branch of Malvids to explore the topological changes of Myrtales (Table S2).

DNA extraction, sequencing and assembly
We used a modified cetyltrimethyl ammonium bromide (CTAB) method to extract high quality DNA from dried leaves [58]. Quality of DNA was determined on an Agilent 2100 BioAnalyzer by using ≥0.8 μg at the University of California Davis Genome Center (Davis, California, USA). We constructed paired-end sequencing libraries with insert sizes of 200-400 bp with Illumina TruSeq™ Nano DNA Sample Prep Kit and sequenced using the BGISEQ-500 at the Beijing Genomics Institution (BGI; Shenzhen, China). Raw reads were filtered with SOAPfil-ter_v2.2 for quality control with the following parameters: 1) remove low quality reads (> 10% Ns and/or > 40% low quality bases), 2) remove PCR duplicates, and 3) trim adaptor sequences. We selected the rbcL gene of Arabidopsis thaliana from NCBI (accession number: U91966) as a seed and assembled chloroplast genomes for each species using the clean reads with NOVOPlasty [59]. The longest contig assembled by NOVOPlasty was compared with chloroplasts deposited in the NCBI database, and obtained the chloroplast genome sequence with the highest homology (minimum requirement: evalue < 10-7, identity > 95%) to us as the reference (Table 3) for subsequent assembly using MITObim v1.8 [60]. Quality of the assemblies were assessed by mapping clean reads using BWA MEM (Burrows-Wheeler Aligner) v0.7.17 [61] to verify the integrity of newly assembled plastome [62].

Plastome comparative analysis and molecular marker identification
Plastome comparisons across 24 Myrtales species (15 samples from NCBI and the nine newly sequenced chloroplast genomes covering 16 subfamilies/families within Myrtales) were performed in Shuffle-LAGAN mode on the mVISTA program (genome.lbl.gov/vista/ index.shtml [66];), using the annotation of Vochysia acuminate (NC_043811) as a reference. To reveal highly variable regions for future species identification studies and to evaluate different plastome regions that may show different evolutionary patterns, we sequentially extracted both coding regions and noncoding regions (including intergenic spacers and introns) after alignment with MAFFT v7 [67] using the criteria that the aligned length is > 200 bp and at least one mutation per site was present. The nucleotide variability of the selected regions was evaluated using DNASP v5.10 [68]. The IR / SC boundary map of these 24 Myrtales chloroplast was drawn with Photoshop. The IR area was confirmed using UNIPRO ugene v1.32 [69].

Phylogenetic analysis
Phylogenetic analyses were conducted on 95 species, using Viviania marifolia (NC_023259), Pelargonium tetragonum (NC_031205), and Pelargonium quercifolium (NC_031203) as outgroups based on a previous study [2]. Plastome sequences were aligned using MAFFT v7 [67] and manually checked when necessary. The complete chloroplast genome sequence and chloroplast genome minus one copy of the inverted repeat (No-IRa) were used to construct the phylogenetic topology using maximum likelihood (ML) and Bayesian inference (BI).
To evaluate alternative hypotheses, phylogenetic topologies were inferred using both maximum likelihood (ML) and Bayesian inference (BI) methods using the complete plastome sequences and whole plastome minus one copy of the Inverted Repeat (No-IRa). We also included other data sets (i.e., coding area, noncoding area, LSC, SSC and IRb) for analyses. The best-fitting model of molecular evolution (GTR + GAMMA+I) ( Table 4) was determined using Akaike Information Criterion (AIC) in JMODELTEST v2.1.7 [70]. Maximum likelihood analyses were conducted in RAxML-HPC v8.2.8 [71] with 1000 bootstrap replicates on the CIPRES Science Gateway portal [72]. Bayesian analyses were performed in MRBAYES v3.2 [73]. Two independent Markov Chain Monte Carlo chains were conducted simultaneously for 5 million generations with trees sampled every 1000 generations. The effective sample size (ESS > 200) was determined using Tracer v1.7 [74] and the first 25% of trees were discarded as burn-in, and a consensus tree was constructed from the remaining trees to estimate posterior probabilities (PPs). FigTree v1.4.4 [75] were used for visualizing the resulting phylogenetic trees.

Divergence time estimation
The complete 92 plastome dataset of Myrtales was analyzed using the GTR + GAMMA+I model selected by MrModelTest [76] in BEAST v.1.8.4 [75] to simultaneously search for the best tree topology and estimate node ages. The divergence time between lineages was estimated using a Yule speciation prior and an uncorrelated lognormal model of rate change with a relaxed clock. Four fossil-based calibration points and one secondary calibration point were used to constrain the crown node age of Myrtales.