Comparative chloroplast genome analyses of Avena: insights into evolutionary dynamics and phylogeny

Background Oat (Avena sativa L.) is a recognized health-food, and the contributions of its different candidate A-genome progenitor species remain inconclusive. Here, we report chloroplast genome sequences of eleven Avena species, to examine the plastome evolutionary dynamics and analyze phylogenetic relationships between oat and its congeneric wild related species. Results The chloroplast genomes of eleven Avena species (size range of 135,889–135,998 bp) share quadripartite structure, comprising of a large single copy (LSC; 80,014–80,132 bp), a small single copy (SSC; 12,575–12,679 bp) and a pair of inverted repeats (IRs; 21,603–21,614 bp). The plastomes contain 131 genes including 84 protein-coding genes, eight ribosomal RNAs and 39 transfer RNAs. The nucleotide sequence diversities (Pi values) range from 0.0036 (rps19) to 0.0093 (rpl32) for ten most polymorphic genes and from 0.0084 (psbH-petB) to 0.0240 (petG-trnW-CCA) for ten most polymorphic intergenic regions. Gene selective pressure analysis shows that all protein-coding genes have been under purifying selection. The adjacent position relationships between tandem repeats, insertions/deletions and single nucleotide polymorphisms support the evolutionary importance of tandem repeats in causing plastome mutations in Avena. Phylogenomic analyses, based on the complete plastome sequences and the LSC intermolecular recombination sequences, support the monophyly of Avena with two clades in the genus. Conclusions Diversification of Avena plastomes is explained by the presence of highly diverse genes and intergenic regions, LSC intermolecular recombination, and the co-occurrence of tandem repeat and indels or single nucleotide polymorphisms. The study demonstrates that the A-genome diploid-polyploid lineage maintains two subclades derived from different maternal ancestors, with A. longiglumis as the first diverging species in clade I. These genome resources will be helpful in elucidating the chloroplast genome structure, understanding the evolutionary dynamics at genus Avena and family Poaceae levels, and are potentially useful to exploit plastome variation in making hybrids for plant breeding.


Background
Most chloroplast (cp) genomes (plastomes) of land plants have a typical quadripartite structure with a pair of inverted repeats (IRs) separated by a large single-copy (LSC) region and a small single-copy (SSC) region, and genome size ranging from the reduced plastome of 85 kbp in the non-photosynthetic gymnosperm Parasitaxus usta, up to 218 kbp in Pelargonium [1,2]. With high throughput sequencing methods [3], complete chloroplast genome sequences are widely used to improve phylogenetic resolution at the interspecific level and resolve the parentage of hybrid or polyploid taxa [4]. Some 866 plastid genomes of Poaceae have been assembled and deposited in the National Center for Biotechnology Information (NCBI:txid4479; accessed on 26 June 2020) Organelle Genome Resources since the publication of the first angiosperm chloroplast genome sequence of Nicotiana tabacum [5]. The massive complete chloroplast genome sequences, together with knowledge about chloroplast sequence variation, means complete plastid sequencing has become an effective tool for plant phylogenomic analysis.
The plastome datasets have been widely used for resolving the recalcitrant phylogenetic relationships in plants, in part because the plastome has long been considered as a single evolutionary unit, meaning that genes can be concatenated in order to dissect phylogenetic signals [6,7]. Chloroplast genomes have also been identified with polymorphic regions caused by genomic expansions/contractions [8], inversions [9], gene rearrangements [10], etc. These mutational events in turn become synapomorphies for different plastid loci, sometimes yielding incongruent topologies. For example, three LSC inversions from trnS-GCU to psbA had been reported in some species of Poaceae [10], a LSC inversion from trnE-UUC to rpoB and a SSC inversion from rps15 to ndhF were detected in some Asteraceae lineages [11,12]. Whether such intermolecular recombinations exist in Avena plastomes remains uncertain, and their phylogenetic utilization merits further comparative study [13].
Oat (Avena sativa L., genomes AACCDD) is a recognized health food because of oat beta-glucan, which can actively reduce low-density lipoprotein cholesterol level and coronary heart disease risk [14]. The c. 29 species of Avena (Poaceae) are diverse in morphological and ecological terms, occurring across Asia, Europe and the Mediterranean Basin, Eastern Africa, and the Americas [15]. The genus includes diploids with A-or C-genomes (2x = 14), and a polyploid series with AB-or AC-genome tetraploids (4x = 28) and ACD-genome hexaploids (6x = 42) [16]. The A-and C-genome diploids are distinguished by the glume relative length and the chromosome structural differentiation [17], while the B and D genomes are not found in any extant diploids. The large size and highly repetitive nature of A. sativa genome has precluded the development of genomic breeding and selection of new oat varieties [18]. In recent phylogenetic analyses, oat was inferred to experience allopolyploidy events involving A-, C-and D-genome ancestors [16,19], while the dynamic genomes with frequent chromosome translocations make it difficult to disentangle candidate progenitor species [20]. Plastid phylogenomics can be used for resolving contentious relationships [6,7,11], so the plastome-inferred phylogenies between oat and its wild relatives deserve detailed evaluation.
Comparative plastome studies show evolution of tandem repeats, insertions/deletions (indels) and single nucleotide polymorphisms (SNPs), and the role of tandem repeats in generation of substitutions and indels [21,22], with certain plastome regions being predisposed to indel and substitution mutations. If the distribution of plastome repeat sequences can be determined, it is feasible to predict microstructural variations by the correlation analyses between repeats, indels and substitutions. In addition to the paucity of genomic resources, the A-genome lineage phylogeny is enigmatic in Avena [19,23,24]. Thus, it is important to fully address polymorphic regions of Avena chloroplasts in an evolutionary context.
In current study, we report the chloroplast genome structure characterization, the polymorphic regions, and plastid phylogenomics in Avena using new plastid sequences and comparisons with published sequences. Our objectives were to: (1) gain insight into plastome structure features; (2) examine the intermolecular combination and microstructural variation domains; and (3) understand the evolutionary dynamics in selected species among 25 published Avena plastomes.

Plastome structure of Avena species
Avena plastomes display a typical quadripartite circular structure containing one large single copy (LSC), one small single copy (SSC), and two inverted repeat (IRB and IRA) regions by GView [25] and OGDRAW [26] ( Fig. 1 Figure S1).
Avena plastomes contain the same set of 131 genes encoding 84 proteins, eight ribosomal RNAs (rRNAs) and 39 transfer RNAs (tRNAs) ( Table 2). In Avena plastomes, small fragments of truncated ndh genes are detected in LSC (ndhC, 363 bp; ndhJ, 480 bp) and SSC regions (ndhE, 306 bp) while only partial ndhH in IRB region, the C (carboxy)-terminal part of ndhH is encoded by the SSC block and its N (amino)-terminal part by the neighbouring IRB region. The remaining ndh genes are the complete gene size.
In oat plastome, the rpoC1 intron is absent (Additional file 13: Figure S2). The predicted amino acid sequence is extremely hydrophilic and anionic, suggesting that the corresponding region may be post-translationally removed. Neither sequences nor amino acids encoded by rpoC1 gene have been altered excessively by the intron loss (Additional file 14: Figure S3). ClpP had lost its two introns in Avena plastomes. Nineteen genes are duplicated in IRs from rps19 to rps15, including seven protein-coding genes (rps19, rpl2, rpl23, ndhB, rps7, rps12, and rps15), of which two use non-ATG start codons (rps19, GTG and rpl2, ACG; Table 2, Fig. 1). There are 16 different intron- Fig. 1 Chloroplast genome map of Avena sativa_Liu312 (GWHAOPK01000000; the outer circle and rings "s-t") and GView comparison [25] of thirteen Avena species and three outgroup species plastomes (rings "c-r"). Genes belonging to different functional groups are shown in different colors. Genes shown on the outside of outer circle are transcribed counter-clockwise and on the inside clockwise. The tRNA genes are represented by one letter code of amino acids with anticodons. LSC, large single copy region; IR, inverted repeat; SSC, small single copy region. Rings "a-b" from the innermost ring denote GC skews and GC content deviations from A. sativa_Liu312 plastome GC content, respectively; rings "c-r" denote the plastome sequence comparison by BLAST between A. sativa_Liu312 and other species plastomes outwards in turn: A. eriantha_Liu435, A. ventricosa_Liu275, A. atlantica_Liu437, A. wiestii_Liu439, A. strigosa_Liu 315, A. nuda_Liu443, A. hirtula_Liu299, A. sterilis_NC031650.1, A. sativa_NC027468.1, A. brevis_Liu289, A. sativa_Liu312, A. murphyi_Liu442, A. longiglumis_Liu438, wheat (Triticum aestivum_NC002762.1), maize (Zea mays_NC00166.1), and rice (Oryza sativa_NC008155.1); plastome similar and highly divergent locations are represented by continuous and interrupted track lines (except for 14 non-overlapping tracking bins across 16 rings "c-r"), respectively, with a heuristic considering the total number of plastome bases (135,887 bp to 135,998 bp) contributing to hits and their scores with 0-14 nonoverlapping 10 kbp tracking bins. Rings "s-t" denote AT and GC content of Avena sativa_Liu312 plastome (GWHAOPK01000000) by OGDRAW [26]  containing genes, of which six are tRNA coding genes and rps12 and ycf3 contain two introns. The trnK-UUU has the largest intron (2435 bp) with matK located within it (Table 3). BLAST analyses of Avena species, wheat (Triticum aestivum_NC002762.1), maize (Zea mays_NC00166.1), and rice (Oryza sativa_NC008155.1) plastomes reflect the shared structural features. Avena plastomes share equivalent distribution patterns of GC islands where G and C are distributed unevenly between DNA strands (rings "a-b" in Fig. 1). The lower overall sequence identity is shared by wheat, maize, and rice (rings "p-r" in Fig. 1) compared to the congeneric Avena species. In Avena, a high level of similarity is restricted to IRs, and major differences originate from LSC and SSC regions. Avena eriantha and A. ventricosa plastomes share five highly divergent sequence locations in LSC region (rps16-trnQ-UUG, trnC-GCA-rpoB, trnT-UGU-trnL-UAA, ycf4-cemA, and psaJ-trnP-UGG) and one location in SSC region (ndhF-rpl32) to A. sativa_Liu312 plastome (rings "c-d" in Fig. 1). Avena atlantica, A. wiestii, A. strigosa, A. nuda, and A. hirtula share two highly divergent locations in LSC region (rps16-trnQ-UUG and trnV-UAC-trnM-CAU) and one highly divergent location in SSC region (ndhF-rpl32) to A. sativa_Liu312 plastome (rings "e-i" in Fig. 1). Wheat, maize, and rice outwards in turn according to the phylogenetic location.
Avena sterilis, A. sativa, A. brevis, A. murphyi, and A. longiglumis plastomes show high similarity in LSC, SSC and IR regions (rings "j-o" in Fig. 1), ribosomal gene clusters, tRNAs and the junction areas, where rpl22, rps19, ndhH, ndhF, and psbA are encoded (Additional file 15: Figure S4). Photosynthesis-related protein-coding genes in Avena plastomes show similarity to other angiosperms, including genes for ATP synthase, photosystem I and II, and RuBisCO. Other genes such as rps3 and ycf3 show less conservation.
The mVISTA [28] analysis shows overall sequence identity and divergent regions in Avena. A high degree of synteny and gene order conservation indicate an evolutionary conservation at plastome level ( Fig. 4). Notably, LSC and SSC regions are more divergent than IRs, and non-coding regions show a higher sequence divergence than in coding regions (Additional file 3: Table S3).
IR size is different among eleven Avena plastomes presented here available in Genome Warehouse database  Figure S4b).
There are 276 simple sequence repeats in Avena plastomes by Perlscript MicroSAtellite (MISA) [32], with the mononucleotide simple sequence repeats (SSRs) being the most abundant (Additional file 5: Table S5a). In each case, the 21-24 mononucleotide SSRs, one dinucleotide Sequence comparison of thirteen Avena species, wheat, maize, rice, and Taraxacum amplum plastomes. The mVISTA based similarity graphical information portrays sequence identity to A. sativa_Liu312 as a reference plastome. Grey arrows above the alignment denote the gene orientation. A cut-off of 50% identity is used for the plots. In each plot, the Y-scale axis represents percent identity (50 to 100%). Dashed rectangles indicate highly divergent regions compared with the reference plastome SSRs, and one-three composite SSRs are identified. No trinucleotide or tetranucleotide cpSSR exists in Avena. In general, the intergenic cpSSRs are more abundant than the genic cpSSRs in Avena, where the 15-20, fourfive, and two-six SSRs are found in the intergenic, coding, and intron regions respectively (Additional file 5: Table S5b, Additional file 17: Figure S6). This also occurs in Asteraceae [11], Orchidaceae [33] and Fabaceae [34], probably due to an associated lower polymorphism of coding regions in contrast to non-coding regions. The C-genome diploids A. eriantha or A. ventricosa plastomes harbour 24 mononucleotide SSRs (92.31%) with 20 SSRs (77.00%) in the intergenic regions (Additional file 17: Figure S6). SSRs are more abundant in LSC region than in SSC and IR regions (Additional file 5: Table S5a). The majority of mononucleotide SSRs  Table S2).
We have visualized the extent to which 868 tandem repeats and 221 indel mutations are nonnormally distributed among Avena plastomes within 150 bp windows (Additional file 7: Table S7) using R package circlize [35]. There are 81 insertions (66.94%) and 57 deletions (57.00%) located within tandem repeats (rings "c-e" in Fig. 6). We have explored the extent of genome-wide association between tandem repeats, indels and SNPs in Avena species alignments with A. sativa_Liu312 as a reference (Additional file 6: Table S6a, S6b). Due to the nonrandom distribution of mutation data in Avena plastomes, Spearman's Rho correlations are performed among tandem repeats, indels and SNPs (Table 4). All these correlations are observed with high significance (p < 0.01). The average of correlations is stronger  Fig. 6 Circos plot based on complete chloroplast genome alignment between eleven Avena species presented here available in Genome Warehouse database and two Avena species available in NCBI (NC031650.1 and NC027468.1) with A. sativa_Liu312 as a reference. All data in rings showing the location relationship between tandem repeats, insertions/deletions (indels) and single nucleotide polymorphisms (SNPs) by R package circlize [35] in the non-overlapping 150 bp bins. For plastome alignment, rings "a-b" show the location of substitutions and SNPs, respectively. Dots with a high relative positions in rings "a-b" represent more polymorphic loci in the 150 bp window. Rings "c-d" show the location of deletions and insertions, respectively. Ring "e" shows tandem repeat (7 bp-95 bp) location by Phobos [31] with parameters of repeat length ≤ 100 bp and sequence identify ≥85%. The relatively height of columns in rings "c-e" represent the relative number of polymorphic loci belonging to deletions, insertions, or tandem repeats within the 150 bp window, respectively. Ring "f" shows A. sativa_Liu312 chloroplast map with coding genes labeled in green, rRNAs in yellow and tRNAs in blue. Blue shadows denote the surrounding location of tandem repeats and deletions; Red shadows denote the surrounding location of tandem repeats and insertions. LSC, SSC, IRs denote large single-copy, small singlecopy, and inverted repeat regions of Avena plastome alignment. Total 702 tandem repeats, 141 insertions, 100 deletions, 992 SNPs, and 33 substitutions shown in the diagram Table 4 Spearman's Rho correlation analysis result among tandem repeats, indels and SNPs using R v.3.5.3 [36] with correlation strengths of Akoglu [37] based on plastome alignments between eleven Avena species presented here available in Genome Warehouse database and two Avena species available in NCBI (NC031650. 1  between tandems and indels, followed by tandems and SNPs then by indels and SNPs. The average value of correlations between tandems and indels is 0.3585, between tandems and SNPs is 0.2607, and between indels and SNPs is 0.2606 in thirteen Avena species. Mann-Whitney U test results show that a significant difference in the number of indels and SNPS from the tandem window and the non-tandem windows, with the corresponding p-vaules being 1.251 × 10 − 9 and 1.477 × 10 − 9 , respectively.

Codon usage bias
The  Fig. 7, Additional file 19: Figure S8). Leucine (10.76-10.78% of each species) and cysteine (1.09% of each species) are the most and the least abundant amino acids except for stop codons (0.42% of each species; Additional file 9: Table  S9b). Our findings match the trend reported across other angiosperm plastomes [7], which show leucine and isoleucine to be the most common codons. No codon bias can be shown by methionine (AUG) and tryptophan (UGG), encoded by only one codon. Codon usage is biased towards A and T at the third codon position (Additional file 10: Table S10a, S10b, Additional file 19: Figure S8). The AT content for first, second and third codons average 52.7, 60.3 and 70.0% in Avena (Additional file 8: Table S8a) Table S11a, S11b).

Phylogenomic analyses
Our phylogenomic analyses substantially increased resolution and provided robust phylogenetic relationships in Avena (Fig. 8). Maximum likelihood (ML) topologies were constructed using not only the combined sequences of LSC and IRB intermolecular recombination fragments (28, 164 bp, 14,262 bp), but also those of combined ten most polymorphic genes (11,028 bp, 8212 bp) and those of combined ten most polymorphic intergenic regions (3099 bp, 2374 bp) in order to develop suitable plastid markers for inferring the interspecific phylogenetic relationships (Additional file 20: Figure S9 [39][40][41]). Numbers above nodes are the trustable Shimodaira-Hasegawa test-approximate likelihood-ratio test support/Ultrafast bootstrap support (SH-aLRT > 80.0%, UFBoot > 80.0%). Phylogenetic tree of LSC intermolecular recombination fragments is consistent with complete plastome ML tree topology (Additional file 20: Figure S9a), showing the basal position of A. longiglumis within clade I with strong support (SH-aLRT = 100%, UFBoot = 100%). The remaining topologies show that A. longliglumis is sister Fig. 8 Maximum likelihood tree inferred from Avena plastome sequences. Two clades are identified: clade I includes A-genome diploids, tetraploid A. murphyi and hexaploid A. sativa, and clade II includes C-genome diploids. Node support denotes the maximum likelihood bootstrap value. Pink, red and green taxa correspond to A-, C-genome diploid and polyploid species in Avena, respectively to subclade I with medium support (SH-aLRT = 82.8%, UFBoot = 86.0%) by IRB intermolecular recombination fragments (Additional file 20: Figure S8b) with weak support (SH-aLRT < 50.0%, UFBoot < 50.0%) and by the phylogenetic topologies of protein-coding genes and intergenic regions with high nucleotide diversities (Additional file 20: Figure S9b-S9f). ML topology is further constructed using plastomes of Avena species presented here available in Genome Warehouse and 25 Avena species available in NCBI [19,24]. It is without significant topological change compared to a previous Bayesian maximum clade credibility (MCC) tree [19]. Namely, clade I contains the A. sativa inserted subclade I (from A. agadirianan to A. sterilis), the A. nuda inserted subclade II (from A. atlantica to A. nuda) and the basal position of A. longiglumis together with extra three diploid species (A. canariensis, A. damascene and A. lusitanica) in ML tree of 38 Avena plastomes (Additional file 21: Figure S10).

Plastome evolution
All Avena plastomes possessed the typical gramineous plastome structure including the LSC and IRB intermolecular recombinations that are present in nearly all Poaceae, e.g. Oryza, Hordeum, Triticum, Secale, Cenchrus, Saccharum, Sorghum and Zea (Fig. 3). Liu et al. [11] suggested that the LSC recombination region has the gene (partially) missing in certain lineages, e.g. the missing clpP introns or rpoC1 intron, and this could be a particularly active region for sequence rearrangement in Poaceae plastomes [42]. Avena plastome average GC content (38.48%) is similar to other monocot plastomes, such as Elodea canadensis (37%) [43], Smilax china (37.25%) [11], and Najas flexilis (38.2%) [44]. According to report, 70 up to 88 protein-coding genes are present in angiosperm plastomes [45], there are 84 such genes in Avena plastomes. The ndh gene size and content vary widely among some heterotrophic species, which retain photosynthetic ability due to the non-functional role of ndh genes [44]. In the Asteraceous dandelion (Taraxacum amplum Markl. in T. officinale F.H. Wigg. agg.) plastome, rpoC1 is interrupted by a ∼ 680 bp intron, and two exons expand for 450 and 1638 nt respectively [12]. In oat plastome, the rpoC1 intron is absent. Intron excision from the primary transcript by splicing represents a prerequisite for translation of messenger RNAs (mRNAs) into the correct full-length protein, while the selective pressure of group II intron splicing events may have been important and their role underappreciated [46]. The rps12 is trans-spliced with exon 1 coded in LSC region and exons 2 and 3 in IRs. Spliceosomal introns are a feature of eukaryotic nuclear genes, and intron splicing can enhance gene expression [47]. Other aspects of mRNA metabolism, including pre-mRNA polyadenylation, editing transcription and mRNA decay are also influenced by removal of introns by the spliceosome [48].
The LSC/IRs and SSC/IRs borders are relatively conserved among angiosperm plastomes, mostly positioned within rps19 or ycf1 [49]. Compared with the ancestral angiosperm genome structure (represented by Nicotiana tabacum [50]), the Avena plastome IRs have expanded into the LSC region, resulting in the movement of rps19. Significant expansions have been reported in other plants, such as the extreme 50 kbp expansion found in Pelargonium × hortorum L.H. Bailey [51], and the 4 kbp expansion in Jasminum nudiflorum Lindl [52].. Here, no significant IR length variation was detected among Avena plastomes.
It is well known that certain plastome regions show different mutation rates. Dispersed repeats may facilitate intermolecular recombination and plastome diversity creation, because the genome regions with increased sequence diversity could be formed by repeat sequence abundance in prokarya and eukarya [53]. The cpSSRs and indels are mainly distributed in the non-coding regions of Avena plastomes, the similar distribution preference of cpSSRs and indels has been reported in Olea europaea, Pseudoroegneria libanotica and Salvia miltiorrhiza [54][55][56].
The nonrandom distribution of tandem and mutation locations have been found in Avena plastome visualization (Fig. 6), and correlations at the interspecific level also exhibits variations. We observe moderate correlations between tandems and indels, and weak correlations between tandems and SNPs and between indels and SNPs in Avena plastomes. Comparisons of tandempresence-bin and tandem-absence-bin show strong difference for numbers of indels and SNPs, these differences are also statistically significant (p < 0.001) by Mann-Whitney U test results. Araceae and Malvaceae plastomes have associations between repeats, indels and substitutions [21,22], and the adjacent position relationships suggest a role for tandem repeats in SNP and indel mutations in Avena. These results support the hypothesis that tandem repeats play an important role in causing plastome mutations [21].
All protein-coding genes were found to be under purifying selection. This pattern has also been demonstrated in other Poaceae plastomes such as rice, maize and wheat [57], reflecting the typically conservative plastid genome across most angiosperms. Changes of nucleotide substitution rates have been correlated with obligate parasitism rather than loss of photosynthesis [58]. In parasites, there is less selective pressure on plastid function, so purifying selection on genes encoding proteins for DNA maintenance and expression may be relaxed. Increasing specialization on external carbon (e.g., improved nutrient acquisition efficiency) thus leads to changes of evolutionary rates of plastid housekeeping machinery because of the lifestyle, although with most plastid proteins being encoded in nucleus, the selection strength at protein-coding loci in plastids is unclear.

Phylogenetic analysis in Avena
To resolve relationships among closely related diploid species, it is imperative to identify rapidly evolving loci [11]. Concordant interspecific relationships are recovered by the phylogenies inferred from LSC intermolecular recombination fragment and from complete plastome sequences. There might be evolutionary convergence between the fragment from psbD to trnR-UCU and complete plastome sequences as a result of gene interaction and co-evolution to conserve the chloroplast functions. The complete plastome ML tree supports the basal position of A. longiglumis within clade I with strong support (MLBS = 100%; Fig. 8), and those highly supported within the A-genome diploid-polyploid lineage in previous studies [19,24] and within clade I of ML tree of 38 Avena plastomes (Additional file 21: Figure S10). Diploid A. longiglumis together with A. canariensis, A. damascena and A. lusitanica are proposed to be oat potential ancestors [19,24], whose contributions for oat speciation are deserve further investigation in the context of genomics and cytogenetic data.

Conclusion
Diversification of Avena plastomes is explained by the presence of highly diverse genes and intergenic regions, LSC intermolecular recombination, and the cooccurrence of tandem repeats and indels or single nucleotide polymorphisms. The study demonstrates that the A-genome diploid-polyploid lineage maintains two subclades derived from maternal ancestors, and the diverged A-genomes originate from the diploid A. longiglumis, the optimum candidate ancestor for the Agenome in polyploid species. The genus does deserve attention as a model system to understand the underlying mechanisms of plastome evolution, because of the need to mine genetic resources from both chloroplast and nuclear genomes for crop variety breeding.

Isolation of DNA and sequencing
Seeds of eleven Avena species (sample origin and genome designation [16] given in Additional file 1: Table  S1) were obtained from CN-Saskatchewan and USDA-Beltsville Germplasm System in this study. Healthy leaf samples were collected from South China Botanical Garden Greenhouse in Guangzhou, China. Whole genomic DNA was extracted from 100 mg fresh leaf tissue using the DNeasy Plant Mini Kit following the manufacturer's protocol (Biomed, Beijing, China).
DNA libraries were prepared and sequenced with the Illumina HiSeq2500 platform (Illumina, San Diego, CA, USA) for A. brevis, A. hirtula, A. strigosa and A. sativa with PE250 bp reads from 300 bp insert libraries and the remaining seven species with PE250 bp reads from 500 bp insert libraries. Project data have been deposited at the Genome Sequence Archive (GSA) of National Genomics Data Center with accession number CRA003107 (https://bigd.big.ac.cn/search/?dbId=&q=CRA003107). Three coding gene sequences with the highest coverage were used the seed sequence for de novo assembly of Avena plastome by NOVOPlasty v.2.6.2 [64] with A. sativa (NC027468.1) as a reference for IR regions correction (Fig. 1). Then Geneious R11 [65] was used for contigs merging and de-redundancy. Chloroplast circularization and initiation site determination were manually processed.

Plastome assembly and annotation
Many protocols are available for chloroplast genome assembly including SOAPdenovo2 [66] and Velvet [67]. Velvet is suitable for small genome assembly, and we found SOAPdenovo2 and Velvet cannot independently assemble one complete plastome sequences of Avena without auxiliary gap-filling softwares. Since NOVO-Plasty v.2.6.2 [64] is capable of extending one read into a complete circular genome through extending the given seed to form circular genome, three single copy coding gene sequences with the highest coverage were used as starting seed sequences for de novo assemblies of Avena plastomes by NOVOPlasty separately with k-mer values of 79, 89, 99 and 109 (Fig. 1). Chloroplast cyclization and initiation site determination were optimized manually. The integrity of Avena plastomes was also verified by Velvet contigs covered 99% of the NOVOPlasty assemblied plastomes. Average coverage depth (1634 × to 5339 ×) was calculated by Geneious R11 [65] by mapping the total clean reads to de novo assembled plastome of each species (Table 1, Additional file 12: Figure S1).

Complete chloroplast genome comparison
The data gave maximum coverage depth of 2643 × to 9373 × for Avena plastomes (Additional file 12: Figure  S1). Chloroplast genome similarity of eleven Avena species, wheat, maize, and rice were assessed using BLASTN by GView with 10 kbp connection windows [25]. The plastomes of eleven Avena species, two accessions of A. sativa, gramineous outgroups and Taraxacum amplum were respectively aligned by Mauve [27] in order to investigate intermolecular recombination events (Figs. 2, 3).
Plastome structures among Avena species, wheat, maize, rice, and dandelion were compared by mVISTA percent identity plot in Shuffle-LAGAN mode [28,70] to reveal several major genomic variations located in LSC and SSC regions (Fig. 4). Subsequently, nucleotide diversity of single copy genes and intergenic regions was estimated for two sequence datasets (thirteen Avena spe-cies+Triticum aestivum plastomes) by DnaSP v.6 [29] (Additional file 4: Table S4, Fig. 5). Plastome genetic architecture of eleven Avena species available in Genome Warehouse, Triticum aestivum plastomes and fourteen Avena species available in NCBI for LSC/IRs and SSC/IRs borders were analysed by IRscope [71] (Additional file 15: Figure S4).
Circos plot, showing the locations and relationships between tandem repeats, insertions/deletions (indels), and single nucleotide polymorphisms (SNPs), was generated by R package circlize [35]. Comparative plastome studies show that certain plastome regions are predisposed to mutations due to nonrandom distribution of tandem and mutation locations (Fig. 6), so Spearman's Rho correlation analyses are performed to measure the degree of association between tandems and mutations and between mutations by R v.3.5.3 [36]. The strengths of Spearman rank correlations between two variables have set threshold values as follows: negligible or very weak (0.1-0.19), weak (0.20-0.29), moderate (0.30-0.39) and strong (0.4-0.69) [37]. The probability (p) of significance of the correlations was tested at α-level of 0.01. Mann-Whitney U tests are further performed to decide whether tandem repeat presence/absence affects the numbers of indels and SNPs in each 150 window by R v.3.5.3 (Additional file 7: Table S7), using grouping variables (tandem repeat number > 0 or = 0) and outcome variables (indel number and SNP number) in each 150 bp window.

Phylogenomic analysis
Phylogenetic trees were constructed by ML analyses using Avena and gramineous outgroup plastome sequences (162,505 bp) based on recent phylogeny [73]. The sequences were aligned and then manually adjusted by BioEdit [74]. The best-fitting nucleotide substitution model was determined using the Akaike Information Criterion (AIC) in iModeltest v.2.1.10 [75]. The GTR + I + G model was used in ML analyses, which were performed by MEGA v.6.0 with 1000 bootstrap replicates [76].
In order to evaluate the alternative hypotheses of phylogeny, ML trees were constructed using not only the combined sequences of LSC or SSC intermolecular rearrangement fragments (28,164 bp, 14,262 bp), but also those of combined ten most polymorphic protein-coding genes (11,028 bp, 8212 bp) and combined ten most polymorphic intergenic regions (3099 bp, 2374 bp) (Table S4, Fig. 8, S8) by IQ-TREE v.1.6.8 [39] employing the model-testing function to infer the best-fit substitution model for such six datasets under the Bayesian information criterion. To evaluate node reliability, we implemented a Shimodaira-Hasegawa-like approximate likelihood ratio test (SH-aLRT) [40], and branch support was assessed using Ultrafast Bootstrap Approximation (UFB) [41] using 1000 bootstrap replicates for each method. FigTree v.1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/) was used to visualize ML trees (Additional file 20: Figure S8). For phylogenetic analyses of oat and its diploid wild species, the plastome sequences of 25 published Avena species were downloaded from NCBI [24]. ML tree was constructed using above procedure (Additional file 20: Figure S10).