Title: The complete chloroplast genome of Onobrychis gaubae (Fabaceae-Papilionoideae): comparative analysis with related IR-lacking clade species

Plastid genome sequences provide valuable markers for surveying the evolutionary relationships and population genetics of plant species. In the present study, the complete plastid genome of Onobrychis gaubae, endemic to Iran, was sequenced using Illumina paired-end sequencing and was compared with previously known genomes of the IRLC species of legumes. The O. gaubae plastid genome was 123,645 bp in length and included a large single-copy (LSC) region of 81,034 bp, a small single-copy (SSC) region of 13,788 bp and one copy of the inverted repeat (IR b ) of 28,823 bp. The genome encoded 110 genes, including 76 protein-coding genes, 30 transfer RNA (tRNA) genes and four ribosome RNA (rRNA) genes and possessed 89 simple sequence repeats (SSRs) and 28 repeated structures with the highest proportion in the LSC. Comparative analysis of the chloroplast genomes across IRLC revealed three hotspot genes (ycf1, ycf2, clpP) which could be used as molecular markers for resolving phylogenetic relationships and species identication. IRLC plastid genomes also showed multiple gene losses and inversions. Phylogenetic analyses revealed that O. gaubae is closely related to Hedysarum. The complete O. gaubae genome is a valuable resource for investigating evolution of Onobrychis species and can be used to identify related species.


Introduction
Chloroplast is a vital organelle in plant cells that has an important role in plant carbon xation and numerous metabolic pathways 1,2 . In angiosperms, the chloroplast genome (plastome) typically has a circular structure that ranges from 120 to 180 kb in length. Plastomes mostly exhibit a quadripartite structure in which a pair of inverted repeats (IRa and IRb; usually around 25 kb, but can vary from 7 to 88 kb each) separate the large single-copy (LSC, ca. 80 kb) and the small single-copy (SSC, ca. 20 kb) regions 1,2 . Most plastomes encodes 80 protein-coding genes primarily involved in photosynthesis and other biochemical processes along with 30 tRNA and 4 rRNA genes 3,4 . In contrast to mitochondrial and nuclear genomes, the plastomes across seed plants are highly conserved with respect to gene content, structure and organization 5,6 . However, mutations including duplication, rearrangements, and losses have been reported at the genome and gene levels among some angiosperm lineages, including Asteraceae 7 , Campanulaceae 8 , Onagraceae 9 , Fabaceae 10 and Geraniaceae 11 . Fabaceae (legumes) is the third largest family of angiosperms which shows the most extensive structural variation 12 . Currently accepted classi cation of the legumes based on plastid gene matK includes six subfamilies: Caesalpinioideae, Cercidoideae, Detarioideae, Dialioideae, Duparquetioideae, and Papilionoideae 13 . Gene content and gene order in plastomes of subfamilies are highly conserved and similar to the ancestral angiosperm genome organization except for Papilionoideae, which exhibits numerous rearrangements and gene/intron losses and have a smaller genome 5 . In this subfamily, a loss of one of the IR 14 , the presence of many repetitive sequences 15 and the presence of a localized hypermutable region 15,16 have been documented. The Papilionoideae is further divided into six major clades: the Genistoids, Dalbergioids, Mirbelioids, Millettioids, Robinioids and the inverted-repeat lacking clade (IRLC) 14 . IRLC is the largest legume lineage which contains over 4000 species in 52 genera and eight tribes 14,17−19 . Recently, with the advent of next generation sequencing (NGS) technology, plastomes of several taxa from different tribes in this clade have been sequenced. The majority of IRLC plastomes sequenced to date were restricted to the tribes Fabeae, Trifolieae and Caraganeae. Thus it is essential to investigate the members from other lineages to better understand plastome evolution within the IRLC, and more broadly within Papilionoideae. In the tribe Hedysareae 20 with nine genrea, the plastomes of some Hedysarum species and only one species of Onobrychis have been reported. In the present study, the complete plastome of O. gaubae Bornm. belonging to Hedysareae was sequenced (GenBank accession number: ???). Onobrychis has more than 130 species and is the second largest genus after Hedysarum and mainly found throughout temperate and subtropical regions of Eurasia, N and NE Africa 21 . Onobrychis gaubae is a polymorphic species restricted to the southern slopes of Alborz mountain range, The main goal of this study is to assemble the chloroplast genome of O. gaubae, and to annotate the genome and characterize its structure to provide new genomic resource of this species. We also performed comparative analyses of the genome and phylogenetic reconstruction to evaluate the sequence divergence in the plastomes across the IR-lacking clade.

Results And Discussion
General features of the O. gaubae plastid genome It was previously reported that the plastomes of Papilionoideae, particularly IR-loss clade, are not conserved in their genomic structure in terms of gene order and gene content and exhibit numerous rearrangements and gene/intron losses 5,24,25 . The plastid genome of O. gaubae with 123,645 bp in length and having only one copy of the IR region, which is in accordance with the reports and its genome structure is similar to those of other IRLC species. In this context, the lack of rps16 and rpl22 genes and intron 1 of clpP in the plastome of O. gaubae should be noted; these genes, are absent from the chloroplast genomes of entire IRLC 24,26 . The assembled chloroplast genome of O. gaubae contained 110 genes, including 76 protein-coding genes, 30 transfer RNA (tRNA) genes and four ribosome RNA (rRNA) genes ( Fig. 1, Table 1). The LSC (81,066 bp), SSC (13,777 bp) and IR (28,802 Table S1). The rps12 gene is a transsplicing gene which does not have introns in the 3'-end. The trnK-UUU has the largest intron encompassing the matK gene, with 2,495 bp, whereas the intron of trnL-UAA is the smallest (542 bp). The overall GC content of the O. gaubae chloroplast genome sequence was 35.1%, which is consistent with other IRLC species, whose plastomes have GC-contents ranging from 33.6-35.1% (Table 2). Different GC content occurs in the LSC (34.4%), SSC (30.8%) and IR (39.2%) regions (Supplementary Table S2). Higher GC content was usually detected in the IRs compared to the other regions of plastome, which is mainly due to the presence of rRNA genes (rrn23, rrn16, rrn5, rrn4.5) with high GC content (50%-56.4%) in IRs 6,27 .
The GC content of the protein-coding regions of O. gaubae chloroplast genome comprised 36.02%. Within these regions, the GC contents for the rst, second and third positions of the codons were 42.4%, 36.1% and 29.5%, respectively. Table 1 Genes predicted in the chloroplast genome of O. gaubae. The number of asterisks after the gene names indicates the number of introns contained in the genes.

Category of genes
Group of genes Name of genes   Table S3). Most protein-coding genes employ the standard ATG as the initiator codon. Among the O. gaubae protein-coding genes, three genes used alternative start codons; ACG for psbL, GTG for rps8 and ACG for ndhD.
The chloroplast genomes of the IRLC were analyzed for their codon usage frequency according to sequences of protein-coding genes and relative synonymous codon usage (RSCU). RSCU is an important indicator to measure codon usage bias in coding regions. This value is the ratio between the actual observed values of the codon and the theoretical expectations. If RSCU = 1, codon usage is unbiased; if RSCU > 1, speci c codon frequency is higher than other synonymous codons, otherwise, the frequency is low 28,29 . The total number of codons among protein-coding genes in the IRLC species varies from 20,381 codons in Hedysarum taipeicum (as the smallest number) to 24,765 codons in O. gaubae. The most often used synonymous codon was AUU, encoding isoleucine, and the least used was CGC/CGG, encoding arginine (Supplementary Table S4). In the IRLC, the standard AUG codon was usually the start codon for the majority of protein-coding genes and UAA was the most frequent stop codon among three stop codons. Methionine (AUG) and tryptophan (UGG) showed RSCU = 1, indicating no codon bias for these two amino acids. The highest RSCU value was for UUA (~ 2.04) in leucine and the lowest was GGC (~ 0.35) in glycine. Leucine preferred six codon types (UUA, UUG, CUU, CUC, CUA, and CUG) and actually showed A or T (U) bias in all synonymous codons (Supplementary Table S4). The result of distributions of codon usage in the IRLC species showed that RSCU > 1 was recorded for most codons that ended with an A or a U, except for UUG codon, resulting in the bias for A/T bases. As well as, more codons with the RSCU value less than one, ended with base C or G. So, there is high A/U preference in the third codon of the IR-loss clade coding regions, which is a common phenomenon in cp genomes of higher plants 30 Table S5). Among the 28 repeats, 75% are located in the LSC region, 14.28% in the SSC region and 10.71% in the IR region. Also, most of the repeats (60.71%) were found in the intergenic spacer regions (IGS), 25% were distributed in coding region (psaA, psaI, psbJ, trnR-UCU, trnS-UGA) and 14.28% were located in the introns (ndhA and ycf3). In the majority of the studied IRLC species, the most frequently observed repeats were forward, then palindromic, and the least was the reverse (Fig. 2). The forward type was the most abundant repeat with length in the range 30-50 bp in all the IRLC species. The longest repeats were also of the forward type, with length of 560 bp were detected in the Hedysarum taipeicum, followed by Vicia sativa of 517 bp and Caragana microphylla of 455 bp, which were much longer than other species studied. Furthermore, in the IRLC, repeat sequences involved in genome rearrangement, were mainly distributed in non-coding regions (IGS). Repeat structures induce indels and substitutions resulting in the mutation hotspot in the recon guration of genome 6 ; therefore, these repeats can provide valuable information for phylogenetic and population studies 29 .
Simple sequence repeats (SSRs), or microsatellites, are a type of tandem repeat sequences which contain 1-6 nucleotide repeat units and have wide distribution throughout the genome 29,31 . Accordingly, microsatellites play a crucial role in the genome recombination and rearrangement. These nucleotide motifs show a high level of polymorphism that can be widely used for phylogenetic analysis, population genetics and species authentication 29,32 . A total of 89 SSRs were detected in the O. gaubae plastome, which were composed by a length of at least 10 bp. Among them, 53 (59.55%) were mono-repeats, 21 (23.55%) were di-repeats, 10 (11.23%) were tri-repeats and ve (5.61%) were tetra-repeats. No penta-and hexanucleotide SSRs existed in O. gaubae genome (Supplementary Table S6). These SSR loci were located primarily in the LSC region (68.53%), followed by the SSC region (16.85%) and IR (14.6%) (Fig. 3A). In the mononucleotide repeats, A/T motifs were the most abundant but no G/C motif was detected in the cp genome. Likewise, the majority of the dinucleotides and trinucleotides were found to be particularly rich in AT sequences. Therefore, the AT richness in the SSRs of the chloroplast genome of O.
gaubae is consistent with the results of previous studies 27,29 which have shown that in the cp genome, SSRs generally composed of polythymine (poly T) or polyadenine (poly A) repeats 33 . The number of SSRs in the cp genomes (cpSSRs) ranged from 68 (Vicia sativa and Lens culinaris) to 151 (Melilotus albus) across the IRLC species (Fig. 3B). The mononucleotide repeats (P1) were identi ed at a much higher frequency, which varied from 45 (Tibetia liangshanensis, Glycyrrhiza glabra) to 93 (Melilotus albus). In all cases, the P1s were AT-rich (Fig. 3C). Strong AT bias in SSR loci was also observed in other legumes such as Vigna radiate 34 , Arachis hypogaea 35 and Stryphnodendron adstringens 27 which, like other plastomes of species, may contribute to the bias in base composition 6 . The results showed that SSR loci of LSC regions appeared more frequently than in SSC or IR regions, which may be hypothesized that this phenomenon is relevant to the lack of one IR region in IR-loss clade. In general, cpSSRs show abundant variation and might provide useful information for detecting intra-and interspeci c polymorphisms at the population level 31,33 .

Divergent hotspots in the IRLC chloroplast genomes
The average nucleotide diversity (Pi) among the protein-coding genes of 19 species of the IRLC was estimated to be 0.05736. Furthermore, comparison of nucleotide diversity in the LSC, SSC and IR regions indicated that the IR region exhibits the highest nucleotide diversity (0.11549) and the SSC region shows the least (0.04132). We detected three hyper-variable regions with Pi values > 0.1 among the IRLC species; ycf1 and ycf2 from IR region and clpP from LSC region (Fig. 4). These genes might be undergoing rapid nucleotide substitution in IR-loss clade at the genus and species levels. Among these, ycf1 encoding a protein of 1800 amino acids has the highest nucleotide diversity (0.18745). ycf1 gene is more variable than matK gene and it can be useful for molecular systematics at low taxonomic levels 36,37 . Numerous studies 14,18,38 analyzed the phylogenetic reconstructions of IRLC species at various taxonomic levels based on different fragments of plastid coding genes such as matK, ndhF and rbcL, the nuclear ribosomal ITS and the combined sequences of these genes/spacers. We could use the highly variable regions acquired from this study to develop the potential phylogenetic markers which can be useful for species authentication and reconstruction of phylogeny within different tribes/genera of IR-loss clade in further studies.
The non-synonymous (Ka) to synonymous (Ks) rate ratio (Ka/Ks) was estimated for 75 protein-coding genes across the 28 IRLC species analyzed (Supplementary Table S7). In general, the Ka/Ks values were lower than 0.5 for almost all genes. The ycf4 gene which is involved in regulating the assembly of the photosystem I complex had the highest nonsynonymous rate, 0.165691, while the ycf1 gene with unknown functions had the highest synonymous rate, 0.181067. The Ka/Ks ratio (denoted as ω) is widely used as an estimator of selective pressure for protein-coding genes. An ω > 1 indicates that the gene is affected by positive selection, ω < 1 indicates purifying (negative) selection, and ω close to 1 indicates neutral mutation 39 . In present study, the Ka/Ks ratio was calculated to be 0 for psbL gene which encodes one of the subunits of photosystem II. The Ka/Ks ratio indicates purifying selection in 73 protein-coding genes. The highest Ka/Ks ratio which indicates positive selection was observed in accD gene which encodes a subunit of the acetyl-CoA carboxylase (ACCase) enzyme. Some studies have investigated whether selective pressure is acting on a particular protein-coding gene in different genera/tribes of IR-loss clade. For instance, tests for positive selection suggested that Lathyrus, Pisum and Vavilovia, all belonging tribe Fabeae, have undergone adaptive evolution in the ycf4 gene 15,16 .
Legumes chloroplast genome, and in particular IRLC, have regions with high mutation rates, including rps16-accD-psaI-ycf4-cemA region. rps16 gene was lost from cpDNA in the common ancestor of the IRloss clade 15 . accD coding region was completely absent in the Trifolium subgenus Trifolium and has nuclear copies in Medicago truncatula and Cicer arietinum 25 . Three consecutive genes psaI-ycf4-cemA is situated in a local mutation hotspot and has been lost in some species of Lathyrus 15,16 . Comparative analysis of genome structure We compared whole chloroplast genome sequences of different taxa of IRLC to analyze gene order and content. We found that, similar to other plant species, the gene coding regions were more conserved than the noncoding regions (Fig. 5). High nucleotide variations were observed across IRLC for the proteincoding regions ycf1, ycf2 and clpP. Similar results were also obtained from calculation of nucleotide diversity (Pi). Papilionoideae, in particular the IRLC, displays structural variations which provide informative characters to increase phylogenetic resolution and make the taxon an excellent model for genome evolution studies 5,25 . The plastomes of several members of the IRLC have regions with signi cant variation and rearrangement and accelerated mutation rates, including loss of introns from rps12 and clpP genes 24 , absence of rps16 gene 26 and transfer/loss of rpl22 to the nucleus 24 . Numerous studies have also shown some other rearrangements in some IRLC taxa, such as loss of accD gene in six species of Trifolium 10,25 , loss of rpl23 and rpl33 genes in some species of Lathyrus, Pisum and Vicia 32 and loss of ycf4 gene in some species of Lathyrus and Pisum 15,16 . As revealed in other studies, there are several reasons for occurrence of rearrangements in the plastome, such as the lack of one IR region, variable IR region size and many tandemly repeated sequences 40 . For example, the loss of the rps16 gene was probably due to the presence of a nuclear rps16 copy, which contributed to pseudogenization of the plastid copy 41 . Likewise, the lack or expansion of the accD gene was explained by the presence of tandemly repeated sequences 6,15 .
Plastid RNA editing prediction RNA editing is one of the post-transcriptional events which converts cytidine (C) to uridine (U) or U to C at speci c sites of RNA molecules and modi es the genetic information from the genome in the plastids and mitochondria of land plants. RNA editing serves as a mechanism to correct missense mutations of genes by inserting, deleting and modifying nucleotides in a transcript 42 . RNA editing sites of O. gaubae plastid genes were predicted using Prep-CP prediction tool (Supplementary Table S8). In total, 58 editing sites were present in 19 chloroplast protein-coding genes and all of the editing sites were C-to-U conversions (Supplementary Table S8). Among them, nine editing sites, the highest number, were found in the region encoding ndhB gene followed by seven editing sites in petB (Fig. 6). There were six editing sites detected each in ndhA and rpoB genes. accD, ndhG and petD had three editing sites, and ndhD and ndhF had two editing sites. Two editing sites were also found in ccsA, matK and rpoC1 genes. The remaining seven genes had only one editing site. The results showed that ndh genes exhibited the most abundant editing sites which were nearly 39.6% of the total editing sites. In owering plants, the highest number of plastid editing sites was found in the ndh group genes 42 . Moreover, the ndh genes encoding for a thylakoid Ndh complex, have been lost or pseudogenized in different species of algae, bryophytes, pteridophytes, gymnosperms, monocots, eudicots, magnoliids, and protists [43][44][45] . The RNA editing is probably important for the NDH protein complex function and may also lead to improved photosynthesis and display positive selection during evolution 42 .

Phylogenetic relationships
Phylogenetic relationships within the IRLC were reconstructed using the representative taxa (27 species from different tribes) and two species as outgroup based on 75 protein-coding genes of their chloroplast genomes. The total concatenated alignment length from the 75 protein-coding genes was 87,455 bp. The reconstructed phylogeny is in agreement with previous studies 5,14,25,46 indicated that IRLC was monophyletic and consisted of several clades. As shown in the previous studies, Glycyrrhiza + Meristotropis were monophyletic, along with the tribe Wisterieae was sister to the rest of the IRLC 19,47 .
Then, there are two major clades: clade I and II (Fig. 7).  46 , show that Oxytropis is sister to the tribe Coluteae. In clade II, tribe Cicereae is the basal branch and formed a sister group relationship with the paraphyletic Trifolieae and the monophyletic tribe Fabeae. The results of the present study suggest that there is no con ict between the phylogeny made by whole cp genome and that inferred by individual gene datasets. Therefore, a phylogenetic reconstruction for IR-loss clade species studied here showed that plastid genome database will be a helpful resource for molecular phylogeny at the higher taxonomic level (generic to tribal rank).

Materials And Methods
Chloroplast DNA extraction and sequencing  57 with a cutoff value of 0.8.

Phylogenetic reconstruction
Seventy-ve protein-coding genes were recorded from 27 species within IRLC, as well as from two outgroups (Robinia pseudoacacia L. and Lotus japonicus (Regel) K.Larsen). All genes sequences were obtained from GenBank (Supplementary Table S9). The concatenated data were analyzed using maximum likelihood and Bayesian inference methodologies. Prior to maximum likelihood and Bayesian analyses, a general time reversible and gamma distribution (GTR + G) model was selected using the MrModeltest2.2 58 under the Akaike Information Criteria (AIC) 59 . Maximum likelihood analyses were performed using the online phylogenetic software W-IQ-TREE 60 available at http://iqtree.cibiv.univie.ac.at.
Nodes supports were calculated via rapid bootstrap analyses with 5000 replicates. Bayesian inference was performed using MrBayes v.3.2 in the CIPRES 54 with the following settings: Markov chain Monte Carlo simulations for 5,000,000 generations with four incrementally heated chains, starting from random trees and sampling one out of every 1,000 generations. The rst 25% of the trees were regarded as burnins. The remaining trees were used to construct a 50% majority-rule consensus tree and to estimate posterior probabilities. Posterior probabilities (PP) > 0.95 were considered as signi cant support for a clade.

Conclusions
In this study, the complete plastome sequence of O. gaubae (123,645 bp) was determined. The gene contents and gene orientation of O. gaubae plastome are similar to those found in the plastid genome of other IRLC species. Comparison of plastomes across IRLC showed that the coding regions are more conserved than non-coding regions and IR is more conserved than LSC and SSC regions. The present study also analyzed genetic information in the IRLC plastomes including the distribution and location of repeat sequences and SSRs, codon usage, RNA editing prediction, hotspot regions and phylogenomic analysis. Moreover, we identi ed three hotspot genes (ycf1, ycf2, clpP) which provided su cient genetic information for species identi cation and phylogenetic reconstruction of the IRLC species. Finally, the data obtained from this study could provide a useful resource for further research on tribe Hedysareae and also IR-loss clade at the genomic scale.

Declarations
Competing Interests: