Implications for integration of transgenes
Chloroplast genetic engineering offers several advantages, including a high level of transgene expression, multi-gene engineering in a single transformation event, transgene containment via maternal inheritance or cytoplasmic male sterility, lack of gene silencing [9, 13], lack of position effect due to site-specific transgene integration, and lack of pleiotropic effects due to sub-cellular compartmentalization of transgene products [15–17]. Apart from expressing therapeutic agents, biopolymers, or transgenes to confer valuable agronomic traits, including herbicide resistance, disease resistance, insect resistance, drought tolerance, salt tolerance, and phytoremediation, chloroplast genetic engineering has been used to study chloroplast biogenesis and function, revealing the mechanisms of DNA replication origins, intron maturases, translation elements and proteolysis, import of proteins, and several other processes. Despite the potential of chloroplast genetic engineering, this technology has only recently been extended to the major crops, including soybean, carrot, lettuce, and cotton.
The availability of complete sequences of chloroplast genomes enhances their use for genetic engineering. In chloroplast transformation, finding appropriate intergenic spacer regions is very important for efficient integration of transgenes. In tomato and potato, researchers have used the trnfM-trnG, rbcL-accD, trnV-3'-rps12, and 16S rRNA-orf 70B intergenic spacer regions of tobacco to integrate transgenes [29–31]. Unfortunately, none of these regions has 100% sequence identity between species. For example, the intergenic spacer region between rbcL and accD of potato and tobacco shows only 94% sequence identity. Consequently, potato chloroplast transformants are generated at 10–30 times lower frequencies than tobacco. Similarly, the trnfM and trnG intergenic spacer region used for tomato chloroplast transformation has only 82% sequence identity with tobacco, resulting in inefficient transgene integration. Compared with the tobacco sequence used for transformation, the tomato chloroplast genome has major deletions in this intergenic spacer region. Therefore, the development of species-specific vectors for transgene integration would enable the use of any of the intergenic spacer regions within the respective chloroplast genomes. Moreover, genome organization differs among some species. For instance, the rbcL and accD genes are adjacent in tobacco and most other angiosperm chloroplast genomes, including Citrus. However, they are not adjacent in the soybean chloroplast genome because an inversion has altered gene order. These examples emphasize the importance of choosing appropriate intergenic spacer regions for chloroplast transformation.
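The sequence identity values quoted above can be illustrated with a minimal sketch. It assumes the two spacer sequences have already been aligned to equal length, and it counts a gap opposite a base as a mismatch; the source does not state how identity was calculated, so both the function name `percent_identity` and this gap treatment are assumptions made for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two pre-aligned sequences.

    Assumption: columns where both sequences have a gap are ignored,
    and a gap opposite a base counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # uninformative column, skip it
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no comparable alignment columns")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy sequences, not real spacer data: one gap and one substitution
    print(percent_identity("ACGT-ACGTA", "ACGTTACGAA"))
```

Under this convention, large deletions such as those reported for the tomato trnfM-trnG spacer lower the identity value directly, which is one reason a flanking sequence from another species may recombine less efficiently.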
Gene order of the Citrus genome is identical to the published genome sequences of the Solanaceae, which have the inferred ancestral angiosperm genome organization. The rps19 gene and the first 84 amino acids of rpl22, which generally are single copy in the LSC on the IRb side, have been duplicated in Citrus. Thus, there is a complete second copy of rps19 and a truncated copy of rpl22 adjacent to trnH. This duplication is likely due to an expansion of IRb at the LSC junction, a common process in chloroplast genomes. The gene content of Citrus is also very similar to most other angiosperm chloroplast genomes. However, infA, a gene coding for a translation initiation factor in other plant species, is absent in the Citrus genome, and rpl22 is apparently not functional due to a frameshift mutation. Millen et al. demonstrated at least 24 independent losses of infA in angiosperms, and in four lineages this gene has been shown to be transferred to the nucleus. Three of these losses are evident in our phylogeny based on cpDNA sequences (indicated by bars in Figs. 2, 3). Among the rosid genomes sequenced, the infA loss has occurred only once, and this change supports the basal split between Vitis and the rest of the rosids (Figs. 2, 3). The rpl22 gene in the IRb region has a nonsense mutation resulting in 9 stop codons, indicating that this gene is not functional. This was confirmed by PCR amplification and sequencing using primers that flank the IR/LSC boundaries. The rpl22 gene has been reported to be missing in legume chloroplast genomes, and the import of the nuclear-encoded protein has been demonstrated [32, 35]. Our group recently reported that rpl22 was also missing in the cotton chloroplast genome, but this proved to be an annotation error. The lack of a functional copy of rpl22 in Citrus should be investigated further, including an expanded sampling of members of the Rutaceae and Sapindales.
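The inference that a gene copy is non-functional because its reading frame contains premature stop codons can be sketched as follows. This is an illustrative check only, not the annotation procedure used here; the function name `internal_stop_positions` and the toy sequence are hypothetical. The stop codons TAA, TAG, and TGA are those of the standard code, which the plastid code shares.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stops shared by the standard and plastid codes


def internal_stop_positions(cds: str, frame: int = 0) -> list[int]:
    """Return 0-based codon indices of stop codons that occur before
    the final codon of the given reading frame.

    Trailing bases that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(frame, len(cds) - 2, 3)]
    # The last complete codon is expected to be the normal terminator,
    # so only the codons before it are inspected.
    return [idx for idx, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


if __name__ == "__main__":
    # Toy sequence, not the Citrus rpl22 sequence
    toy = "ATGAAATAGCCCTGAGGGTAA"
    print(internal_stop_positions(toy))
```

A frameshift typically shows up in such a scan as several premature stops downstream of the insertion or deletion, because the shifted frame is read as essentially random codons.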
Repeat analysis identified 29 direct and inverted repeats of 30 bp or longer with a sequence identity ≥ 90% in the Citrus chloroplast genome; the longest repeat, other than the IR, is 53 bp in length (Table 1). The presence of dispersed repeats in chloroplast genomes, especially in intergenic spacer regions, has been reported in a number of angiosperm lineages, including other rosids.
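The repeat criteria above (minimum length 30 bp, identity ≥ 90%, direct or inverted orientation) can be illustrated with a brute-force sketch. The source does not name the program used, so this is not the authors' method: it compares fixed-length windows by Hamming identity, reports window pairs rather than maximal repeats (one long repeat yields several overlapping hits), and is far too slow for a whole genome. All function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hamming_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def find_repeats(seq: str, min_len: int = 30, min_identity: float = 0.90):
    """Return (kind, i, j) tuples for window pairs meeting the threshold.

    kind is "direct" when the window at i resembles the window at j, and
    "inverted" when the reverse complement of the window at i resembles
    the window at j. Direct pairs must not overlap; i < j always holds.
    """
    seq = seq.upper()
    last_start = len(seq) - min_len
    hits = []
    for i in range(last_start + 1):
        window = seq[i:i + min_len]
        window_rc = reverse_complement(window)
        for j in range(i + 1, last_start + 1):
            other = seq[j:j + min_len]
            if j >= i + min_len and hamming_identity(window, other) >= min_identity:
                hits.append(("direct", i, j))
            if hamming_identity(window_rc, other) >= min_identity:
                hits.append(("inverted", i, j))
    return hits


if __name__ == "__main__":
    # Toy example with a short window so the output is easy to follow
    print(find_repeats("AACCGTTTTTAACCGT", min_len=6, min_identity=1.0))
```

Dedicated repeat-finding tools extend seed matches to maximal repeats and use indexed data structures; the sketch is only meant to make the length and identity thresholds concrete.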
Phylogenies based on 61 protein-coding genes (Figs. 2, 3) generally agree with several recent studies based on multiple genes or complete chloroplast genomes [37–39]. Areas of congruence that are strongly supported include the monophyly of monocots and their sister relationship to eudicots, monophyly of rosids and asterids, and the sister relationship between Caryophyllales (represented by Spinacia) and asterids.
Our chloroplast genome trees (Figs. 2, 3) indicate that the earliest diverging angiosperm lineage is either Amborella or Amborella + Nymphaeales. This incongruence between MP and ML trees was noted previously [37, 39]. This same incongruence was observed in a multigene phylogeny that includes nine genes from the chloroplast, mitochondrial and nuclear genomes. In this case, phylogenies for chloroplast genes supported the Amborella basal hypothesis, whereas mitochondrial genes supported Amborella + Nymphaeales as the earliest angiosperm lineage.
A second incongruence between MP and ML trees concerns the position of the magnoliid Calycanthus, although bootstrap support for the different relationships is weak (Figs. 2, 3). The MP tree places Calycanthus sister to eudicots, whereas the ML tree positions this taxon sister to a clade that includes both monocots and eudicots. This same incongruence was observed in previous phylogenetic analyses based on the 61 protein-coding chloroplast genes [37, 39]. The position of magnoliids continues to be controversial. Several molecular phylogenies have suggested different sets of relationships among magnoliids, monocots, and eudicots. Phylogenies based on phytochrome and 17 chloroplast genes placed magnoliids sister to monocots + eudicots, but bootstrap support was weak. Several studies supported monocots as the sister group of magnoliids + eudicots [43–45], but bootstrap support was again weak. Both matK and three-gene phylogenies suggested that eudicots are sister to magnoliids + monocots. Finally, the nine-gene phylogeny of Qiu et al. recovered all three of these sets of relationships depending on the phylogenetic methods (MP or ML) and the genes used, but support was very weak in each case. The different resolutions of relationships of magnoliids are greatly affected by taxon sampling and phylogenetic methodology. The effects of both of these factors have been discussed in several recent papers on the utility of whole chloroplast genomes for phylogenetic reconstruction of angiosperms [37, 39, 47–52]. Clearly, additional complete chloroplast genome sequences are needed to resolve the relationships among magnoliids, monocots, and eudicots.
A third incongruence between the MP and ML trees concerns the monophyly of the eurosid I clade (Figs. 2, 3). The MP tree (Fig. 2) strongly supports the monophyly of eurosid I (100% bootstrap), whereas in the ML tree the eurosid I clade is not monophyletic because Cucumis is strongly sister to the Myrtales instead of the Fabales. This same incongruence was detected in Jansen et al. and was attributed to limited taxon sampling and model misspecification in ML analyses, two phenomena that are known to have adverse effects on phylogenetic reconstruction [53–57]. Expanded taxon sampling of rosids is needed to critically evaluate the monophyly of the eurosid I clade, especially since there is only moderate support for monophyly of eurosid I in previous phylogenies based on a single or few genes [reviewed in 58].
Both MP and ML trees are congruent with regard to the phylogenetic placement of Citrus. The genus is positioned as a member of the eurosid II clade, which has very strong bootstrap support in both MP (98%) and ML (100%) trees (Fig. 2). The eurosid II clade, which currently includes the four groups Brassicales, Malvales, Sapindales, and Tapisciaceae, has received strong support in previous DNA sequence phylogenies based on one to three genes, although relationships among these groups remain unresolved. Previous phylogenies based on whole chloroplast genomes [36, 37, 39, 59] have included only one or two groups (Arabidopsis from the Brassicales and/or Gossypium from the Malvales). The addition of Citrus from the Sapindales expands the sampling to three of the four currently recognized groups of eurosid II. Both MP and ML trees (Figs. 2, 3) provide strong support (98–100% bootstrap) for a sister relationship between the Brassicales and Malvales. This same relationship was weakly supported in phylogenies using one or two chloroplast genes [46, 60]. In contrast, the three-gene phylogeny of Soltis et al. weakly supported a sister relationship between the Malvales and Sapindales. Although taxon sampling is still somewhat limited, our 61-gene phylogeny provides very strong support for a close relationship between the Brassicales and Malvales. Expanded taxon sampling of the eurosid II clade is needed to confirm these results.