Targeted enrichment of novel chloroplast-based probes reveals a large-scale phylogeny of 412 bamboos

Background The subfamily Bambusoideae belongs to the grass family Poaceae and has significant roles in culture, economy, and ecology. However, the phylogenetic relationships based on large-scale chloroplast genomes (CpGenomes) were elusive. Moreover, most of the chloroplast DNA sequencing methods cannot meet the requirements of large-scale CpGenome sequencing, which greatly limits and impedes the in-depth research of plant genetics and evolution. Results To develop a set of bamboo probes, we used 99 high-quality CpGenomes with 6 bamboo CpGenomes as representative species for the probe design, and assembled 15 M unique sequences as the final pan-chloroplast genome. A total of 180,519 probes for chloroplast DNA fragments were designed and synthesized by a novel hybridization-based targeted enrichment approach. Another 468 CpGenomes were selected as test data to verify the quality of the newly synthesized probes and the efficiency of the probes for chloroplast capture. We then successfully applied the probes to synthesize, enrich, and assemble 358 non-redundant CpGenomes of woody bamboo in China. Evaluation analysis showed the probes may be applicable to chloroplasts in Magnoliales, Pinales, Poales et al. Moreover, we reconstructed a phylogenetic tree of 412 bamboos (358 in-house and 54 published), supporting a non-monophyletic lineage of the genus Phyllostachys. Additionally, we shared our data by uploading a dataset of bamboo CpGenome into CNGB (https://db.cngb.org/search/project/CNP0000502/) to enrich resources and promote the development of bamboo phylogenetics. Conclusions The development of the CpGenome enrichment pipeline and its performance on bamboos recommended an inexpensive, high-throughput, time-saving and efficient CpGenome sequencing strategy, which can be applied to facilitate the phylogenetics analysis of most green plants. Supplementary Information The online version contains supplementary material available at 10.1186/s12870-020-02779-5.


Background
The subfamily Bambusoideae belongs to the grass family Poaceae and exhibits substantial phenotypic diversity, with 1,642 species in 125 genera, three tribes, and 15 subtribes, which have been classi ed into ~ 75 clades 1

. The
Bambuseae consists of tropical woody bamboos (Bambuseae), temperate woody bamboos (Arundinarieae) and herbaceous bamboo tribe (Olyreae). Bambusoideae predominantly distributed in the Old World, such as China, Japan, Thailand, Indonesia, and the countries of Southeast Asian. As one of the most ecologically and industrially valuable tribes of Bambusoideae, woody bamboos were used for furniture, paper, ber textiles, and fuel 2 . In total, about 500 bamboos are distributed in Asia, spanning a wide geographic and temperature range. However, infrequent, incongruent, and unpredictable owering events as well as unstable vegetative characteristics, severely restricted the identi cation and classi cation of woody bamboos. The phylogenetic relationships based on more massive amounts of woody bamboos remain elusive due to the lack of extensive and high-quality genomic resources.
The chloroplast genome (CpGenome) is an essential resource for the study of plant evolution 3 . This high-copy organelle is one of the most technically accessible regions of the genome. The chloroplast genomic DNA of green plants commonly exhibits a conserved genome structure that contains two copies of inverted repeat (IR) separating the small single-copy region (SSC) and the large single-copy region (LSC) 2,4,5 . The CpGenome has been a popular source of reconstructing the phylogeny of green plants, and many chloroplast DNA loci are contributing to the development of plant taxonomy. To obtain chloroplast DNA suitable for whole chloroplast genome sequencing, it can be traditionally enriched by using the sucrose gradient centrifugation method 6 , the high salt method 7 , long PCR technology by using primers 8 . The characters of the strategies above are the use of physical methods to extract chloroplast DNA or the need for high quality, su ciently extracted cellar DNA and the appropriate primers. With the development of sequencing technology, next-generation sequencing (NGS) has the advantageous characteristics of high-throughput and e cient, resulting in a rapid increase in the amount of sequencing data. Chloroplast DNA generally accounts for only about 0.5-13% of the whole genome 9 . But, the chloroplast DNA sequencing data from the whole genome sequencing (WGS) data produced a lot of "useless" data except for "useful" ones, consuming much of the sequencing capacity and reducing the e ciency of parallelly chloroplast sequencing. The above methods for obtaining chloroplast DNA sequencing data cannot meet the needs of large-scale CpGenome sequencing, which signi cantly restricts and hinders the in-depth research of plant genetics and evolution.
In this study, the main goals were: (1) To develop and evaluate a pipeline to target-enrich and assembly the chloroplast data of bamboos. (2) To obtain high-quality and high coverage of bamboo CpGenomes by the pipeline, to reconstruct a phylogenetic tree, and to promote phylogenetic knowledge of bamboo. (3) To share the new sequenced bamboo CpGenomes, allowing researchers to quickly compare suspect chloroplast data and explore the bamboo CpGenomes.

Species selection for probe design and evaluation
To improve the variability and versatility of the probes, we selected 567 representative species from the 3,654 published CpGenomes species (collected from NCBI, Released Dec 2018) to design and evaluate probes for a targeted enrichment strategy of CpGenomes (Supplementary Table S1 and S2). Among the 567 species, 22 are bamboo species. For data preprocessing, we elucidated our approach in a ow chart (Supplementary Figure S1). A phylogenetic tree (Supplementary Figure S2) was constructed based on the 567 complete CpGenomes, which spanned the phylogenetic diversity of 7 major clades, including 40 orders and 57 families. The model species in each clade were selected as core candidates. Thus, a total of 99 CpGenomes, including 6 bamboo CpGenomes, were chosen as the representative species for the probe design (Table 1), and the remaining (468 CpGenomes) were chosen as test data further to assess the e ciency of the probes for chloroplast capture. The species for probe design and the species for probe evaluation were different genera but belong to the same family (e.g., Danthonia and Chionochloa, both are Poaceae).

Construction of non-redundant chloroplast reference
Using the CpGenome of Arabidopsis thaliana as the initial reference sequence (as a database sequence), other selected CpGenomes (as query sequences) were aligned to the database sequence by BLAST + v2.2.25 software with default parameters. The sequences with more than 90% identity were masked from the query sequences. Then, the resulting sequences were subjected to a secondary round masking of redundant sequences, which were identi ed by an all-against-all BLAST+. Finally, a non-redundant chloroplast reference, as a pan-chloroplast genome (pan-CpGenome), was obtained by iterative analysis. Sequences with high similarity ( > = 90%) were masked with "Ns", and others were highly divergent sequences in the pan-CpGenome (Supplementary File F1). The visualization of the alignment of 98 CpGenomes to Arabidopsis thaliana CpGenome was conducted by BLAST Ring Image Generator (BRIG V0.9) 10 with default parameters.

Universal probes designed for bamboo CpGenomes
The regions of the pan-CpGenome sequences which have not been masked to "Ns" were extended by 40 bp on both sides for the design of the probes. Each region was divided into K-mers of 90 bp in length and the melting temperatures of the K-mers were calculated 11 . A comprehensive score of uniqueness, frequency, melting temperature, and GC content was calculated for each probe by Primer3 v2.4.0 12 . The probes with the highest comprehensiveness scores were selected in 20 bp window and slid along the target region at the xed interval. For ensuring high coverages of the probe sequences in the target region, the target region was covered at least 2 times by these selected probes. Finally, a total of 180,519 DNA oligonucleotides were synthesized by a CustomArray B3 Synthesizer (CustomArray, Washington, DC, USA) according to the manufacturer's instructions and dissolved in 10 × TE buffer (pH = 8.0).

Taxa sampling
All sampled species covering more than 30 genera (Supplementary Table S4)  DNA extraction and target enrichment sequencing for bamboos A total of 358 woody bamboo samples were sampled and sequenced in this study (Supplementary Table S4), as a practical application of target enrichment sequencing and an evaluation of the capture e ciency. Genomic DNA from each sample was extracted using the CTAB method 13 and fragmented to a peak size of 200 bp using a Covaris E220 sonicator (Covaris, Woburn, Massachusetts, USA), followed by the end-repair, addition of base "A", and adapter ligation. DNA fragments of the desired size (200 bp) were selected on an agarose gel and hybridized to the probes for 72 h. The probes captured DNA fragments were recycled by magnetic beads coated with streptavidin, which interacted with the biotin on the probes to wash away the uncaptured DNA fragments.
The captured DNA fragments were sequenced on the BGISEQ-500 platform at Beijing Genomics Institute, Shenzhen, China. High-quality reads ranging from 1 Gb to 9 Gb with 100 bp paired-end were acquired for each sample. For data preprocessing, we illuminated our method in a ow chart (Supplementary Figure S1). SOAP lter (v2.2) 14 was applied to remove low-quality reads and adaptors in the following criteria (1)

Development of universal chloroplast probes for bamboos
From the 3,654 CpGenomes collected from NCBI, 567 high-quality CpGenomes were selected for probe development and divided into two datasets, with 99 CpGenomes for probe design and 468 CpGenomes for probe evaluation.
Considering the applicability and robustness of the probes designing for bamboos, and the diversity of CpGenomes, the 99 CpGenomes were selected from different families. Details of the related methods were provided in Supplementary Figure S1.  Figure S3). To assess the broad spectrum of the probes, the BLAST + program was employed to align the probes to the 468 complete CpGenomes for evaluating the probes. The average coverage ratio in the 468 complete CpGenomes was 90.54% (Supplementary Table S8). In bamboos, the coverage ratio was all over 93.00%, with an average coverage of 94.78% ( Fig. 2B and Supplementary   Table S8). Moreover, some orders such as Magnoliales, Pinales, Poales also had high coverage.

Probe-based targeted enrichment and assembly of bamboo CpGenomes
A total of 358 fresh woody bamboo samples collected from China were included (Supplementary Table S4) and used to evaluate capture e ciency. A total of 1G-9G raw reads were obtained, and low-quality reads and adaptors were ltered in data preprocessing ( Fig. 2C  transfer RNAs, exhibiting a higher degree of conservation. We detected 15 overlapped bamboo CpGenomes that were present in both the in-house and published data (Fig. 2D). To assess the target enrichment, we mapped the raw reads to the corresponding CpGenome released previously and compared assembled bamboo CpGenome to corresponding released ones. The results showed more than 45.77% in average of the raw reads from in-house bamboo CpGenomes can be mapped to the corresponding published CpGenomes, and the mapping depth was higher than 1200×. Alignment with the published CpGenomes, the coverage of assembled CpGenomes was greater than 98.59% ( Fig. 2D and Supplementary Table S10).

A phylogenomic relationship based on 412 bamboo CpGenomes
For comprehensively collecting bamboo CpGenomes, 71 bamboo CpGenomes from NCBI were acquired, resulting in a total of 412 non-redundant bamboo CpGenomes after removing redundancy (Supplementary Table S6). We reconstructed a phylogenetic tree of bamboos based on the concatenated sequences of 75 protein-coding genes in the 412 bamboo CpGenomes. Phylogenetic analyses supported the relationship of (Arthrostylidiinae (Bambusinae, Olyreae)). We classi ed different clades in the phylogenetic tree based on previous studies 19,20 . The pattern of (XI((VIII, IV)VI)((IX, III)(VII, V))) was provided in Arthrostylidiinae (Supplementary Figure S4). Most of the newly sequenced species distributed in Clade V, Clade VI, and Clade Paleotropical. Clade XI (Ampelocalamus calcareus) was the earliest diverging Arthrostylidiinae species. The Phyllostachys was a representative genus in bamboo, with the clade embed into Clade V, which was the sister clade of Bashania fargesii. There are some non-Phyllostachys species were found in Phyllostachys genus clade. The Phyllostachys genus clade was divided into two groups based on the phylogenetic tree. Phyllostachys edulis, the most planted bamboo in China, distributed in Phy-II (Fig. 3). The sequences from NCBI clustered with corresponding in-house sequences. For example, Phyllostachys edulis sequence from NCBI clustered with in-house sequences of Phyllostachys edulis f epruinosa, Phyllostachys edulis f exaurita, Phyllostachys edulis f exuosa, et al.

China Bamboo Database in CNGB
We uploaded the bamboo CpGenomes sequenced in this study to CNGB to facilitate the accumulation of knowledge on bamboo phylogeny. Researchers can download the raw data and assembled CpGenome sequences from CNGB through Project ID: CNP0000502 (https://db.cngb.org/search/project/CNP0000502/). Moreover, researchers can search for all assembled bamboo plastid genomes in this study through web-based BLAST + service (https://db.cngb.org/blast/). The available plastid genome sequences of bamboos and the corresponding BLAST + server can promote researchers to explore the complex and elusive history of bamboo evolution.

Discussion
CpGenome provides an essential resource for plant evolution Hybridization-based probes for target enrichment in large-scale CpGenome sequencing Chloroplast DNA can be traditionally acquired by the sucrose gradient centrifugation method 6 or the high salt method 7 . Another method was to amplify the entire chloroplast DNA from the whole cellular DNA base on a long PCR technology by primers, which were designed on conserved sequences 8 . These methods were not suitable for large-scale samples due to the large amount of labor and material resources required to obtain chloroplast DNA, and the labor-intensive method used to prepare chloroplast DNA. Chloroplast reads also can be identi ed from WGS reads by aligning the WGS data with the reference CpGenome. It is a demanding bioinformatics technique and requires a closely related reference CpGenome. The method was not suitable for the species that are not closely related or have poor quality reference genome sequences. Moreover, to assemble only CpGenome based on this method, a great deal of useless sequencing data was thus generated, consuming much of the sequencing capacity and reducing the e ciency of parallelly chloroplast sequencing, since the chloroplast DNA sequencing data represents only a small fraction of WGS. Therefore, most existing methods for obtaining DNA and sequencing data suitable for whole CpGenomes cannot meet the needs of large-scale CpGenome sequencing, greatly limiting and hindering the in-depth research of plant genetics and evolution.
Target enrichment before sequencing is a useful method that allow for in-depth analysis of speci c portions of the genome. Moreover, a group of universal probes covering whole CpGenome in a tribe species can make target enrichment strategy exert it's advantages. Large scale CpGenomes target enrichment by universal probes can provide cost-effective, high density, and high coverage. In pan-CpGenome construction, unique sequences were selected, and the nal pan-CpGenome size was ~ 15 Mb. A total of 180,519 probes were designed and synthesized using a new hybridization-based approach to enrich chloroplast DNA fragments. Evaluation of the quality of the probes and pan-CpGenome showed a high mapping ratio, which was stable and e cient in bamboo CpGenomes. Besides bamboos, the ampli ed plant CpGenomes expanded variational sequences and universality of the probes in the pan-genome construction step. Thus, the probes also had high mapping rates in some orders, such as Malvales, Rosales, Pinales and Poales, et al, and indicated the applicability of the probes in these clades. Conversely, lower mapping rates were found in Nymphaeales, Solanales, Schizaeales, Lamiales, et al, which may due to inadequate and poor corresponding CpGenomes materials in pan-Genome constructing. It can be solved by amplifying corresponding CpGenomes to expand divergent sequences in pan-CpGenome or decreasing parameter restriction. Comparing of the assembled CpGenome with its published counterparts demonstrated a mapping coverage of over 98%, further con rming the e ciency of the probes in enriching chloroplast DNA fragments. In general, this pipeline of pan-CpGenome construction, pan-CpGenome-based probes design, and CpGenome enrichment showed its performance in bamboo CpGenomes and recommended a strategy of large-scale CpGenomes acquiring to green plants.
Bamboo CpGenomes could provide additional information on large-scale phylogenetic relationships There are more than 500 bamboo species in China, which play signi cant roles in economy, ecology, culture, aesthetics, and technology 33,34 . Bambusoideae is one of three subfamilies in Poaceae known as the BEP clade 35 .
Bamboo remains one of the most challenging groups for plant taxonomists and eld botanists 36 , due to infrequent, incongruent, unpredictable owering events, and diversity vegetative characters, which may result from frequent hybridization occurred in bamboos 36,37 . The Phyllostachys genus, with 59 species, is the most economically important among bamboos 44-46 . Phyllostachys edulis is the most signi cant Phyllostachys species, accounting for ∼73.8% bamboo-growing regions in China (4.43 million ha), and is the most abundant non-wood resource 34 . This study included 102 Phyllostachys CpGenome sequences, covering more than 90% Phyllostachys species, and provides an unprecedented opportunity to expand taxonomic knowledge of Phyllostachys genus. Traditionally, Phyllostachys genus can be divided into two groups, P. sect. Phyllostachys and P. sect. Heteroclada, based on morphological features such as in orescences and rhizomes et al 47,48 . But there is a controversy in this classi cation due to some in-between morphological features of two groups 44,47 . Compared to the traditional taxonomy, the species tree we constructed exhibited different phylogenetic relationships in P. sect. Phyllostachys and P. sect. Heteroclada, speci cally the two groups of species intermixed in the species tree. Incongruence between morphological taxonomy and the phylogenetic tree may be due to complex evolutionary processes or taxonomic treatments. with previous results based on non-genome wide nuclear sequences or morphological features 44,47,48 . The classi cation should be treated carefully because of the evolutionary complexity of bamboos. Moreover, The incongruence between plastid and nuclear gene phylogenies in Arundinarieae was found in the previous study 19 .
Though the species tree we constructed supports more than 90% species coverage of Phyllostachys, the taxonomy of Phyllostachys clade should be further tested within the phylogenies based on genome-wide nuclear genes.

Conclusions
A practical and large-scale approach to CpGenome acquisition will promote plant genetics and phylogenetics. We recommend a universal probe-based CpGenome enrichment pipeline, which successfully applied to bamboo CpGenomes, and 358 woody bamboo CpGenomes were acquired. Moreover, the universal probes we designed for bamboo exhibited a broad spectrum, which may also be applicable in Magnoliales, Pinales, Poales et al. We also reconstructed a phylogenetic tree of bamboos in China based on CpGenomes which supported the non-monophyly of the genus Phyllostachys. For promoting evolution, phylogenetic and population studies, we uploaded the sequences to CNGB to provide a BLAST + server. For further research, we will explore many divergent hotspot regions associated with repeat sequences of LSC regions, such as tRNA clusters, which can be used as genetic markers for phylogenetic studies.   Evaluation of mapping and coverage of the probes compared to the in-house and released bamboo CpGenomes.
The mapping ratio represents the proportion of reads obtained by the probes aligned with the released bamboo CpGenomes. Mapping coverage represents the proportion of the assembled CpGenomes based on the probes aligned with the released bamboo CpGenomes.

Figure 3
A species tree of Phyllostachys clade based on 75 chloroplast genes. The species tree divided into 2 parts, which labeled with different background colors. Numbers at the node indicated the bootstrap values and bootstrap values lower than 80 were concealed. The red, purple, grey, and blue blocks in the tree represented P. sect. Heteroclada species, P. sect Phyllostachys species, unlabeled and non-Phyllostachys species, respectively. Name with 'LOC' stands represented newly sequenced sequences in this study.