Chloroplast genome sequencing
To generate a complete TK chloroplast genome sequence, chloroplast DNA was extracted from a mixture of genetically distinct TK plants. To reduce polysaccharide content, which interferes with DNA extraction, young leaves were harvested from 1- to 2-month-old greenhouse-grown TK plants that had been kept in the dark for 2 days before harvesting. About 20 g of leaf tissue was ground in liquid nitrogen and suspended in 400 ml grinding buffer (0.35 M sorbitol, 50 mM HEPES/KOH, pH 7.5, 2 mM EDTA, 1 mM MgCl2, 1 mM MnCl2 and 4.4 mM sodium ascorbate, added just before use; modified from [24, 25]). The homogenate was filtered through four layers of Miracloth, and the filtrate was centrifuged at 4500 × g for 20 min. The pellets were resuspended, layered on top of a 30–50% sucrose gradient and centrifuged for 45 min at 10,000 × g, at 4 °C, in a swinging bucket rotor. Intact chloroplasts formed a layer between the 30% and 50% sucrose and were thereby separated from broken chloroplast remnants. Isolated chloroplasts were treated with DNase using the Ambion® TURBO DNA-free™ Kit (Thermo Fisher Scientific Inc., Waltham, MA, USA) to degrade nuclear DNA. Chloroplast DNA was extracted using the GenElute™ Plant Genomic DNA Miniprep Kit (Sigma-Aldrich®, St. Louis, MO, USA) and enriched using the REPLI-g® Mini Kit (Qiagen, Inc., Hilden, Germany). DNA quality and quantity were initially assessed using a NanoDrop® ND-1000 Spectrophotometer (NanoDrop Technologies, Inc., Wilmington, DE, USA). Distinct band patterns observed after digestion with the restriction enzyme EcoRI indicated a high proportion of chloroplast DNA. The DNA was submitted to the Molecular and Cellular Imaging Center (MCIC) at the Ohio Agricultural Research and Development Center (OARDC) for additional quality control and sequencing on the Illumina GAII platform.
To generate TK chloroplast genomes from multiple genotypes, as well as complete TO and TB chloroplast genomes, the three species were sequenced by MiSeq. A total of 24 genotypes were selected for TK, including 19 USDA lines, three mixed genotypes from USDA lines and a single cytoplasmic male sterile line (Additional file 1). All the USDA lines used in this study were obtained from the USDA-ARS National Plant Germplasm System (NPGS). These samples were collected in southeast Kazakhstan in 2008, from an area delineated by 42.79949 N to 43.06724 N and 79.17952 E to 80.08643 E [26]. Detailed information on this collection can be obtained through the NPGS database, Germplasm Resources Information Network (GRIN), at http://www.ars-grin.gov/npgs/ [27]. Additional plants were selected from individual crosses between plants of specific USDA accessions. All of the genotypes selected to represent TK were self-incompatible and outcrossing, without variance in genome size. Twenty-four TO genotypes from a global collection of TO seed, including seed collected from North America, Europe and China, were used for sequencing (Additional file 2). All TO seeds used in this study were donated voluntarily by weed scientists and other collaborators, and were collected by Prof. John Cardina (Ohio Agricultural Research and Development Center, The Ohio State University, Wooster, OH, USA). No permissions were required to obtain these seeds. TO seeds were identified based on plant morphology and reproductive system. TB “Clone A”, donated by Peter van Dijk (Keygene, Wageningen, Netherlands) and originally from the Botanical Garden, Marburg, Germany, together with 11 genotypes descended from plants collected in Kazakhstan and distributed broadly by Dr. Anvar Buranov (Nova-BioRubber Green Technologies Inc., Canada), were used for TB chloroplast sequencing (Additional file 3) [3].
All TO and TB plants used produced full seed set without pollination and exhibited apomixis after emasculation, with the exception of a single diploid, sexual TO accession, which was deliberately included. Total DNA was extracted from the 60 leaf samples using a 2% cetyl trimethylammonium bromide (CTAB) DNA extraction protocol [28]. DNA concentration was normalized to 1 ng μL−1, and the entire chloroplast genome was amplified by long-range polymerase chain reaction (PCR) using Q5® High-Fidelity DNA Polymerase (New England Biolabs Inc., Ipswich, MA, USA). Primers were designed in conserved regions of the draft TK chloroplast sequence generated from the Illumina GAII data (Additional file 4). Amplified fragments were normalized to the same molarity within each species and submitted for MiSeq sequencing. The 24 genotypes of TK, 24 genotypes of TO and 12 genotypes of TB were sequenced in a single MiSeq run. One library was made per species, and each library was tagged with a different barcode sequence so that short reads could be separated by species. Individual accessions were not tagged separately.
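The equimolar normalization of amplified fragments can be illustrated with a short calculation. The following is a minimal sketch, not the authors' procedure: it assumes the common approximation of 660 g/mol per base pair of double-stranded DNA, and the function names, fragment lengths and concentrations are hypothetical.

```python
# Illustrative only: convert amplicon concentrations to molarity and compute
# the volume of each amplicon needed for an equimolar pool.
# The 660 g/mol per bp figure is a standard approximation for dsDNA;
# every input value used with these helpers is hypothetical.

AVG_BP_MASS = 660.0  # g/mol per base pair of double-stranded DNA


def ng_per_ul_to_nM(conc_ng_ul: float, length_bp: int) -> float:
    """Molar concentration (nM) of a dsDNA fragment of the given length."""
    return conc_ng_ul * 1e6 / (AVG_BP_MASS * length_bp)


def equimolar_volumes(amplicons: dict, fmol_each: float) -> dict:
    """Volume (uL) of each amplicon that contributes `fmol_each` femtomoles.

    `amplicons` maps a name to (concentration in ng/uL, length in bp).
    """
    volumes = {}
    for name, (conc_ng_ul, length_bp) in amplicons.items():
        nM = ng_per_ul_to_nM(conc_ng_ul, length_bp)  # 1 nM equals 1 fmol/uL
        volumes[name] = fmol_each / nM
    return volumes
```

At equal mass concentration, a fragment twice as long has half the molarity, so twice the volume is needed to contribute the same number of molecules to the pool.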
Chloroplast genome assembly and annotation
Paired-end reads were generated for multiple genotypes of TK, TO, and TB by the Illumina GAII and MiSeq sequencing platforms. Quality control was conducted using the FASTX-Toolkit [29]. For the TK GAII data, the quality cutoff score (-q) was 40; a cutoff of 20 was used for all MiSeq data. A complete TK chloroplast genome sequence was assembled from the high-quality GAII short reads using Velvet (version 1.2.10) with the parameters kmer = 35 and -cov_cutoff = 20 [30]. Three contigs of 18,568, 24,353 and 84,064 bp were generated. The 18,568 and 84,064 bp contigs had coverages of 344 and 343, respectively, and represent the single-copy regions. The 24,353 bp contig had a higher coverage of 834 because this region is present in two copies in a chloroplast haplotype. No Ns were included in the contigs. TO and TB short reads were assembled using the same method, with a quality cutoff of 20. Assembled contigs were then aligned to the TK chloroplast genome, used as a reference, with BLASTn to generate the complete chloroplast genomes [31].
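The coverage argument for the TK contigs can be checked with simple arithmetic. The sketch below uses the contig lengths and coverages reported above; the contig labels and helper functions are illustrative, and the length calculation assumes that the three contigs tile the genome without overlaps or gaps.

```python
# Contig lengths (bp) and read coverages as reported in the text.
# The dictionary keys are illustrative labels, not names used by the authors.
CONTIGS = {
    "single_copy_a": (18_568, 344),
    "repeat": (24_353, 834),
    "single_copy_b": (84_064, 343),
}
SINGLE_COPY = ("single_copy_a", "single_copy_b")


def infer_copy_numbers(contigs, single_copy):
    """Estimate copies per genome from coverage relative to single-copy contigs."""
    baseline = sum(contigs[name][1] for name in single_copy) / len(single_copy)
    return {
        name: max(1, round(cov / baseline))
        for name, (_, cov) in contigs.items()
    }


def genome_length(contigs, copies):
    """Total length if each contig occurs copies[name] times with no overlap."""
    return sum(length * copies[name] for name, (length, _) in contigs.items())


copies = infer_copy_numbers(CONTIGS, SINGLE_COPY)
total_bp = genome_length(CONTIGS, copies)
print(copies, total_bp)
```

The coverage ratio of the 24,353 bp contig to the single-copy baseline is about 2.4, which rounds to two copies. Under the stated assumption, the implied genome length is 18,568 + 84,064 + 2 × 24,353 = 151,338 bp.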
Complete chloroplast genomes of TK, TO, and TB were annotated using the Dual Organellar GenoMe Annotator (DOGMA) [32]. Annotation errors were manually corrected. An annotation map was generated using OrganellarGenomeDRAW (OGDRAW) [33].
Phylogenetic analysis in the Asteraceae and comparative analysis within the genus Taraxacum
Phylogenetic analysis was conducted using the Rubisco (ribulose-1,5-bisphosphate carboxylase/oxygenase) large subunit gene rbcL from TK, TO, TB and 27 other species in the Asteraceae with available chloroplast genome sequences (Additional file 5). Multiple sequence alignments were carried out using ClustalW, followed by phylogenetic tree generation using MEGA6 [34]. The tree was inferred by the Maximum Likelihood method, and the tree with the highest log likelihood was retained [35].
To analyze the similarities and divergences of the TK, TO, and TB chloroplast genomes, complete chloroplast sequences of these three species were input into the mVISTA program, along with their annotation information [36, 37]. The Shuffle-LAGAN mode was chosen for comparative analysis [38]. The TK chloroplast sequence was used as the reference genome.
Chloroplast species-specific marker discovery
To develop chloroplast species-specific markers between TK and TO, TO short reads were mapped to the TK chloroplast genome sequence using Bowtie 2 [39]. Variants between TK and TO were detected by FreeBayes using the default parameters [40]. TK short reads were also mapped to the TK chloroplast genome to eliminate variants that were not fixed within TK. Variants that differed between TK and TO but were fixed within each species were considered candidate species-specific markers.
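The filtering logic for candidate markers can be sketched as a comparison of alternate-allele frequencies at each reference position. This is a simplified illustration, not the authors' pipeline: it assumes one alternate allele per site, takes per-site frequencies as plain dictionaries rather than parsing variant-caller output, and the function name and the thresholds for "fixed" and "absent" are hypothetical choices that are not stated in the text.

```python
def candidate_markers(to_alt_freq, tk_alt_freq, fixed=0.99, absent=0.01):
    """Return positions where TO is fixed for an alternate allele and TK is not variable.

    to_alt_freq: position -> alternate-allele frequency among TO reads
                 mapped to the TK reference.
    tk_alt_freq: position -> alternate-allele frequency among TK reads
                 mapped to the same reference.
    fixed, absent: hypothetical frequency thresholds (assumptions).
    """
    markers = []
    for pos, to_af in sorted(to_alt_freq.items()):
        # A position with no variant call in TK carries only the reference allele.
        tk_af = tk_alt_freq.get(pos, 0.0)
        if to_af >= fixed and tk_af <= absent:
            markers.append(pos)
    return markers


# Hypothetical example: position 250 is polymorphic within TO,
# and position 900 is polymorphic within TK, so only 100 is retained.
to_calls = {100: 1.0, 250: 0.55, 900: 1.0}
tk_calls = {900: 0.30}
print(candidate_markers(to_calls, tk_calls))
```

Because accessions were pooled within each species library, frequencies of this kind describe the read pool for a species rather than individual genotypes.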
Nuclear species-specific marker discovery
To develop nuclear species-specific markers using available Expressed Sequence Tag (EST) resources, 41,294 ESTs of TO (GenBank accession numbers: DY802201-DY843494) and 16,441 ESTs of TK (GenBank accession numbers: GO660574-GO672283, DR398435-DR403165) were obtained from NCBI [22, 23] (Collins J, Whalen MC, Nural-Taban AH, Scott D, Hathwaik U, Lazo GR, Cox K, Durant K, Woolsey R, Schegg K, et al. Genomic and proteomic identification of candidates genes and proteins for rubber biosynthesis in Taraxacum kok-saghyz (Russian dandelion). 2009. Unpublished; Shintani D. Using EST from Taraxacum kok-saghyz root cDNA library to generate candidate rubber biosynthetic genes. 2005. Unpublished). Using the pipeline described by Kozik (2007) [41], ESTs were assembled into contigs and filtered. Interspecific variants were selected manually, by screening alignments flagged as containing interspecific variations.
Species-specific marker validation
Markers were validated through gel-based assays in populations larger than those used for sequencing. The numbers of genotypes used for TK, TO and TB were 102, 103 and 24, respectively (Additional files 1, 2 and 3). Cleaved Amplified Polymorphic Sequences (CAPS) markers were identified using CAPS Designer [44], and primers to validate them were designed using Primer3 [42, 43]. The PCR procedure was as follows: 5 min initial denaturation at 95 °C; 35 cycles of 40 s denaturation at 95 °C, 60 s annealing at 54 °C or 56 °C and 60 s elongation at 68 °C; and a final extension at 68 °C for 5 min. Tetra-primer ARMS-PCR was also carried out to detect SNPs, using a similar PCR procedure with a 58 °C annealing temperature [45]. All PCR reactions were conducted in a 10 μL volume using reagents obtained from New England Biolabs Inc. (Ipswich, MA, USA), following the manufacturer’s instructions.
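For planning purposes, the CAPS cycling program can be written down as data and its programmed hold time summed. The sketch below transcribes the step durations from the text; the variable and function names are illustrative, and ramp times between temperatures, which depend on the thermocycler, are not included.

```python
# Cycling program for the CAPS assays, transcribed from the text.
# Durations are in seconds; temperature ramping is not modeled.
INITIAL_DENATURATION_S = 5 * 60
CYCLES = 35
CYCLE_STEPS_S = {"denaturation": 40, "annealing": 60, "elongation": 60}
FINAL_EXTENSION_S = 5 * 60


def total_hold_seconds(initial_s, cycles, cycle_steps_s, final_s):
    """Sum of all programmed hold times in a three-stage PCR program."""
    return initial_s + cycles * sum(cycle_steps_s.values()) + final_s


total_s = total_hold_seconds(
    INITIAL_DENATURATION_S, CYCLES, CYCLE_STEPS_S, FINAL_EXTENSION_S
)
print(f"{total_s} s, about {total_s / 60:.0f} min")
```

The program amounts to 300 + 35 × 160 + 300 = 6200 s, or about 103 min of hold time before ramping is added.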