UGbS-Flex, a novel bioinformatics pipeline for imputation-free SNP discovery in polyploids without a reference genome: finger millet as a case study
© The Author(s). 2018
Received: 6 December 2017
Accepted: 23 May 2018
Published: 15 June 2018
Research on orphan crops is often hindered by a lack of genomic resources. With the advent of affordable sequencing technologies, genotyping an entire genome or, for large-genome species, a representative fraction of the genome has become feasible for any crop. Nevertheless, most genotyping-by-sequencing (GBS) methods are geared towards obtaining large numbers of markers at low sequence depth, which excludes their application in heterozygous individuals. Furthermore, bioinformatics pipelines often lack the flexibility to deal with paired-end reads or to be applied in polyploid species.
UGbS-Flex combines publicly available software with in-house python and perl scripts to efficiently call SNPs from genotyping-by-sequencing reads irrespective of the species’ ploidy level, breeding system and availability of a reference genome. Noteworthy features of the UGbS-Flex pipeline are an ability to use paired-end reads as input, an effective approach to cluster reads across samples with enhanced outputs, and maximization of SNP calling. We demonstrate use of the pipeline for the identification of several thousand high-confidence SNPs with high representation across samples in an F3-derived F2 population in the allotetraploid finger millet. Robust high-density genetic maps were constructed using the time-tested mapping program MAPMAKER which we upgraded to run efficiently and in a semi-automated manner in a Windows Command Prompt Environment. We exploited comparative GBS with one of the diploid ancestors of finger millet to assign linkage groups to subgenomes and demonstrate the presence of chromosomal rearrangements.
The paper combines GBS protocol modifications, a novel flexible GBS analysis pipeline, UGbS-Flex, recommendations to maximize SNP identification, updated genetic mapping software, and the first high-density maps of finger millet. The modules used in the UGbS-Flex pipeline and for genetic mapping were applied to finger millet, an allotetraploid selfing species without a reference genome, as a case study. The UGbS-Flex modules, which can be run independently, are easily transferable to species with other breeding systems or ploidy levels.
Efficient genotyping methods, whether used for mapping or population genetic studies, must be simple and reliable, and provide the allele composition at thousands or more of polymorphic loci that cover the entire genome . Despite the recent advances in sequencing technologies, whole genome sequencing is still not cost effective for large-genome species, especially when multiple-fold coverage needs to be achieved of several hundred individuals. Several reduced representation methods based on selective sequencing of restriction fragments have been developed to simultaneously conduct high-throughput marker discovery and genotyping [2–4]. We collectively refer to these methods as ‘genotyping-by-sequencing’ (GBS). To keep sequencing costs low, there is typically a trade-off between marker number and sequencing depth. As a result, genotyping-by-sequencing data sets often have large amounts of missing data and low sequence coverage at each locus [2, 5, 6]. Imputations can be used to infer missing genotypes . Low sequence depth, however, is highly problematic when analyzing diversity panels in outcrossing species, and biparental backcross, F1 (in outcrossing species) and F2 populations where sufficient sequence depth at each locus is a prerequisite to unambiguously differentiate homozygous and heterozygous alleles. We therefore implemented several modifications to the experimental GBS protocol developed by Elshire et al.  and Poland et al. , and tested their effect on reducing the GBS fragment pool and providing more even read coverage across pooled samples for high-confidence imputation free single nucleotide polymorphism (SNP) identification.
Analysis of GBS reads is fairly straightforward if whole genome sequence data is available to which the GBS reads can be aligned. Pipelines for reference-based GBS analysis include TASSEL-GBS [9, 10], Fast-GBS  and Stacks . Pipelines such as TASSEL-UNEAK , Stacks  and GBS-SNP-CROP  can generate a de novo GBS reference from the experimental data and are aimed at the analysis of GBS data from species without reference genomes. Most pipelines are geared towards single-end sequencing data from diploid organisms.
Finger millet, Eleusine coracana (L.) Gaertn. subsp. coracana, is an inbreeding allotetraploid (AABB) cereal belonging to the subfamily Chloridoideae with a haploid genome size of 1.7 Gb . Despite being an important food security crop for Eastern Africa and parts of southern India, it has persistently been neglected by the international research community. The wild progenitor of finger millet, E. coracana subsp. africana, originated by hybridization between E. indica (AA genome) and an unknown B-genome species. The timing of the allopolyploidization event is not known. To date, only a single linkage map has been generated of finger millet consisting of 332 loci, mostly detected by restriction fragment length polymorphism (RFLP) markers and single strand conformation polymorphic expressed sequenced tags [15, 16]. The map was generated in an F2 population derived from a cross between a wild accession, MD-20, and a cultivated accession, Okhale-1. Based on similarity between RFLP fragment sizes in the A-genomes of E. indica and E. coracana, linkage groups were allocated to subgenomes . The linkage map was also used to establish gross comparative relationships between the finger millet and rice genomes . There is a need for high-density SNP maps of finger millet to assist with trait analyses and planned genome sequencing efforts.
Even with less than 10,000 markers, construction of an accurate linkage map is extremely challenging, in particular when dealing with less than perfect data sets such as, for example, obtained in F3-reconstituted F2 populations. Traditional software like MAPMAKER [17, 18] and JoinMap [19, 20] use maximum likelihood based three-point and multi-point analyses which provide highly accurate marker ordering but are highly memory- and time-intensive for large data sets. To deal with large marker numbers, software packages such as MSTmap  and Lep-Map  based on the traveling salesman principle have been developed. Map generation is fast, but marker ordering is more sensitive to genotyping errors. To be able to take advantage of MAPMAKER’s high accuracy for ordering large marker sets, we modified the original MAPMAKER package to run efficiently in a Windows Command Prompt Environment and developed in-house python scripts to automate several steps of the MAPMAKER mapping process. The mapping pipeline was applied to generate a new high-density genetic map of finger millet comprised of several thousand high-quality SNP markers.
Our paper thus provides a modified GBS protocol, a new pipeline (UGbS-Flex) for analysis of paired-end GBS data suitable for application in species with different ploidy levels, breeding systems and polymorphism levels, irrespective of the availability of a reference genome sequence. We also provide a comprehensive solution for post-GBS data analysis, and a high-density genetic map of finger millet with new information on the organization of the allotetraploid finger millet genome.
Plant materials and DNA extraction
The F2 mapping population was generated from a cross between E. coracana subsp. africana accession MD-20 and E. coracana subsp. coracana accession Okhale-1 . One hundred and thirty-four F2:3 families with a minimum of 15 plants per F3 family plus the parents were grown in the greenhouse at UGA under day/night temperatures in the range 26–30 °C. KNE 796, an accession for which whole genome sequencing is ongoing, and E. indica accessions Ei-0, Ei-2 and Ei-5, collected in the wild in Kenya, were grown under the same conditions. All germplasm was obtained in compliance with national and international accords on export/import of seeds for research purposes. DNA was extracted using a modified CTAB method adapted from Doyle and Doyle  from leaves bulk-harvested for each F3 family from eight weeks old seedlings. DNA concentrations were measured on a Nanodrop (Thermo Scientific) and samples were diluted to 50 ng/μl. The quality of the DNA was assessed on a 0.8% agarose gel.
GBS sample preparation
Two-hundred nanogram of high-molecular-weight DNA was digested with a cocktail of either PstI/MspI, PstI/NdeI or PstI/MspI + ApeKI. Digestions were done in volumes of 30 μl with 4 U PstI-HF and 8 U MspI in NEB CutSmart buffer for PstI/MspI digestions, 4 U PstI, 8 U MspI and 4 U ApeKI in NEB-buffer 3.1 for PstI/MspI plus ApeKI digestions, and 4 U PstI-HF and 4 U NdeI in NEB CutSmart buffer for PstI/NdeI digestions. Samples were incubated for 2 h at 37 °C. This was followed by an additional 2-h incubation at 75 °C for reaction mixtures comprising the enzyme ApeKI. Samples without ApeKI were incubated at 75 °C for 20 min to inactivate the restriction enzymes.
Twenty microliter restriction digest was mixed with 1 μl barcoded PstI adapter (stock: 0.1 μM), 1.5 μl common MspI or NdeI Y-adapter (stock: 10 μM), 4 μl 10X T4-DNA ligase buffer and 200 U T4 DNA ligase (NEB) in a total volume of 40 μl. The common Y-adaptor and barcoded adapters were as described by Poland et al. . Ligation was conducted at 22 °C for 2 h. Following ligation, fragments smaller than 300 bp were removed by incubating the samples with 0.7 volumes of Sera-Mag SpeedBeads (GE Healthcare Life Sciences) prepared according to Rohland and Reich  at room temperature for 5 min. The beads were separated from the supernatant using a magnetic stand and washed three times with 200 μl freshly prepared 70% ethanol. DNA was eluted from the air-dried beads with 40 μl 10 mM Tris.HCl (pH 8.0).
Three microliter of the resulting eluate was added to a cocktail of 16 μl H2O, 5 μl 5X Taq master mix (NEB), 0.5 μl of a forward primer specific to the barcoded adaptor (stock: 10 μM) and 0.5 μl of a reverse primer with homology to the common adaptor (stock: 10 μM). PCR amplification was done for each sample separately using an initial denaturation at 95 °C for 30 s, 16 cycles of denaturation at 95 °C for 30 s, primer annealing at 62 °C for 20 s and fragment elongation at 68 °C for 15 s, followed by a final fragment elongation step at 68 °C for 5 min. Eight microliters of PCR product were checked on a 1.5% agarose gel. The DNA concentration of each GBS library was measured on a Qubit 2.0 using a Qubit™ dsDNA HS assay kit. Only GBS libraries with concentrations > 5.0 ng/μl were sequenced. Thirty nanograms of each GBS library were pooled. The number of samples pooled depended on the sequencing platform used; we aimed to obtain two million reads per sample. Primers, dNTP and small DNA fragments were removed from the pooled DNA with 0.7 volumes of AMPure Beads or Sera-Mag SpeedBeads. Pooled GBS libraries (100 ng) were sequenced on an Illumina NextSeq platform with paired-end 150 bp reads. The parents and 115 F2:3 samples were sequenced as part of the same sequencing run. An additional three and 26 F2:3 samples were sequenced as part of two separate NextSeq runs. Ten samples were sequenced in duplicate from independently generated libraries to ensure consistency across libraries and runs.
GBS analysis pipeline with optional de novo generation of a reference
The full UGbS-Flex pipeline is described below. All in-house perl and python scripts used in the UGbS-Flex pipeline with information on their use are provided under ‘Programs and Scripts’ on http://research.franklin.uga.edu/devoslab/. Detailed information on how to apply the UGbS-Flex pipeline is given in Additional file 1: Data S1 and Additional file 2: Figure S1.
Preprocessing of the reads
The read quality was checked with ‘FastQC’ v. 0.11.4 . Reads were split by barcode using the module ‘Process_Radtags’ within the ‘Stacks’ program  with option –r (rescue barcodes and RAD-tags). Forward reads passed the filter if they carried both the barcode and the PstI restriction site. Because the first few bases of the reverse reads were low quality in some Illumina NextSeq sequencing runs, no selection was carried out for the restriction site of the second enzyme (MspI or NdeI) in the reverse reads. ‘FASTX_trimmer’ from the ‘FASTX Toolkit’ package (http://hannonlab.cshl.edu/fastx_toolkit/) was used to remove (1) the restriction sites, (2) the last 5 bp (typically) of each read that were more likely to contain errors (or, for lower quality runs, all bases at the 3′ end of a read with FastQC quality scores lower than 20) and (3) an additional 0 (for a 10 bp barcode) to 5 bases (for a 5 bp barcode) at the 3′ end of the forward read to make all reads the same length. Identical read length is a prerequisite of the ‘Stacks’ program  used in the generation of a de novo reference from the GBS reads.
De novo generation of a GBS reference
Filtering of the GBS representative tags to generate a reference
Two filtering steps were conducted on the representative tags identified across samples by either ‘cstacks’ or ‘ASustacks’ (Fig. 1). Firstly, representative tags that were present in less than a user-defined percentage of samples (50 and 70% in this study) in the ‘cstacks’ output were removed using the in-house perl script ‘FCT.pl’. This filtering step has been integrated into the ‘ASustacks’ module. Secondly, we removed representative tags that had similarity levels to another representative tag equal to or higher than a user-defined percentage (98% in this study). To achieve this, a blastn all-against-all analysis of the consensus tags was conducted (e-value threshold: 10e− 5). For tags with ≥98% homology, only a single tag was retained using python script ‘Ref_98.py’. The representative tags remaining were used as the GBS reference.
SNP/indel calling and filtering
Preprocessed reads were aligned against the GBS reference using Bowtie 2 v. 2.2.0 with default parameters . If a whole genome reference sequence is available, de novo generation of a GBS reference can be omitted and trimmed reads are aligned against the reference genome. For SNP/indel calling, we tested both the ‘Unified Genotyper (GATK v. 3.4.0)’ (parameters –dcov 1000, −glm BOTH) and ‘Haplotype Caller (GATK v. 3.4.0)’ (parameters --genotyping_mode DISCOVERY –stand_emit_conf 10 –stand_call_conf 30 –minPruning 1 –emitRefConfidence GVCF) modules within the GATK suite . Indels were treated the same as SNPs and, for simplicity, we use the term ‘SNPs’ to cover both SNPs and indels. Raw SNPs were filtered within GATK to only retain biallelic SNPs with an allele frequency between 10 and 90%. We also removed adjacent SNPs (scripts ‘SNPs_ISL.pl’ and ‘Rm_adj_SNPs.pl’) because some of these were artefacts caused by misalignments at the junction of the forward and reverse reads. The allelic depth (AD) information provided in the GATK .vcf file was then used to score the allelic status of the SNPs at each locus (script ‘SNP_genotyper.py’). Loci with a total AD < 8 were scored as missing data points (−). Loci with an ADref(erence allele)/ADalt(ernate allele) ratio ≥ 10 were scored as A (homozygous for the Parent 1 allele), ADref/ADalt ≤ 0.10 as B, 10 > ADref/ADalt > 4 as D (ambiguous A or heterozygous (H)) and 0.25 > ADref/ADalt > 0.1 as C (ambiguous B or H). Loci with other ratios were scored as H. For F2 populations generated from two inbred parents, as was the case for finger millet, all SNPs that were not homozygous for different alleles in the parents were removed (script ‘SNP_selectByParent.py’). SNPs that were missing in more than 30% of the samples or had an A/B ratio < 10% or > 90% were also removed. This was done manually in Excel. Similarly, samples with more than 30% of missing data were removed.
Comparison of the UGbS-flex and GBS-SNP-CROP pipelines
To compare the performance of the UGbS-Flex and GBS-SNP-CROP  pipelines, the raw sequence data from 48 tetraploid kiwiberry genotypes used by Melo et al.  were downloaded from NCBI (SRR2296676). Guided by the fastQC report, we trimmed the forward and reverse reads to 121 bp. The parameters used for ‘ustacks’ were –m 2 –M 2 –N 4, and for ‘ASustacks’–m 1 –M 2 –N 4. We selected the same missing data threshold (25%) and sequencing depth for SNP scoring as Melo and colleagues . The H-threshold was set at 4.
Reducing the dataset by consolidating SNPs located in the same GBS tag
Schematic representation of the approach used to consolidate SNPs within the same GBS reference tag
Reducing the dataset by removing cosegregating markers
To generate a set of high-quality non-redundant markers for genetic map construction, each SNP marker was given a penalty score for the presence of a ‘C’ or ‘D’ (penalty = 1) or a missing data point (penalty = 2). Using the in-house python script ‘SNP_cosegregation.py’, the mapping scores of all SNP markers were compared in an all-against-all analysis with a greedy algorithm. SNP markers with the same multilocus genotypes were identified, and the marker with the smallest penalty score in each set was selected for mapping.
Construction of genetic maps
To construct the genetic maps, we removed all SNPs that had missing data in more than 20% of the progeny. The set of SNP markers was split into linkage groups using MSTmap . Initial map orders were established with MST map and checked for double recombination events using MAPMAKER (adapted from ; adapted version available from http://research.franklin.uga.edu/devoslab/). Markers with more than a defined number (four in this study) of double recombination events were removed. The process of MSTmap mapping and checking of double recombination events was repeated until the number of double recombination events between each marker and its flanking markers was ≤4. The corresponding MSTmap maps were used as starting points for map generation using MAPMAKER (adapted from ; adapted version available from http://research.franklin.uga.edu/devoslab/).
Because the MAPMAKER version used was limited to ordering ~ 100 markers due to inherent program limitations, MSTmap maps with more than 100 markers were split into smaller subgroups of 60 to 100 markers. Subgroups overlapped by 40 markers. Genetic maps were constructed for each of the subgroups using the ‘order’ and ‘try’ commands. Subgroup maps were merged based on common marker orders in the overlapping segments. Each linkage group was split again into subgroups of < 100 markers, and marker orders were further adjusted using the ‘ripple’ command. ‘Try’ and ‘ripple’ were done in a semi-automated manner using the scripts ‘try.py’ and ‘ripple.py’, respectively. Final marker orders were merged across subgroups. Genetic map distances (in Kosambi) were obtained using the ‘map’ command with ‘error detection on’ in MAPMAKER. Map orders were scrutinized manually and, if necessary, further adjusted. Markers with the same multi-locus genotype were added to the map as cosegregating with their representative marker. Finally, we placed markers with ambiguous orders (not separated by clear recombination events) in bins.
Identifying a and B genome linkage groups
Reads from three E. indica accessions (AA genome) were aligned against the GBS reference tags using Bowtie 2 v. 2.2.9 with default parameters . Presence (present in at least two of the three E. indica accessions analyzed) and absence (absent from all three E. indica accessions analyzed) of mapped GBS tags in the E. indica genome were charted along the length of the genetic map using Excel scatterplots. Linkage groups with E. indica tags along their entire length were allocated to the A genome and those that were largely devoid of E. indica tags were allocated to the B-genome.
Efficiency of different enzyme combinations in generating polymorphic markers
We tested two two-enzyme combinations (PstI/MspI and PstI/NdeI) and one three-enzyme combination (PstI/MspI + ApeKI) on three finger millet accessions for their efficiency in generating largely overlapping fragment pools that, when sequenced, yielded SNPs that were present in all three accessions at a depth of at least 8×. All samples were sequenced (paired-end 150 bp) on an Illumina NextSeq platform. The number of reads obtained for each of the nine sample/enzyme combinations (three accessions, three enzyme combinations) is given in Additional file 3: Table S1. To estimate the effect of read depth on the de novo generation of a GBS reference, we analyzed subsets of 0.2 million (M), 0.5 M, 1 M, 2 M and 3 M paired-end reads for each accession/enzyme combination with our newly developed UGbS-Flex pipeline (Fig. 1). The smaller read numbers were subsets of the larger read sets.
For the enzyme combinations PstI/MspI and PstI/MspI plus ApeKI, as expected, the number of GBS tags that were common to all three accessions tested increased with increasing total read numbers, reaching a plateau around 2 M reads (Additional file 3: Table S1 ‘By Enzyme Combination’). For the enzyme combination PstI/NdeI, however, the number of common GBS tags increased from 0.2 M to 1 M total reads, but then decreased when total read numbers were increased from 1 M to 3 M. To test whether this was an artefact generated by the ‘cstacks’  module, which clusters reads across samples, we developed the script ‘across-sample ustacks’ (‘ASustacks’). ‘ASustacks’ generated an artificial fastq file from each sample’s ‘ustacks’ output and these files were used as input for ‘ustacks’. The ‘ASustacks’ approach yielded similar numbers of reference tags as ‘cstacks’ except for read numbers ≥1 M in enzyme combination PstI/NdeI. We now saw the expected increase in common GBS tags with increasing read numbers across all enzyme combinations. More than 97% of GBS reference tags that were identified with ‘cstacks’ were also found in the reference generated using ‘ASustacks’. Interestingly, the read depth of the GBS reference tags that were identified by both ‘cstacks’ and ‘ASustacks’ was significantly lower than the read depth of the GBS reference tags that were uniquely identified by ‘ASustacks’ (Additional file 4: Table S2). This suggests that high read depth hampered the performance of ‘cstacks’, possibly because a higher read depth led to a higher absolute presence of SNPs caused by PCR or sequencing errors in allelic reads. The ‘cstacks’ module may have eliminated these clusters as likely consisting of repetitive DNA. We conducted a blastn analysis of all finger millet reference tags that were identified by ‘ASustacks’ in the subset of 3 M reads in the PstI/NdeI digested samples against the repeat-masked rice (Oryza sativa) genome. Blastn hits were identified (e-value threshold of e-5) for 30% of the tags that were common to ‘cstacks’ and ‘ASustacks’, and for 37% of the tags that were uniquely identified by ‘ASustacks’. This shows that the tags eliminated by ‘cstacks’ were not enriched for repeats. To further validate the ‘ASustacks’ approach, we compared read clusters generated by ‘cstacks’ and ‘ASustacks’ in GBS data from DNA of a set of 96 diverse foxtail lily lines belonging to different Eremurus species digested with PstI/MspI. Foxtail lily has an ~ 7.9 Gb (1C) genome (I. Leitch, pers. comm.). The number of tags that were common to at least 50% of the lines based on ‘cstacks’ and ‘ASustacks’ was 376 and 3552, respectively, with 98% of ‘cstacks’ clusters being present in the ‘ASustacks’ output. In addition to yielding higher numbers of reference tags, the ‘ASustacks’ approach was computationally much less intensive than ‘cstacks’.
Summary statistics obtained for each of the three enzyme combinations for a subset of 1 M reads
Total read number
Number of stacks (tags) within samplesa
Number of stacks (tags) common to all samplesb
Number of polymorphic tagsc
Number of polymorphic SNPsc
Average depth of scored SNPs
PstI/MspI + ApeKI
Generation of a GBS reference in the MD-20 x Okhale-1 mapping population
Number of SNPs identified with different combinations of SNP callers and GBS references
Unified Genotyper (GATK)
Haplotype Caller (GATK)
No. of GBS tags
No. of SNPs
Comparison of UGbS-flex and GBS-SNP-CROP pipelines
Using the same dataset and, to the extent possible, same thresholds as reported by Melo and colleagues , UGbS-Flex yielded, before filtering, a total of 86,810 SNPs compared to 56,598 reported by Melo and colleagues for GBS-SNP-CROP (Additional file 7: Table S4). After filtering, the total number of SNPs retained from the UGbS-Flex pipeline was 50,139 compared to 21,318 by GBS-SNP-CROP (Additional file 7: Table S4). The SNP number reported for the GBS-SNP-CROP pipeline is when a single accession with the most abundant read number was used in the generation of a reference. When all 48 accessions were used, the total number of filtered SNPs obtained by GBS-SNP-CROP was 14,712. We used all 48 accessions to generate a reference with UGbS-Flex.
Number of markers and map length of each of the nine A and nine B genome linkage groups
Number of markers
Map length (cM)
Number of markers
Map length (cM)
Allocating linkage groups to a and B subgenomes
Optimization of the GBS technology
We tested several modifications to the experimental GBS protocol developed by Elshire et al.  and Poland et al. . The aim was to reduce the fragment pool for sequencing and to provide more even read coverage across pooled samples to increase read depth at each locus and SNP representation across samples. A reduction in the number of fragments that will be sequenced can be attained by preventing the addition of Illumina sequencing adapters to a subset of the DNA fragments during the PCR step. This can be achieved using primers with one or more selective bases as applied in Tunable GBS (tGBS®, Data2Bio, ISU Research Park, Iowa) or by cutting PCR-amplifiable fragments with a restriction enzyme. We pioneered the latter approach in finger millet, an inbreeding tetraploid species. DNA of three accessions was either double digested with the enzyme combination PstI/MspI or PstI/NdeI, or triple digested by adding ApeKI to the PstI/MspI digest. PstI and NdeI are 6-bp cutters, MspI is a 4-bp cutter and ApeKI is a 5-bp cutter. PstI, NdeI and ApeKI are sensitive to CNG methylation. No adapters were ligated to sites generated by the third enzyme. While employing a third enzyme did reduce the fragment pool significantly, random ligation between fragments that originated from different parts of the genome during the adapter-ligation step resulted in chimeric fragments. These fragments were sequenced but did not align within or across accessions. Use of a three-enzyme mix therefore did not provide an advantage over a double digest. A third enzyme could likely be employed more effectively after ligation of the adapters, but this would require an additional step in the GBS protocol. The use of NdeI, a 6-bp cutter, in the double digest was advantageous for low read numbers (500,000 or less) because a smaller fragment pool allowed more fragments to be sequenced that were common across samples to a depth of at least 8×. For target read numbers in the range of 1 M to 2 M per sample and SNP scoring at a minimum read depth of 8×, we recommend the use of PstI/MspI. The PstI/MspI combination generated a larger fragment pool than PstI/NdeI but this fragment pool remained sufficiently small that many fragments that were common across samples were sequenced to the desired 8× depth. Obtaining more than 2 M reads was not cost-effective for fragment pools generated by PstI/MspI (Additional file 3: Table S1).
Generation of a GBS reference
We wanted a GBS protocol and analysis pipeline that would provide several thousand robust SNP markers in a variety of species, irrespective of the breeding system, ploidy level, polymorphism level and availability of a whole-genome sequence. To maximize the chances of finding SNP variation in species with unknown or low levels of variation, we opted to sequence the reduced representation libraries from both ends (2 × 150 bp on a NextSeq). Furthermore, the longer paired-end reads were more likely to identify orthologous sequences in related species, and availability of comparative information is key to the study of orphan crops with few genetic and genomic resources. While a few pipelines (e.g. TASSEL-GBS , STACKS ) were available when we initiated our analyses that could build a genomic reference from the GBS tags, none could deal with paired-end reads or had been developed for data from polyploid species. These shortcomings were recently addressed with the publication of the GBS-SNP-CROP pipeline . We present an alternative pipeline, UGbS-Flex, that can generate a GBS reference from paired-end data from inbreeding diploids as well as outcrossing allopolyploids. This intuitive and flexible pipeline consists of publicly available software packages and in-house perl and python scripts to cover data analysis from read processing to SNP scoring (Fig. 1). Each UGbS-Flex module can be run independently.
Much of the UGbS-Flex pipeline development was done with GBS sequencing data generated in an F2 population of finger millet. However, we have successfully used the UGbS-Flex pipeline for the analysis of GBS data generated in diversity panels in finger millet and foxtail lily, and in mapping populations in seashore paspalum, an outcrossing diploid, and switchgrass, an outcrossing tetraploid. To facilitate working with paired-end data, we merged overlapping sequences, and considered non-overlapping paired-end reads as contiguous sequences. We then grouped reads by similarity within samples using the ‘ustacks’ module from the STACKs package  and across samples using either ‘cstacks’ or the novel ‘ASustacks’ approach. The latter applies ‘ustacks’ to a file that comprises consensus sequences for per sample read clusters from all samples. Because the ‘ASustacks’ approach is more efficient in terms of computing resources and numbers of identified GBS reference tags, it has been integrated in the UGbS-Flex pipeline for analysis of reduced representation sequencing data generated using methylation-sensitive restriction enzymes and hence enriched for low copy sequences.
Optimal ‘ustacks’ and ‘cstacks/‘ASustacks’ parameters are dependent on the breeding system, ploidy, and polymorphism level of the species under investigation. For example, in an inbreeding species, most loci are homozygous, and stacks within an accession should be built with reads containing no or only a single mismatch. Use of very stringent conditions in the stack building within inbreeding individuals will promote separation of paralogs and, in polyploids, of homoeologous loci. In outcrossing polyploid species, however, stacks within accessions need to be generated that comprise homo-alleles but not homoeo-alleles. Finding the optimal parameters to achieve this can be challenging, in particular in highly polymorphic polyploids, and needs to be done empirically. Similar considerations need to be taken into account when generating stacks across accessions. For finger millet, we allowed a 1 bp mismatch both for clustering within samples and across samples. Polymorphism levels in finger millet are relatively low, and because we were working with an F2 population, allelic variation within and between samples was similar. The ‘cstacks’/‘ASustacks’ outputs were filtered to retain only those consensus sequences that were present in a specified percentage of the samples, and these sequences were used as a reference. This step limits downstream identification of SNPs with large amounts of missing data.
We tested both GATK’s Haplotype Caller and Unified Genotyper for calling SNPs. Unified Genotyper identified some 17% more SNPs than Haplotype Caller (Table 3). On average, 25% of the SNPs identified with Unified Genotyper were not identified by Haplotype Caller (Fig. 2, Additional file 6: Figure S2). Conversely, some 12% of SNPs identified with Haplotype Caller were not identified by Unified Genotyper. These comparisons were made on filtered data sets, and hence concerned robust SNPs. The percentage of SNPs uniquely identified by Unified Genotyper and Haplotype Caller that were incorporated in the genetic maps and hence were validated, was very similar (41.3% vs 40.1% for Ref50_98 and 47.5% vs 40.8% for Ref70_98). This indicates that, at least in our dataset, the high sensitivity/aggressiveness of Unified Genotyper in calling SNPs did not increase the false positive rate. For comparison, the percentage of SNPs identified by both SNP callers that were incorporated in the genetic maps was 56% for Ref50_98 and 55.4% for Ref70_50. Freebayes , a haplotype-based SNP caller, also yielded a lower number of SNPs than Unified Genotyper when tested on the parents of the mapping population. Although we did not validate the SNPs identified by Freebayes, the number of SNPs identified in common with Unified Genotyper and the percentage of common SNPs mapped suggest that the lower SNP detection rate in Freebayes would have resulted in a lower percentage of mappable polymorphisms.
The threshold number of samples in which each representative GBS tag had to be present in order to be included in the GBS reference also greatly affected the number of identified SNPs, although the trend was opposite to what we expected (Table 3). To our surprise, we identified fewer SNPs, after filtering, when a representative tag was selected to be present in at least 50% of the samples (Ref50) compared to at least 70% (Ref70), despite the fact that more reference tags were present in the Ref50 compared to the Ref70 reference. We hypothesized that, although the Ref70 reference carried fewer tags, more loci were represented by only a single allele. If alleles from the same locus (e.g. Parent 1 allele and Parent 2 allele) formed separate stacks during the clustering performed by ‘ustacks’, both would be included in the GBS reference. Reads from the same locus in heterozygous individuals would align to either the allele 1 GBS reference tag or the allele 2 GBS reference tag during the bowtie alignment, and be scored as two homozygous loci in GATK. To test this hypothesis, we removed tags that were potentially allelic to another tag. When a GBS tag had more than 98% homology to another tag following a blast-all-against-all analysis, only one of the tags was included in the GBS reference. Removing allelic tags from Ref50 (referred to as Ref50_98) resulted in the calling of 136% new SNPs, while less than 2% of SNPs were lost (Additional file 6: Figure S2). As expected, SNP gain was significantly smaller (15%) when allelic tags were removed from Ref70 (referred to as Ref70_98), but still greatly outpaced SNP loss (1%) (Additional file 6: Figure S2). SNP discovery now followed the expected trend with approximately 24% higher SNP numbers being identified with Ref50_98 compared to Ref70_98 (Fig. 2). Interestingly, only 75% of the SNPs identified with Ref70_98 were detected with Ref50_98. The percentage of SNPs uniquely found by only a single reference was 25 and 39% for Ref70_98 and Ref50_98, respectively. This suggest that a substantially improved reference could be obtained and SNP calling maximized by combining the tags in both GBS references.
The UGbS-Flex pipeline compares favorably to GBS-SNP-CROP , which can also use paired-end reads as input. Using the same filtering criteria, the number of SNPs identified by UGbS-Flex was more than 3-fold that identified by GBS-SNP-CROP when the reference in both pipelines was generated with all 48 accessions. Interestingly, more heterozygotes were identified by GBS-SNP-CROP than UGbS-Flex. While a larger percentage of the SNPs identified by GBS-SNP-CROP than UGbS-Flex (99.9% vs 73.7%) had read depths ≥20, it seems unlikely that the read depth explains the difference in the calling of homozygotes vs heterozygotes between the two pipelines because of the stringent criteria used in SNP calling. As per Melo and colleagues , to score a SNP locus as homozygote required a minimum read depth of 11 when the secondary allele count was zero, and a minimum read depth of 48 when the secondary allele count was one. Because kiwiberry is a tetraploid, it is likely that some of the heterozygote SNPs are variants between homoeo-alleles rather than homo-alleles. The parameters used to cluster reads may have separated homoeologs from homologs at a higher frequency in UGbS-Flex than in GBS-SNP-CROP.
Construction of a high-density genetic map
We had previously constructed a 332-loci genetic map in an F2 population generated from a cross between the wild E. coracana subsp. africana acc. MD-20 and the cultivated E. coracana subsp. coracana acc. Okhale-1 using restriction fragment length polymorphism (RFLP), simple sequence repeat (SSR) and expressed sequenced tag (EST) markers [15, 16]. The same population was used here to generate a high-density genetic map. The decision to use a three-enzyme combination (PstI/MspI plus ApeKI) for GBS of the mapping population predated data availability on the relative efficiency of different enzyme combinations in generating polymorphic markers. The choice of enzyme combination does not affect map quality, only marker number. Because DNA from individual F2 plants was no longer available, we used DNA extracted from bulked F2:3 families for mapping. The drawback of this was that the mapping data was not quite as clean as when genotyping actual F2 plants, especially in heterozygous regions that underwent recombination or displayed segregation distortion. Mapping programs such as MSTmap  and Lep-Map  are based on the traveling salesman principle (TSP) and can very quickly generate maps with large numbers of markers. However, because marker ordering relies on two-point linkage information, TSP mapping programs are more affected by missing data and provide less robust genetic maps than programs such as MAPMAKER  that use multipoint analyses  (Additional file 11: Figure S4). Furthermore, C (ambiguous B or H) and D (ambiguous A or H) values could be incorporated in MAPMAKER, but had to be converted to missing data points in MSTmap. We therefore used a hybrid approach in which we identified linkage groups and did the initial marker ordering with MSTmap. Using the MSTmap marker orders, we generated maps in MAPMAKER with the option ‘error detection on’ to identify markers with high levels of genotypic errors. These markers were removed from further analyses. We then used the MSTmap marker orders to select overlapping marker groups for fine-scale ordering using three/multipoint analyses in MAPMAKER. The version of MAPMAKER employed (available from http://research.franklin.uga.edu/devoslab/) had been modified to run efficiently in a Windows Command Prompt Environment and to handle larger numbers of markers than the original MAPMAKER version . Despite the modifications, marker ordering was still limited to groups of approximately 100 markers due to inherent software limitations. The ‘try’ and ‘ripple’ commands in MAPMAKER were semi-automated with in-house developed python scripts. Because the overlapping segments were selected based on the initial marker orders defined by MSTmap, incorrect placement of one or more markers in the MST map could affect the final MAPMAKER-generated map. Recombination events in the final maps were therefore scrutinized manually and, if necessary, potentially problematic map regions were reanalyzed. Two blocks of markers, one on linkage group 2A and one on 2B, were removed because they were flanked on either side by a large number of recombination events. Following the manual assignment of markers to recombination bins and addition of the cosegregating markers, we obtained a robust map consisting of 4453 SNP markers organized in 18 linkage groups (Figs. 3, 4 and 5; Additional file 9: Table S6).
The finger millet genetic map and its characteristics
We provide a detailed analysis of the effect of parameter and module changes in the GBS methodology, reference generation and SNP calling that were aimed at maximizing the discovery of high confidence SNPs with minimal missing data. The newly developed UGbS-Flex pipeline provides a useful addition to the currently available tools for GBS data analysis, in particular for genotyping of heterozygous and polyploid individuals from paired-end sequencing reads in the absence of a whole-genome reference. The UGbS-Flex pipeline was applied here to identify high confidence SNPs that were subsequently used to generate the first high-density genetic map of finger millet, Eleusine coracana. Map robustness was achieved by applying multipoint marker ordering. We demonstrated that cross-species application of GBS is feasible between closely related species and, when applied between an allopolyploid and one of the diploid genome donors, can be used to identify the subgenome origin of linkage groups and the occurrence of translocations between homoeologous chromosomes. Finger millet, an inbreeding allopolyploid species with few genomic resources was used here as a case study, but applicability of the UGbS-Flex pipeline to other species, including the outcrossing tetraploid switchgrass has been successfully achieved.
This research was funded in part through the BioInnovate Africa Program, the University of Georgia and DOE grant DE-SC0010743. DS was funded through a Fulbright-Nehru Senior Research Fellowship grant (Award 1865/FN-SR/2013) from the United States India Educational Foundation, New Delhi, India.
Availability of data and materials
All datasets used and scripts generated are included in this manuscript as supplementary data and/or are available through the Devos Lab website (http://research.franklin.uga.edu/devoslab/).
PQ developed most of the scripts and led the data analysis. DG conducted GBS on the mapping population and assisted with the data analysis. DC and XW contributed scripts. DS and SS participated in optimization of the GBS experimental protocol. MMD generated the mapping population and collected the E. indica germplasm. RLM generated the updated version of MAPMAKER. KMD conceived the research, participated in the analyses and led the manuscript preparation. All authors participated in preparation of the manuscript and approved the manuscript.
Ethics approval and consent to participate
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Luikart G, England PR, Tallmon D, Jordan S, Taberlet P. The power and promise of population genomics: from genotyping to genome typing. Nat Rev Genet. 2003;4(12):981–94.View ArticlePubMedGoogle Scholar
- Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One. 2011;6Google Scholar
- Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, Selker EU, Cresko WA, Johnson EA. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One. 2008;3Google Scholar
- Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, Lander ES. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature. 2000;407(6803):513–6.View ArticlePubMedGoogle Scholar
- Lu F, Lipka AE, Glaubitz J, Elshire R, Cherney JH, Casler MD, Buckler ES, Costich DE. Switchgrass genomic diversity, ploidy, and evolution: novel insights from a network-based SNP discovery protocol. PLoS Genet. 2013;9:e1003215.View ArticlePubMedPubMed CentralGoogle Scholar
- Gardner KM, Brown P, Cooke TF, Cann S, Costa F, Bustamante C, Velasco R, Troggio M, Myles S. Fast and cost-effective genetic mapping in apple using next-generation sequencing. G3: Genes|Genomes|. Genetics. 2014;4(9):1681–7.Google Scholar
- Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat Rev Genet. 2011;12(7):499–510.View ArticlePubMedGoogle Scholar
- Poland JA, Brown PJ, Sorrells ME, Jannink J-L. Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach. PLoS One. 2012;7:e32253.View ArticlePubMedPubMed CentralGoogle Scholar
- Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics. 2007;23(19):2633–5.View ArticlePubMedGoogle Scholar
- Glaubitz JC, Casstevens TM, Lu F, Harriman J, Elshire RJ, Sun Q, Buckler ES. TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. PLoS One. 2014;9(2):e90346.View ArticlePubMedPubMed CentralGoogle Scholar
- Torkamaneh D, Laroche J, Bastien M, Abed A, Belzile F. Fast-GBS: a new pipeline for the efficient and highly accurate calling of SNPs from genotyping-by-sequencing data. BMC Bioinformatics. 2017;18:5.View ArticlePubMedPubMed CentralGoogle Scholar
- Catchen JM, Amores A, Hohenlohe P, Cresko W, Postlethwait JH. Stacks: building and genotyping loci de novo from short-read sequences. G3: Genes|Genomes|. Genetics. 2011;1(3):171–82.Google Scholar
- Melo ATO, Bartaula R, Hale I. GBS-SNP-CROP: a reference-optional pipeline for SNP discovery and plant germplasm characterization using variable length, paired-end genotyping-by-sequencing data. BMC Bioinformatics. 2016;17(1):29.View ArticlePubMedPubMed CentralGoogle Scholar
- Mysore KS, Baird V. Nuclear DNA content in species of Eleusine (Gramineae): a critical re-evaluation using laser flow cytometry. Plant Syst Evol. 1997;207:1–11.View ArticleGoogle Scholar
- Dida MM, Srinivasachary RS, Bennetzen JL, Gale MD, Devos KM. The genetic map of finger millet, Eleusine coracana. Theor Appl Genet. 2007, 114:321–32.Google Scholar
- Srinivasachary DMM, Gale MD, Devos KM. Comparative analyses reveal high levels of conserved colinearity between the finger millet and rice genomes. Theor Appl Genet. 2007;115:489–99.View ArticlePubMedGoogle Scholar
- Lander ES, Green P, Abrahamson J, Barlow A, Daly MJ, Lincoln SE, Newburg L. MAPMAKER: an interactive computer package for constructing primary genetic linkage maps of experimental and natural populations. Genomics. 1987;1:174–81.View ArticlePubMedGoogle Scholar
- Lincoln S, Daly M, Lander ES: Constructing genetic maps with MAPMAKER/EXP 3.0. Cambridge, Mass., USA.: Whitehead Institute for Biomed Res; 1993.Google Scholar
- Van Ooijen JW: Joinmap ® 4, Software for the calculation of genetic linkage maps in experimental populations. In. Wageningen, Netherlands: Kyazma B.V.; 2006.Google Scholar
- Stam P. Construction of integrated genetic linkage maps by means of a new computer package: JoinMap. Plant J. 1993;3:739–44.View ArticleGoogle Scholar
- Wu Y, Bhat PR, Close TJ, Lonardi S. Efficient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph. PLoS Genet. 2008;4(10):e1000212.View ArticlePubMedPubMed CentralGoogle Scholar
- Rastas P, Paulin L, Hanski I, Lehtonen R, Auvinen P. Lep-MAP: fast and accurate linkage map construction for large SNP datasets. Bioinformatics. 2013;29(24):3128–34.View ArticlePubMedPubMed CentralGoogle Scholar
- Doyle JJ, Doyle JL. A rapid DNA isolation procedure for small quantitites of fresh leaf tissue. Phytochemical Bulletin. 1987;19:11–5.Google Scholar
- Rohland N, Reich D. Cost-effective, high-throughput DNA sequencing libraries for multiplexed target capture. Genome Res. 2012;22(5):939–46.View ArticlePubMedPubMed CentralGoogle Scholar
- Andrews S: FastQC: a quallity control tool for high throughput sequence data. 2010. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc.
- Magoc T, Salzberg S. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27Google Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Meth. 2012;9:357–9.View ArticleGoogle Scholar
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303.View ArticlePubMedPubMed CentralGoogle Scholar
- Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. In: arXiv:12073907; 2012.Google Scholar
- Ahfock D, Wood I, Stephen S, Cavanagh CR, Huang BE. Characterizing uncertainty in high-density maps from multiparental populations. Genetics. 2014;198(1):117–28.View ArticlePubMedPubMed CentralGoogle Scholar
- Riley R, Chapman V. Genetic control of the cytologically diploid behaviour of hexaploid wheat. Nature. 1958;182:713–5.View ArticleGoogle Scholar
- Jenczewski E, Eber F, Grimaud A, Huet S, Lucas MO, Monod H, Chevre AM. PrBn, a major gene controlling homeologous pairing in oilseed rape (Brassica napus) haploids. Genetics. 2003;164(2):645–53.PubMedPubMed CentralGoogle Scholar
- Gaeta RT, Chris Pires J. Homoeologous recombination in allopolyploids: the polyploid ratchet. New Phytol. 2010;186(1):18–28.View ArticlePubMedGoogle Scholar
- Devos KM, Sorrells ME, Anderson JA, Miller TE, Reader SM, Lukaszewski AJ, Dubcovsky J, Sharp PJ, Faris J, Gale MD. Chromosome aberrations in wheat nullisomic-tetrasomic and ditelosomic lines. Cereal Res Commun. 1999;27:231–9.Google Scholar
- Chennaveeraiah MS, Hiremath SC. Genome analysis of Eleusine coracana Gaertn. Euphytica. 1974;23:489–95.View ArticleGoogle Scholar
- Hiremath SC, Salimath SS. The a genome donor of Eleusine coracana Gaertn (Gramineae). Theor Appl Genet. 1992;84:747–54.PubMedGoogle Scholar