Construction of a high-density genetic map by specific locus amplified fragment sequencing (SLAF-seq) and its application to Quantitative Trait Loci (QTL) analysis for boll weight in upland cotton (Gossypium hirsutum.)

Background Upland Cotton (Gossypium hirsutum) is one of the most important worldwide crops it provides natural high-quality fiber for the industrial production and everyday use. Next-generation sequencing is a powerful method to identify single nucleotide polymorphism markers on a large scale for the construction of a high-density genetic map for quantitative trait loci mapping. Results In this research, a recombinant inbred lines population developed from two upland cotton cultivars 0–153 and sGK9708 was used to construct a high-density genetic map through the specific locus amplified fragment sequencing method. The high-density genetic map harbored 5521 single nucleotide polymorphism markers which covered a total distance of 3259.37 cM with an average marker interval of 0.78 cM without gaps larger than 10 cM. In total 18 quantitative trait loci of boll weight were identified as stable quantitative trait loci and were detected in at least three out of 11 environments and explained 4.15–16.70 % of the observed phenotypic variation. In total, 344 candidate genes were identified within the confidence intervals of these stable quantitative trait loci based on the cotton genome sequence. These genes were categorized based on their function through gene ontology analysis, Kyoto Encyclopedia of Genes and Genomes analysis and eukaryotic orthologous groups analysis. Conclusions This research reported the first high-density genetic map for Upland Cotton (Gossypium hirsutum) with a recombinant inbred line population using single nucleotide polymorphism markers developed by specific locus amplified fragment sequencing. We also identified quantitative trait loci of boll weight across 11 environments and identified candidate genes within the quantitative trait loci confidence intervals. The results of this research would provide useful information for the next-step work including fine mapping, gene functional analysis, pyramiding breeding of functional genes as well as marker-assisted selection. Electronic supplementary material The online version of this article (doi:10.1186/s12870-016-0741-4) contains supplementary material, which is available to authorized users.


Background
Upland cotton (Gossypium hirsutum L., 2n = 52) is widely grown because it provides superior natural fiber for the textile industry and daily life [1][2][3]. Increased industrial demand for the fiber makes it a challenge for cotton breeders to increase their yield. Boll weight is one of the important yield components of cotton. But cotton breeders struggle to increase their yield without compromising other fiber traits [4]. Through molecular marker assisted selection (MAS) we can directly select the plants through their genotype. Based on the construction of genetic linkage maps, further studies from identifying the quantitative trait loci (QTLs) of the target traits to identifying the functioning genes, to pyramiding breeding, could be facilitated. Based on MAS, the breeding efficiency could be improved while the breeding cycle is shortened. For the MAS, the density and quality of the genetic map is very important since it forms the basis for the next set of research activities including the detection of reliable and concise QTL confidence intervals, further identification of the functional genes in these concise confidence intervals. Currently most of the genetic maps are based on the simple sequence repeat (SSR) markers with low resolutions. The low polymorphic rate of SSR markers makes it difficult to construct a saturated SSR-based genetic map that covers the whole genome. With the development of the molecular markers, the single nucleotide polymorphism (SNP) markers became widely applied to genetic map construction and MAS due to its large number with a high density across the whole genome. Thus, it is a powerful tool to construct a high-density genetic map (HDGM) and to identify QTLs [5,6].
The next-generation sequencing (NGS) technique can be used to detect large quantities of SNP markers in the whole genome [7]. There are several methods of NGS including restriction site-associated DNA sequencing (RAD-Seq) [8], Genotyping-by-sequencing (GBS-Seq) [9] and specific locus amplified fragment sequencing (SLAF-seq) [10]. The common feature of these methods is that one or more kinds of restricted DNA-endonuclease(s) were applied to the genome DNA based on the characteristics of the genomes of different species to build a reduced representation library (RRL) of genomic DNA without knowing the detailed information of the whole genome. Thus, each of these methods of NGS was used to construct the HDGM of several species [7,11,12]. Zhang et al. [13] constructed an HDGM of Prunus mume using SLAF-Seq. The map linked 8007 makers and spanned 1550.62 cM in length with an average marker distance of 0.195 cM. Xu et al. [14] also construct an HDGM of Cucumis sativus using SLAF-Seq. The map included 1892 markers with a total distance of 845.7 cM and an average distance of 0.45 cM between adjacent markers. Li et al. [15] construct an HDGM of Glycine max with 5785 markers, with a total distance of 2255 cM and an average marker distance of 0.43 cM. Wang et al. [4] constructed an HDGM of cotton using the RAD-Seq method and the map linkage 3984 markers with a total distance of 3499.69 cM.
In this study, a recombinant inbred line (RIL) population, containing 196 individuals was developed from an intra-specific cross between two upland cotton 0-153 and sGK9708. We attempted to use this population to construct an intra-specific HDGM of upland cotton, to identify QTLs and possibly, the candidate genes correlated to cotton boll weight. Finally, a total 5521 SNP markers were successfully applied to genotype these 196 RILs along with parents and an intra-specific HDGM was thus constructed. This map was used to identify QTLs for cotton boll weight across 11 environments.

Plant materials
The intra-specific F 6:8 recombinant inbred lines (RIL) population of upland cotton with 196 individuals was developed from a cross between homozygous cultivars 0-153 and sGK9708. Cultivar 0-153 harbored superior fiber quality traits while sGK9708 was derived from CRI41 which maintained high yield potential and wide adaptability. The details of the development of RILs have been already described by Sun et al. [16]. Additionally, the phenotypic evaluations of the RILs from 2007 to 2013 were detailed by Zhang et al. [17].

Phenotypic data analysis
Thirty normally opened bolls within five to eight fruiting branches and one to three fruiting nodes were sampled in annually September. The total seed-cotton of the 30 bolls was weighted and average boll weight was calculated accordingly. One-way ANOVA was used to test the significance of the differences in boll weight between two parents. Additionally, EXCEL 2010 was used to create the descriptive statistics including the mean value, standard deviation, skewness and kurtosis of the boll weight across the whole population.

DNA extractions and SLAF library construction and high-throughput sequencing
The leaves of the parents and the RIL population were sampled in July and stored at −70°C. The genomic DNA was extracted using the TaKaRa MiniBEST Plant Genomic DNA Extraction kit (TaKaRa, Dalian) and SLAF-seq strategy with some modifications was utilized in the library construction. Briefly, the reference genome of Gossypium hirsutum [18,19] was referred to make the pre-experiment in silico simulation of the number of markers generated by various endonuclease combinations. The SLAF library was constructed based on the SLAF pilot experiment in accordance with the predesigned scheme and eventually two endonucleases combination of HaeIII and SspI (New England Biolabs, NEB, USA) was applied to the genomic DNA digestion in our RIL population. The details of SLAF-seq strategy was described by Zhang et al. [13].
Grouping and genotyping of sequencing data SLAF markers were identified and genotyped with procedures described by Sun et al. [10] and Zhang et al. [13]. Briefly, after filtering out the low-quality reads (quality score < 20e), the remaining reads were sorted to each progeny according to duplex barcode sequences. Then each of the high-quality read was trimmed off 5-bp terminal position. Finally 80 bp pair-end clean reads were obtained from the same sample and were mapped onto the genome of Gossypium hirsutum [19] sequence using BWA software [20]. Sequences mapping to the same position with over 95 % identity were defined as one SLAF locus [13]. SNP loci in each SLAF locus were then detected between parents using the software GATK. SLAFs with more than three SNPs were filtered out first. As the sequenced size of the fragments was only 160 bp, three or more SNPs in one SLAF indicated a significantly high heterozygosity of upland cotton (more than 1 %). This would lead to a decreased accuracy and reliability of the sequencing and genotyping. The SLAFs were genotyped depending on the tags of the parents sequenced above tenfold depth and the individuals of the RIL population were genotyped based on the similarity to the parents.
As each SLAF locus harbored at most three SNP loci, it was possible that one SLAF locus could harbor at most, four SLAF alleles. The SLAF repetitiveness and polymorphism were defined based on the criteria described by Zhang et al. [13]. The repetitive SLAFs were discarded and only the polymorphic SLAFs were considered as potential markers. Only the SLAFs with consistency in the parental and RIL were genotyped.
The procedure of all polymorphic SLAF loci genotyping was described by Sun et al. [10] and Zhang et al. [13]. Before genetic map construction, all the SLAF markers were filtered using a criteria detailed by Zhang et al. [13] besides the markers with more than 40 % missing data were filtered out.

Linkage map construction
Linkage map was constructed based on the procedure detailed by Zhang et al. [13] and the cotton genome database [19]. HighMap strategy for ordering the SLAF and correcting genotyping errors within the chromosomes was detailed by Liu et al., Jansen et al. and van Ooijen et al. [21][22][23]. SMOOTH was also applied to the error correction strategy according to parental contribution to the genotypes of the progeny [24], and a k-nearest neighbor algorithm was used to impute the missing genotypes [25]. A multipoint method of maximum likelihood was applied to add the skewed markers into the linkage map. The Kosambi mapping function was applied to estimate the map distances [26].

Segregation distortion analysis
As the distortedly segregated markers showing significance between 0.001 and 0.05 (0.001 < p < 0.05) were still maintained to construct the HDGM, the region on the map with more than three consecutive adjacent loci that showed significant (0.001 < P < 0.05) segregation distortion was defined as a segregation distortion region (SDR) [11]. The size and distribution of SDRs on the map were analyzed.

Collinearity and recombination hotspot analysis
All the sequences of SNP markers that were constructed in the linkage map were aligned back to the physical sequence of the upland cotton genome through local Basic Local Alignment Search Tool (BLAST) to confirm their physical positions in the genome. Software CIRCOS 0.66 was used to compare the collinearity of markers based on their genetic positions and physical positions. The recombination hotspot (RH) was estimated based on the recombination rate of markers. If the value that the genetic distance between adjacent markers was divided by was higher than 20 cM/Megabase, the region between the two adjacent markers was regarded as RH [13].

QTL analysis using HDGM
Windows QTL Cartographer 2.5 [27] was used to identify QTLs by composite interval mapping method [28] on the environment by environment basis of the 11 environments. The LOD threshold for declaring significant QTLs included the QTLs across environments calculated by a permutation test with the mapping step of 1.0 cM, five control markers, and a significance level of P < 0.05, n = 1000. LOD score values between 2.0 and permutation test LOD threshold were used to declare suggestive QTL. Positive additive effect means that the favorable alleles come from the 0-153 parent while negative additive effect means that the favorable alleles come from sGk9708. QTLs were named and the common QTLs were identified as described by Sun et al. [16].

The candidate genes identification
The markers flanking the confidence intervals of the QTLs which can be detected in at least three environments were selected to identify the candidate genes. The sequences of these markers were aligned back to the physical sequence of upland cotton genome database [19]. Based on the position of these flanking markers, all the genes within the confidence interval were identified as candidate genes. For some of the QTLs with a large confidence interval, if the position of one marker flanking the confidence interval was too far from that of the nearest marker harbored in that confidence interval, the region between these two markers was excluded from the candidate gene identification. All the candidate genes were categorized through the gene ontology (GO) analysis. The first ten terms that have the smallest Kolmogorov-Smirnov (KS) values were considered as the enriched terms. The pathways correlated to the candidate genes were discovered by the Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. The first ten pathways with the smallest p values were considered as the enriched pathways. The candidate genes were also categorized based on their products through eukaryotic orthologous groups (KOG) database analysis.

Performance of boll weight of RIL populations
The one-way ANOVA result showed the p-value was 0.002, suggesting that significant differences of boll weight were found between the two parents. The descriptive statistical analysis results of the RIL population and parents across 11 environments were shown in Table 1. The absolute value of skewness of the mean value of the boll weight in the RIL population across 11 environments was less than one, indicating an approximately normal distribution. In all 11 environments, both the positive transgressive segregation (the observed values are higher than that of sGK9708) and the negative transgressive segregation (the observed values are lower than that of 0-153) of the boll weight in the RIL population were observed ( Table 1).

Analysis of SLAF-seq data and SLAF markers
After SLAF library construction and sequencing, 87.89 GB of data containing 443.56 M pair-end reads was generated with each read of 80 bp in length. Among them, 82.24 % of the bases were of high quality with Q20 (means a quality score of 20, indicating a 1 % chance of an error, and thus 99 % confidence) and guanine-cytosine (GC) content was 34.47 %. The SLAFs numbers of 0-153 and sGK9708 were 53,123 and 53,238, and their correspondent sequencing depths were 78.66 and 102.13 respectively. The coverage of both parents was 35 %. In the RIL population, the number of SLAFs ranged from 32,261 to 53,104 and the average number of SLAFs was 50,487. The average sequencing depth was 14.50, and the average coverage was 33.37 % (Fig. 1). The 443.56 M pair-end reads, consisting of 53,754 SLAFs, totally harbored 160,876 SNP markers, as usually one SLAF can harbor more than one and at most three SNP markers. Among the 160,876 SNP markers, 23,519 markers were identified polymorphic across the whole RIL population with a polymorphic rate of 14.62 %. All the polymorphic SNP markers were classified into four genotypes: aa × bb, hk × hk, lm × ll and nn × np. The aa × bb meant that both of the parents were homozygous in this SNP position, the genotype of one parent was aa and the other was bb; the hk × hk meant that both of the parents were heterozygosis, and the lm × ll and nn × np meant that one of the parent was heterozygosis and the other was homozygous. Only the genotype aa × bb, consisting of 18,318 SNPs, was used for further analysis. Among 18,318 markers, the marker with average sequence depths less than four were filtered with 16,490 markers left. Then the markers with polymorphism across the whole population but not between parents were excluded leaving 15,076 markers remaining. The 15,076 markers were further filtered by a criterion of more than 40 % missing data and 10,588 markers left. Finally, Markers with significant segregation distortion (P < 0.001) were filtered and the remaining 5521 markers, including the ones that showed significant segregation distortion between 0.05 and 0.001 (0.001 < P < 0.05) were used to construct the final genetic map ( Table 2).

Distribution of SNP markers' type on the genetic map
In total, 5521 SNP loci were mapped on the final linkage map and percentages of SNP types were investigated (Additional file 1: Table S1) Table S1).

Construction of the genetic map
The map harbored 5521 SNP markers, spanning a total distance of 3259.37 cM with an average marker interval of 0.78 cM. The A sub-genome harbored 3550 markers with a total distance of 1838.37 cM whereas the D subgenome harbored 1971 markers with a total distance of 1421 cM. The largest chromosome was chromosome 05, which contained 434 markers with a genetic length of 242.56 cM, and an average marker interval of 0.56 cM. The shortest chromosome was chromosome 15, which only harbored 29 markers with a genetic length of 41.39 cM and an average marker interval of 1.43 cM. The largest gap on this map was only 7.02 cM located on chromosome 26. There were totally 11 gaps greater than 5.00 cM, three of which were on chromosome 10 and with remaining eight on eight different chromosomes. The remaining chromosomes had no visible gaps (Additional file 2: Table S2, Fig. 2, Table 3).

The quality analysis of the high-density genetic map
In total, 1225 markers of the mapped 5521 showed significant (0.05 < P < 0.001) segregation distortion. These segregation distortion markers (SDMs) were located in the chromosomes with an uneven distribution in each. Among the 1225 SDMs, 579 of them were located in the   Table S3, Table 3). Collinearity analysis of the SNP loci between the genetic map and the physical map is shown in Fig 2. The results indicated that the genetic map constructed by the SNP markers which were discovered through SLAFseq had a sufficient coverage over the cotton genome. Most of the SNP loci on the linkage map were in same order as those on the corresponding chromosomes of the physical map of the cotton genome. D subgenome showed a better compatibility with the physical map as compared to the A subgenome. Chromosomes 1, 2, 3, 5, 7, and11 in the A subgenome and chromosomes 14, 15, 16 and 18 in the D subgenome showed some deviation in collinearity analysis (Additional file 4: Table S4, Fig. 3).
The result of the RH analysis showed that among the 26 chromosomes, 21 have RHs, 9 and 12 of which were in the A subgenome and D subgenome respectively. Chromosome 13 harbored the largest number of 106 RHs whereas the chromosomes 7, 15 and 18 only harbored one RH. Chromosomes 3, 5, 8, 11 and 16 did not harbor any RH. Additional information is shown in Additional file 5: Table S5, Fig. 4, and Table 3.

The candidate genes annotation
In total, 344 candidate genes were identified in the confidence intervals of stable QTLs. Except for the confidence interval of qBW-chr02-3 which has no candidate gene, the confidence intervals of all the remaining QTLs have candidate genes. The confidence intervals of qBW-chr07-4 and qBW-chr25-6 harbored only one candidate gene whereas the confidence interval of qBW-chr23-5 harbored 65 genes (Additional file 7: Figure S1, Additional file 8: Figure S2). In total, 340 of the 344 candidate genes had annotation information, among which 201, 81 and 163 had annotation information in GO, KEGG and KOG respectively. In GO analysis, 435 genes were identified in the cellular component category, 221 genes in the molecular function category, and 549 genes in the biological process category, as some of the genes had multiple functions and could be categorized into two or more function baskets. In the cellular component category, 102 genes were related to cell and 101 genes were related to cell part. In the molecular function category, 108 genes were related to catalytic activity. In the biological process category, 133 genes were related to metabolic process and 108 genes were related to cellular process (Additional file 9: Table  S7, Fig. 6). In the KEGG analysis, 81 genes were identified in 55 pathways. Six genes were found in the plant hormone signal transduction pathway, four genes were found in both the ribosome and protein processing pathways in endoplasmic reticulum In all the remaining pathways, there were no more than three genes found (Additional file 10: Table S8, Additional file 11: Table S9). In the KOG analysis, 24 genes only had the general prediction function and 12 genes had unknown function. Among the other 127 genes, 25 of them were related to posttranslational modification, protein turnover, and chaperones, 17 of them had a relation to signal transduction mechanisms, 12 of them had a relation to translation, ribosomal structure and biogenesis, 11 of them had a relation to carbohydrate transport and metabolism and 11 of them had a relation to transcription. No more than 10 genes were found in other functions in KOG classification (Fig. 6, Additional file 12: Table  S10, Additional file 13: Table S11, Table 5). Among all 344 candidate genes, 44 were identified at the nearest positions of the markers, of which the genetic position had the highest LOD values in the QTL mapping analysis (Additional file 7: Figure S1, Additional file 8: Figure S2). Among them, 43 candidate genes had annotation information except the gene Gh_D06G0216.
In the KEGG analysis, eight cand genes had annotation information, five of which were related to hypothetical protein, with the other three related s-adenosylmethionine synthetase, polygalacturonase precursor and indole-3acetic acid-amido synthetase GH3.3 respectively. In KOG analysis, 18 candidate genes had annotation information. Two had unknown function, three were correlated to signal transduction mechanisms, two were correlated to translation, ribosomal structure and biogenesis, two were correlated to posttranslational modification, protein turnover, and chaperones, two were correlated to inorganic ion transport and metabolism, two were correlated to secondary metabolites biosynthesis, transport and catabolism and two were correlated to carbohydrate transport and metabolism. There was an additional gene correlated to lipid transport and metabolism, one correlated to the cytoskeleton, one correlated to coenzyme transport and metabolism, one correlated to energy production and conversion, one correlated to RNA processing and modification and one correlated to cell cycle control, cell division, and chromosome partitioning. In the GO analysis, 26 of the 43 had annotation information, among which, 21 were correlated to biological process, 21 were correlated to molecular function and 15 were correlated to cellular component.

The characteristics of the method SLAF-seq
For the simplified genome sequencing, the key step was to make the simplified genome representative of the whole genome. This was completed through the election of suitable restriction endonuclease(s). When restriction endonuclease(s) were applied to the genome digestion and selected properly, the fragments generated by nextstep sequencing would be a better representation of the genome. In the previous studies, usually a few common restriction endonucleases such as EcoRI, SbfI and PstI were used to digest the genome of various species [29]. Typically, only one restriction endonuclease was applied to the genome digestion [30][31][32]. The genome specificity of the species was ignored [29][30][31][32][33]. This might lead to uneven distribution of the selected fragments in the whole genome and thus make the simplified genome less representative. Eventually the number of markers developed and reliability of the genetic map might both be negatively affected [29,33]. The SLAF-seq strategy, an effective NGS-based method for large-scale SNP discovery and genotyping, has been applied successfully in various species [12][13][14]. Compared with other tools for large-scale genotyping with NGS technology, such as RAD-seq and GBS, SLAF-seq displayed some unique superiorities. First, the pre-design scheme with different restriction endonuclease combinations was applied to simulate in silico the result script of endonuclease digestions based on the sequencing database of A, D and AD genomes of Gossypium [19,34,35] (Fig. 7). The information on genomic GC content, repeat conditions and genetic characteristics were referred to make up the digestion strategy. After two endonucleases combinations were applied to the genome digestion, the fragments ranging from 500 to 550 (including adapter) base pairs we harvested for sequencing create a better representation of the genome of Gossypium hirsutum L. Second, a dual-chr01 chr02 chr03 chr04 chr05 chr06 chr07 chr08 chr09 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chr23 chr24 chr25 chr26 Fig. 4 The genetic position of the recombination hotspots in the whole 26 chromosomes index will provide a higher sequence quality and more stable sequence depth among each sample, which is the key to developing high quality marker. Third, the marker underwent a series of dynamic processes to discard the suspicious markers during each cycle, until the average genotype quality score of all SLAF markers reached the cut-off value. As a result, the markers we developed might have a consistent distribution throughout the genome and    the thus-built map might have a better coverage of the genome and be more reliable for the next step research activities.

Genetic map construction
In previous studies, most of the genetic maps of cotton were based SSR markers. The low polymorphic rate of the SSR markers makes the SSR marker based maps unable to harbor a sufficient number of markers with a comparative poor coverage of the genome and low resolution. In most cases, these maps have large gaps, and sometimes the gap divides the chromosome into two or more linkage groups [16,36,37]. When the populations developed from interspecific crosses between G. hirsutum and G. barbadense were applied to the genetic map construction, the coverage and resolution of the map could be greatly improved [38][39][40]. However, the pragmatic applications of the genetic map developed from the interspecific populations have limited values as the polymorphic loci between G. hirsutum and G. barbadense may not show polymorphism within the cultivars of G. hirsutum. SNP markers could improve the coverage and resolution of the genetic map efficiently. Wang et al. [4] used SNP markers to construct a map through the RAD-seq, which harbored 3984 markers with a total distance of 3499.69 cM and an average distance of 0.88 cM. In our research, we constructed an HDGM through the SNP markers developed through the SLAF-seq method. Even though the map harbored a great number of markers and was more saturated than most of the previous ones, the total distance it covered was approximately the same as In the initial steps of marker development through SLAF-seq, the quantities of SLAFs developed were about the same sizes in the different chromosomes. After several steps of screenings, the remaining numbers of SNPs for map construction varied greatly among the chromosomes, and the reduced number of remaining SNPs contributed to the shortness of some chromosomes. The collinearity comparison between the genetic map and the physical one validates the reliability of the constructed map. However, a better understanding of the genetic structure of these chromosomes might need an integrative analysis.

The QTL of boll weight traits identification
Previous QTL studies were primarily focused on the fiber quality traits [1,2,40], while the research activities on yield traits especially the boll weight were seldom reported. The boll weight trait was significant and made a considerable contribution to the yield of cotton. Qin et al. [41] used the four-way cross (4WC) population to construct a map and identified only one QTL of boll weight on chromosome D2. The confidence interval of this QTL harbored three markers and spanned a distance of about 14.5 cM. Liu et al. [42] used RIL population to construct a map and identified the QTL of boll weight  -SNP161-CRI-SNP168  CRI-SNP161-CRI-SNP166  21363529-22191102  5  8   qBW-chr02-3  CRI-SNP511-CRI-SNP512  CRI-SNP511-CRI-SNP512  2428231- using the mean value of the data from four environments. Eighteen QTLs for boll weight were detected on 15 chromosomes. The confidence intervals of these QTLs harbored two or three markers. Yu et al. [43] used an interspecific backcross inbred line (BIL) population developed with a G. hirsutum and a G. barbadense to construct a genetic map and identified 10 QTLs on eight chromosomes (chromosomes 5, 11, 18, 21, 22, 24, 25, and 26). The confidence intervals of these QTLs also harbored two or three markers and spanned distances from 2 to   Fig. 6 The annotation of the candidate genes in the confidence intervals of the stable QTLs. a The annotation of the candidate genes in the confidence intervals of the QTLs that could be detected in at least three environments through GO analysis. b The annotation of the candidate genes in the confidence intervals of the QTLs that could be detected in at least three environments through KOG analysis 30 cM. In our study, we identified the QTL of the boll weight in 25 chromosomes except chromosome 8. Among them 16 QTLs were detected in at least three environments and were present on 11 chromosomes (chromosomes 1, 2, 3, 5, 7, 9, 13, 16, 22, 23, and 25 respectively). The confidence intervals of these QTLs harbored from two to 26 markers ranging from 0.7 to 13.9 cM. This implies that our results of QTL identification are more concise and accurate than previous studies and could be useful for future research looking at gene identification or cloning from these QTLs, or even breeding practices using MAS.

The direction of the QTLs
Among the 16 stable QTLs that can be detected in at least three environments, eight had positive additive effects whereas the other eight had negative additive effects. This indicates that both the higher boll weight value parent sGK9708 and lower boll weight value parent 0-153 could contribute positive additive QTLs to increase the boll weight. This could be a possible factor behind the difference in the boll weight trait between the parents 0-153 and sGK9708. Theoretically, the greater the difference of one trait between the two parents, the higher the possibility that the positive additive effect of the QTLs would come from one parent. The RIL population was constructed primarily based on differences in fiber quality traits especially fiber strength between the parents 0-153 and sGK9708, therefore, the difference of fiber strength is larger than that of any other traits between 0-153 and sGK9708. In Sun's report [16], seven QTLs of fiber strength were identified using this population, among which only one QTL had negative additive effects whereas the remaining six QTLs had positive additive effects. In Zhang's report [17], seven QTLs of fiber strength on chromosome 25 were identified using the same population, all of which had a positive additive effect. In identifying the QTL clusters, the clusters that harbor all desired QTL alleles would make the greater contribution to the breeding practice when MAS is applied.

Candidate gene functioning analysis
Among all 340 candidate genes being annotated in at least one channel of KOG, KEGG, and GO, some might be related to the boll weight trait. In KOG analysis, there were 21 function baskets. The posttranslational modification function, protein turnover, chaperones and signal transduction mechanisms harbored the largest number of candidate genes. Among the 44 genes located closest to the markers of genetic position, three genes Gh_A07G1188, Gh_A07G1197and Gh_D09G1606 had a relation to signal transduction mechanisms. Two genes, Gh_A05G1210 and Gh_D04G1531 were related the function posttranslational modification, protein turnover, and chaperones. Two genes, Gh_A07G1187 and Gh_A13G0858, had the translation function, ribosomal structure, and biogenesis, though this function basket did not harbor a large number of candidate genes. As the posttranslational modification, protein turnover and ribosomal structure were relative to the protein synthesis, it is probable that the genes correlated to this function contribute to the boll weight trait. In KEEG analysis, the first three pathways which harbored the largest number of genes were plant hormone signal transduction, and protein processing in endoplasmic reticulum and ribosome, harboring six genes, four genes and four genes respectively. Of these 14 genes, three were located at the nearest positions of the markers, genetic position of which had the highest LOD values in the QTL mapping analysis. The gene Gh_A13G0858 has a relationship to the ribosome, whereas genes Gh_A13G0392 and Gh_D06G0187 have a relationship to the plant hormone signal transduction. As the ribosome has a relationship to protein synthesis and some plant hormones such as auxin and gibberellin, these genes could contribute to the plant growth and eventually to the boll weight trait, particularly the gene Gh_A13G0858.
Although these genes were located the nearest position of the markers, genetic position of which had the highest LOD values in the QTL mapping analysis, but there still lacks direct evidence to prove that the function of these genes was correlated to the boll weight trait.

Conclusions
This research reported the first HDGM of Upland Cotton (Gossypium hirsutum) with a RIL population using SNP markers developed by SLAF-seq. The HDGM had a total number of 5521 markers and a total distance of 3259.37 cM with an average marker interval of 0.78 cM. There were no gaps greater than 10 cM.We also identified QTLs of boll weight trait across 11 environments and identified candidate genes. Totally, 146 QTLs of boll weight was identified and 16 of them were detected in at least three environments with a stable QTL. Three hundred forty-four candidate genes were identified in the confidence intervals of stable QTLs and 44 of them were located in the nearest positions of the markers. The result of this research would provide information for the next phase of research such as fine mapping, gene functional analysis, pyramiding breeding and marker-assisted selection (MAS) as well.