A Transcriptomic Variation Map Provides Insights into The Genetic Basis of Pinus massoniana Evolution and Association of Oleoresin Yield

Background: Masson pine (Pinus massoniana Lamb.), the dominant native coniferous species in South China, is commercially important for timber and oleoresin. However, knowledge of the genetic variability of the masson pine germplasm is still scarce. Here, the genetic diversity and population structure of masson pine germplasm were assessed using 204 wild accessions from 10 main distribution regions by 94,194 core SNPs obtained from transcriptome sequencing data. Results: The average expected heterozygosity was 0.2724, implying abundant genetic diversity within masson pine germplasm. Analysis of molecular variance (AMOVA) revealed that 3.29% variation sourced from the genetic differentiation. Structure analysis identified two geographically distinct groups. Discriminant analysis of principal components (DAPC) showed one geographically distinct group was divided two clusters furtherly. Sichuan and Chongqing provenance is the geographical origin, and diffuses outward along two lines. Oleoresin yield might be reflected by the evolution, and exhibits two different changing trends between two diffusing line. It may be associated with the genes of chitinase, CYP720B, Cytochrome P450, ABC transporter and AP2/ERF based on SNPs and expression. Conclusions: SNP markers by transcriptome sequencing have a strong power in evaluating genetic diversity within species and the genetic control of objective trait. The function of these genes will be verified and the strong associated genes with oleoresin yield will be used in the improvement of oleoresin yield by early genotype selection or genetic engineering.

group was divided two clusters furtherly. Sichuan and Chongqing provenance is the geographical origin, and diffuses outward along two lines. Oleoresin yield might be reflected by the evolution, and exhibits two different changing trends between two diffusing line. It may be associated with the genes of chitinase, CYP720B, Cytochrome P450, ABC transporter and AP2/ERF based on SNPs and expression.
Conclusions: SNP markers by transcriptome sequencing have a strong power in evaluating genetic diversity within species and the genetic control of objective trait. The function of these genes will be verified and the strong associated genes with oleoresin yield will be used in the improvement of oleoresin yield by early genotype selection or genetic engineering.

Background
As a dominant native tree species, masson pine (Pinus massoniana Lamb.) is commercially important conifer for timber and oleoresinin China. The natural distribution extends from 21°41′N to 33°56′N and 102°10′E to 123°14′E with the planting area of 2 million hectares [1]. The provinces of Guangdong, Guangxi, Hu'nan, Sichuan, Chongqing, Guizhou, Zhejiang, Fujian and Jiangxi are the main natural distribution regions of masson pine in China [2]. Because masson pine has the characteristics of fast growth and tolerance tobarren soil, it is often considered as a pioneer species for afforestation in bare mountain fields. Genetic diversity is critical for long-term survival species, which made the 3 species adapt to various abiotic and biotic stress and avoid extinction [3]. Large genetic variation can be observed among and within the natural populations based on provenance or family analysis for most trees species in growth, terpenoid, resistance, etc. [4][5][6]. To learn the genetic variation on main economic traits of masson pine, large-scale provenance experimentations have been carried in China since 1978. Two whole native range provenance trails and many partial native range provenance trails were built in China [2], which provided good materials to reveal the interplay and significance of several evolutionary forces causing phenotype diversity, and formulate gene conservation strategy capturing the natural genetic diversity within spieces. A classical geographical variation patternin latitude was found for diameter at breast height (DBH) in masson pine [7].
As a secondary substances of masson pine, oleoresin is an important natural source in chemical industry [8,9], defensing against insects and disease [10,11]. It also can be exploited as advanced liquid biofuels [12].Significant genetic variation in oleoresin yield was also observed among different families of masson pine, which ranged from 14.12 and 50.55 g per day [13,14] also reported that variation in oleoresin yield was heritable in P. taeda and could be increased 1.5-to 2.4-fold in one generation through selection.
Molecular markers are very useful in identifying germplasm, assessing biodiversity and describing the geographic patterns of genetic variation. Single nucleotide polymorphisms (SNPs) are commonly used in genetic study. Taking advantage of next generation sequencing (NGS) technologies, we could rapidly develop millions of SNPs for crops with low cost [9]. The high-throughput SNPs have been successfully used to evaluate genetic diversity, to infer population structure [15,16] and kinships [17]. As an important forest tree in South China, high density SNPs map is essential for genetic innovation and traits improvements in masson pine's future breeding. However, there are no reports about the whole genome sequences and developing SNP markers to study the genetic diversity and structure for masson pine. To date, only partial masson pine germplasms have been analyzed using random amplification polymorphic DNA (RAPD) [18], inter-simple sequence repeat (ISSR) [19], simple sequence repeat (SSR) [20] and inter-retrotransposon amplified polymorphism (IRAP) [21].
Associative transcriptomics has largely contributed to identify sequence polymorphisms and 4 transcript abundance linked to phenotypic variation, especially for non-model species [22,23]. In addition, high quality full length transcripts are critical and useful for functional assays and understanding of genetic diversity [24]. In this work, we firstly constructed high quality transcript reference sequences through combination of full-length transcriptome and NGS-based unigenes. Then RNA-Seq of 204 representative accessions was adopted for de novo SNP discovery to generate a genome-wide variation map. The aims of our study were to 1) assess the genetic diversity, population structure and geographic origin of masson pine; 2) reveal the genes associated with oleoresin yield.
The results would be useful for managing this species and expounding the forming mechanism of high-yielding oleoresin.

Sequencing and variation discovery
The research and breeding program of masson pine have been hampered by lacking of high quality genome sequences, because of its extremely large and complex genome presumed by close species loblolly pine (P. taeda) [25]. To overcome this obstacle, we constructed a high quality full length transcripts data set from secondary xylem transcriptome using PacBio Single Molecule, Real-Time (SMRT) Sequencing platform. A total of 81,837 high-quality and non-redundant full-length transcripts were obtained from 18 Gb PacBio subreads. To explore the origins and patterns of genetic diversity, we also designed the population transcriptome experiments for 204 geographically diverse masson pine genotypes, which collected from main habitats in China. Totaling of 341,714 non-redundant unigenes were assembled. After combination of full length transcripts and unigenes, 423,288 nonredundant transcripts considered as reference sequences for further analysis (Additional file 1: Table   S1). On average, 85.02% of the reads for each sample were successfully mapped to the reference sequences, suggesting the high completeness of reference transcripts (Additional file 2: Table S2).
A total of 1,326,230 single nucleotide polymorphisms (SNPs) and 153,459 insertions/deletions (InDels) were detected from transcriptomes of 204 genotypes using GATK packages [26] with an average SNP density of 3.13 per transcript (Additional file 3: Table S3). Among these SNPs, 94,194 core SNPs with minor allele frequency (MAF) ≥ 0.05 and missing genotype calls <5% were remained for further analysis, occupying 7.1 % of total set. These core SNPs included 23,864 (25.33%) non-synonymous (nSNPs) (Additional file 4: Table S4). This transcriptome variation map will benefit core germplasm identification, genetic variation research and artificial breeding.

Genetic diversity of Masson pine
The genetic diversity among P. massoniana germplasms from mainly distribution regions was  Table 2). The results showed that the differentiation among populations explained by 3.29% of the total variance. Only 0.01% of the variation was found among different subpopulations, suggesting the closed kindship within different subpopulations. In summary, our variant data set provides a comprehensive overview of genomic diversity at variously populational scales, and represents a rich source of genetic information for exploitation by both the academic and agricultural research communities.

Construction of P. massoniana core germplasm
The allelic diversity among P. massoniana accessions can be maximized by SNP markers. The redundancy curve showed that the masson pine allelic diversity could be represented by more than 40 core germplasms. These 40 representative genotypes which accounted for only 20% collection could represent more than 90.7% of the allelic diversity ( Fig. 1 and Table 3). Therefore, the minimum size of the core germplasm could constructed by these 40 representative accessions, including 9 accessions from Zhejiang, 9 accessions from Guizhou and 7 accessions from Sichuan and Chongqing (Additional file 5: Table S5). To our knowledge, this is first comprehensive core germplasms 6 identification based on high density SNPs map in large population scale, which is valuable for P. massoniana breeding practices.
Population structure of P. massoniana germplasm To further understand the evolutionary history of masson pine, we used ADMIXTURE [27] to estimate ancestry proportions for each accession. Genetic assignment analysis showed an optimal value of K = 2, which clearly separated the accessions of Chongqing and Sichuan from other wild genotypes (Additional file 6: Fig. S1). The first group which mainly included the clones from Chongqing and Sichuan provinces has high level signal of inter population admixture ( Fig.2A). We further estimated the genetic diversity of different clusters. The genome-wide nucleotide diversity (π) of Cluster I (2.91×10 -2 ) was higher than those of Cluster II (2.77×10 -2 ) and Cluster III (2.83×10 -2 ), exhibited the highest diversity level. This result is also supported by H e value, which revealed the sequence diversity based on heterzygous sites (Table 4). . The Nei's genetic distance showed the values ranging from 0.135 (Cluster II vs. Cluster III) to 0.303 (Cluster I vs. Cluster II), while the pairwise F st ranged from 0.024 (Cluster II vs. Cluster III) to 0.110 (Cluster I vs. Cluster II) and the Nei's and F st genetic distance of Cluster I vs. Cluster II was higher than that of Cluster I vs. Cluster III (Additional file 8: Table S7). These observations suggest that masson pine of sichuang basin has been maintained high genetic diversity, and has larger differentiation with that of Southeast China.

Geographical origin and diffusion of P. massoniana germplasm
To further elucidate the evolution map and spread pathway, we examined the phylogeny of 204 masson pine genotypes by building a neighbor joining phylogenetic tree (Fig. 2D). Meanwhile, P.taeda was assigned as outgroup of a maximum likelihood tree to identify the earliest diverged population, considering as progenitors of modern P. massoniana (Additional file 9: Fig. S2). The phylogenetic tree showed that the genotypes from Sichuan Basin (Sichuang, Chongqing) was closest to P.taeda and followed by other clades, suggesting that Sichuan Basin is the geographic origin of masson pine.
Sichuan Basin was one of glacial refugees for many species at the last Pleistocene glaciations [28,29], which may rescue the masson pine from extinction event. The masson pine gradually migrated to Zhejiang. This hypothesis strongly supported by population structure evidence (Fig. 3B). The population differentiation is significantly larger between Sichuang/Chongqing and Guizhou (F st = 0.13) than those of other populations, implying the strong genome variation for new natural adaption when firstly transfer the habitat from basin to plateau. The neucleotide diversity is slightly higher in progenitors of Sichuang/Chongqing population (Fig. 3A). Signals of introgression were detected between the populations for two dissemination lines by the TreeMix program. The hybridization signal from Sichuan Basin population to Guangdong/Guangxi population was detected (Fig.3C). The introgression might be mainly mediated by human activity.

Associative transcriptomics with oleoresin yield
The oleoresin yield in xylem varied substantially in the 204 clones of masson pine, its levels varying from 0.00 to 6.07 g·cm -1 ·d -1 (Additional file 10: Table S8). The oleoresin yield appeared to be positive skew distribution among the accession (Fig. 4A). The oleoresin yield of accessions from Sichuan basin and Guizhou is significantly lower than that of Hunan accessions (Fig. 4B). In Central South China spreading path, oleoresin yield is slightly reducing when masson pine spreaded into Guangdong and Guangxi, but is still higher than that of accessions from Sichuan basin and Guizhou. In Southeast China spreading path, the oleoresin yield significantly increased when masson pine spreaded into the Southeast China, especial for Anhui, Zhejiang and Jiangxi.
Associative transcriptomics analysis identified 121 SNPs from 109 transcripts that were significantly associated with oleoresin yield at the significance level of P< 10 -6 ( Fig.4C, Additional file 11: Fig. S3 and Additional file 12: Table S9). The most significant SNP (c51955_f1p3_1546, R 2 =0.51, P =3.74E-19) is localized in the transcript annotated as chitinase class I ( Table 5). The mutated SNP took place the upstream of coding region, but the expression of the transcript (c51955_f1p3_1546) was not significantly correlated with oleoresin yield.
The family of CYP720B belonging to cytochrome P450 monooxygenases (P450), are an important enzyme involved in the biosynthesis of diterpene resin acids as the main content of oleoresin [30].
One CYP720B (c19795_f1p0_1763) and one Cytochrome P450 (c9591_f1p0_1663) were found sequence associated with the oleoresin yield (9.60E-07, 1.85E-07). The mutated SNP of CYP720B led to non-synonymous mutations with the transition of codon CTC to TTC. The mutated SNP from Cytochrome P450 (c9591_f1p0_1663) belong to synonymous mutation. The expression of these two transcripts (c19795_f1p0_1763, c9591_f1p0_1663) was significantly correlated with oleoresin yield (P = 3.61E-08, P = 2.13E-08). The result of qRT-PCR using the high-and low-yielding oleoresin accessions showed the higher expression level for these two transcripts in high-yielding oleoresin masson pines (Fig.4D).
The sequences of AP2 domain transcription factor and ABC transporter were associated with oleoresin yield in P. teada [14]. In this study, two SNPs from AP2/Ethylene-responsive transcription factors (ERFs) (c24091_f1p1_1286, c8825_f1p0_1733) and one SNP from ABC transporter (c189021.graph_c0) was also found significantly associated with oleoresin yield in masson pine. The SNP from AP2/ERF (c8825_f1p0_1733) caused non-synonymous coding and the coding amino acid changing from cystine to arginine.
In addition, one SNP from the transcript of tubulin alpha chain (c20772_f1p4_1467) was significantly associated with oleoresin yield in sequence (P = 8.73E-16) and expression level (P = 4.83E-08) simultaneously. The SNP caused non-synonymous mutation with different codon CTC and TTC.
However, the function of tubulin alpha chain is unclear during the biosynthesis of oleoresin.

Discussion
The SNP markers have been used to evaluate diversity within many species, such as Populus trichocarpa [31], Vitis vinifera [32], Ginkgo biloba [33]. SNPs called from transcriptome sequencing is a more efficient strategy for characterizing diversity in non-model or massive-genome species, since the sequences are detected on the coding regions rather than the whole genome. In this study,  (Table 1). Either H e or H 0 can be used to assess genetic variation, but the H 0 value is often influenced by the level of inbreeding within population. Therefore, H e is commonly more applied for comparing genetic diversity among different species or population within the same species [3].
Masson pine has continuous distributions in larger native region of China. Hamrick et al. [34] found that the average expected heterozygosity within populations of tree species with widespread distributions was 0.228 using allozyme analyses. Huang and Zhang [35] reported that the H e value was 0.27 by isozyme analysis on six natural population of masson pine in Guizhou province. The genetic diversity of five population of masson pine in Fujian province was assessed and found that the average H e was 0.22 [36]. In this study, higher genetic variability detected might be attributed to the wider sampling regions, which almost over whole native region of masson pine. Similar consequence was observed for natural populations of Scots pine (P. sylvestris) [37,38]. In addition, the difference of genetic diversity assessed in these studies could be also caused by the different marker types, sampling locations and sizes [39].
Both structure analysis and DAPC obviously separated Sichuan and Chongqing samples from the others ( Fig. 2A, C). This differentiation was also in agreement with the results of F st and Nei's genetic distance values (Additional file 9: Table S7), which revealed the germplasm from Sichuan and Chongqing had the highest values for F st and Nei's genetic distance by DAPC, respectively. Although the structure analysis showed the minimum cross-validation error at K = 2, the cross-validation error at K = 3 was only slightly higher than that at K = 2. When K = 2, most of the germplasm from the other provinces not including Sichuan and Chongqing, were grouped one cluster ( Fig. 2A). However, this cluster were divided into two groups at K = 3, which was strongly correspondence with the clusters of DAPC, despite with minor differences in the member of each cluster. The differentiation between Cluster II and Cluster III was relatively small with the F st and Nei's genetic distance values of 0.024 and 0.135 respectively, suggesting the masson pine germplasm from South-central and Southeast China have a closer relationship than that from Southwest China. In addition, compared to the other two clusters, Cluster I composed of Sichuan and Chongqing germplasm located at the Southwest China had the highest genetic diversity with the H e of 0.318.
Climate is one of main effecting factors for adaptive evolution of forest trees [40,41]. In the Northern Hemisphere, warm subtropical and temperate climates with rich gymnosperms in the Eocene turned into cold and strong seasonal climates from the Oligocene onwards over the Cenozoic in the middlelatitude and high-latitude landmass, especially in the Quaternary with large-scale ice cover and glaciations [42][43][44]. Many tree species were extinction suffer severe cold during this period. However, some tree species better adapted cooler conditions sustained. In southern China, complex topography made numerous temperate forests survived the last glacial maximum at the "refugia" [45], such as . Therefore, the results suggested that Sichuan Basin is one main refugium for many species [46]. These species were expanding to lower elevations in the glacial periods, and reteating the refugia at higher elevations during the interglacial stages [45,47,48].
To explore the evolutionary history of masson pine, we use loblolly pine (P. taeda) as the reference. It expounded that masson pine from Sichuan Basin was the geographic origin. This is concordant with previous reported by Qin [46] through observing the characteristics of the needles. However, structure analysis showed two geographically distinct groups and DAPC identified three clusters in this study, which suggested the genes have been changing to adapt the habitat.
Masson pine spread to Guizhou province firstly from Sichuan Basin. Guizhou province is neighboring with Guangxi and Hunan provinces, but masson pine only spread to Hunan from Guizhou, subsequently. It might that Yunnan-Guizhou Plateau was the barrier hindering the spread of masson pine from Guizhou to Guangxi. Although Guangxi provenances were not highly distinguished using structure, DAPC and cladogram with other several provenances, the difference was significant between Guangxi and Guangdong provenances and the other provenances for growth traits. The growth of Guangxi and Guangdong provenances was faster, which is related to the thermal resources of origins [7].
The breeding can be accelerated by selecting the genes related to target trait. NpABC1 was reported the first transporter involved in the secretion of terpeniods in soybean [49]. In conifers, oleoresin is transported from living cells to resin ducts and flowing upon the wounding in stems after stem suffering abiotic stimuli [50]. Westbrook et al. [14] found that SNPs located in ABC transporters were associated with oleoresin yield and inferred ABC transporters participate in oleoresin transportion. In this study, the results of SNPs also indicate that ABC transporters were significantly associated with oleoresin yield, which suggested that ABC transporters may play an important role in regulating the oleoresin yield through changing the sequences.
Chitinase play a key role in modifying cell wall structure. Zhong et al. [51] found the mutant of Chitinase (elp1) would lead ligin to be ectopically deposited in the stem of Arabidopsis, and walls of the lignified cells were not thickened. The function of chitinase might affect the transport rate of oleoresin from living cells to resin duct.
Most of Cytochrome P450s have been reported involving in the progress of secondary metabolism [52]. The CYP720B gene family of Cytochrome P450s is specific for conifer and can catalyze consecutive oxidation steps in the biosynthesis pathway of various diterpene resin acids as the main component of oleoresin [30]. We found that one SNP in CYP720B and one SNP in Cytochrome P450 were significantly associated with oleoresin yield, and the SNP in CYP720B led to nonsynonymous mutation. Therefore, the SNP in CYP720B was inferred have an important influence in determining oleoresin yield by changing sequence and expression level.
Ethylene can induce the biosynthesis and the formation of traumatic resin ducts in many conifers [53]. AP2/ERF transcription factors are involved in the regulation of ethylene-responsive gene expression in the ethylene signaling pathway during abiotic stress. Over expression of OsEREBP1belonging to ERF family would cause increased expression of genes relating to lipid metabolism in rice [54]. In P. taeda, one SNP in AP2 domain transcription factor was also associated with oleoresin yield [14], which is verified by our results in masson pine. 13

Conclusion
It is important to understand the genetic architecture for improving the oleoresin yield in genetic breeding process. Although the genome of masson pine has not been sequenced, we obtain satisfactory result on genetic diversity, population structure and trait-gene association based on 94,194 SNPs using the full-length transcriptome as reference. Masson pine is clearly differentiated two groups and Sichuan and Chongqing provenance is the geographical origin, then masson pine diffuses outward along two lines. Oleoresin yield exhibits two different changing trends between two diffusing line, and associated with the genes of chitinase, CYP720B, Cytochrome P450, ABC transporter and AP2/ERF, among which some genes was also confirmed in other conifer. The function of these genes will be verified in further work.

Sample collection
A clonal test of masson pine, located at Laoshan Forest Farm in the western of Zhejiang province, China (119°02′ E,29°33′ N; altitude, 152 m above sea level), was used for this study. This trial included 400 clones (genotypes) obtianed from 10 provinces or municipality. In the 1980's, a national technical cooperation group for masson pine was established in China, and the scions for these clones In this study, 204 healthy clones of masson pine from 10 provinces or municipality have been 14 selected (Additional file 13: Table 10). Before 1997, Chongqing was a city of Sichuan province, we collected scions from Sichuan and Chongqing both belonging to Sichuan Basin. Therefore, the analysis would be carried considering the germplasm from Chongqing and Sichuan as a population.
Oleoresin yield was measured according Liu's method [55] from May to October in 2017 and 2018.
The oleoresin yield of each tree was calculated as the yield of the individual per day per cm streak length in grams. Simultaneously, 5 mm deep fresh secondary xylem tissues adjoining cambium were harvested from sample trees after removing the bark, phloem. These samples were put into liquid nitrogen immediately in field and then stored at -80 °C for RNA extraction. These experiments were under taken in Research Institute of Subtropical Forestry, Chinese Academy of Forestry, China.

RNA extraction and PacBio-based sequencing
Total RNA from each sample was separately extracted and evaluated according to Liu's method [56].
Briefly, total RNA from each sample was extracted using the Plant RNA kit RN38 EASYSpin plus (Aidlab Biotech, Beijing, China). The concentration and integrity of the total RNA was detected using an Ultraspec TM 2100 Pro UV/visible spectrophotometer, and an Agilent 2100 Bioanalyzer. High-quality RNA samples were used to constructing cDNA libraries. One microgram RNA for each sample was equally pooled together and full-length cDNA was synthesized using the SMARTer™ PCR cDNA Synthesis Kit. The size of the full-length cDNAs were selected using Blue Pippin (SageScience, Beverly, MA, USA) and for three libraries of differently sized cDNA (1-2 kb, 2-3 kb, and > 3 kb) were build.
Then, the size distribution of cDNA was quantified using Qubitfluorometer and the quality of three libraries was assessed using the Agilent Bioanalyzer 2100Bioanalyzer. Subsequently, SMRT sequencing was carried by the Pacific Biosciences RS II platform in Biomarker Technologies Corporation.

Next-generation sequencing
The mRNA was obtained from high-qualitytotal RNA for each sample using magnetic beads enrichment procedure. Fragmentation Buffer was used to fragment mRNA randomly. The first-and second-strand cDNA were synthesized, respectively. All of cDNAs were purified using AMPure XP beads. After end repairing, adding A and adaptor ligation, the fragment size of purified cDNA was 15 selected using AMPure XP beads. Then, cDNA fragments were enriched by PCR amplification and the quality of cDNA library for each sample was assessed using Qubitfluorometer and the Agilent Bioanalyzer 2100 Bioanalyzer. Finally, the qualified cDNA library was paired-end sequenced on the Illumina HiSeq™ 2000 sequencing platform.
Quality control of RNAseq data Low quality reads were filtered based on the following four rules: 1) If one end of a pair-end read had > 5% "N" bases, then the pair-end read was removed; 2) For each pair-end read, if one of them had an average base-quality less than 20, then they were removed; 3) For each read, we trimmed its 3' bases if their quality scores are less than 13. The trimming was stopped at the base with quality score ≥ 13. After trimming, if the number of remaining bases was less than 40, then the pair-end reads were removed; 4) Duplicates of pair-end reads were removed. Clean data were then used to call both SNPs and insertions/deletions (InDels).

Phylogenetic analyses
For phylogenetic tree, the genome of loblolly pine was downloaded firstly from NCBI database (SRX4454630). Then the genome of loblolly pine was aligned with the full-length transcriptome sequences. Subsequently, SNPs were called from genome of loblolly pine. Then, the phylogenetic tree visualizing and editing assigning were performed using ITOL (http://itol.embl.de/). The divergence time between masson pine and loblolly pine was obtained by the online Time Tree software Population structure analyses ADMIXTURE software [27] was used to visualize population genetic structure. The preset ancestral population numbers was ranging from K = 1 to K = 10.The most likely number of ancestral genetic groups was determined by the minimum K value on the cross-validation curve.

Discriminant analysis of principal components (DAPC)
For DAPC, genetic data were first transformed into uncorrelated components using PCA. The number of genetic clusters was then defined using k-means, a clustering algorithm that looks for the value of K that maximizes the variation between groups. The Bayesian Information Criterion (BIC) was calculated for K = 1-40 and the K value with the lowest BIC was selected as the optimal number of clusters. A discriminant analysis was then performed on the first 120 principal components using the function DAPC to efficiently describe the genetic clusters.

Identification of genes associated with oleoresion yield
The association between SNPs and oleoresin yield was carried by the mixed linear model (MLM) using TASSEL [58]. The P-value corresponding to each association was calculated, and the association was significant with the P-value ≤ 1.06E-5, which was estimated using 1/n named Bonferroni correction (n is the number of SNP markers).

Quantitative RT-PCR analysis
Ten high-and low-oleoresin yielding RNA samples were used in qRT-PCR. The primer pairs (Additional file 13: Table S10) for seven genes of chitinase, tubulin alpha chain, AP2/ERF, ABC transporter, CYP720 and cytochrome P450 design and the cDNA was amplification according to Liu's method [56], and the expression levels of genes were calculated with the 2 −ΔΔCt method [59]. Elongation factor 1alpha (EF 1-alpha) was used to normalize the transcript profiles.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
All data generated or analyzed during this study are included in this published article (and its additional information files). Sequencing data is available through NCBI database (SAMN13566835).  Figure 1 Core collection sizewith the proportion of alleles captured by SNP markers in masson pine