Comparative transcriptome sequencing and de novo analysis of Vaccinium corymbosum during fruit and color development

Background Blueberry is an economically important fruit crop in Ericaceae family. The substantial quantities of flavonoids in blueberry have been implicated in a broad range of health benefits. However, the information regarding fruit development and flavonoid metabolites based on the transcriptome level is still limited. In the present study, the transcriptome and gene expression profiling over berry development, especially during color development were initiated. Results A total of approximately 13.67 Gbp of data were obtained and assembled into 186,962 transcripts and 80,836 unigenes from three stages of blueberry fruit and color development. A large number of simple sequence repeats (SSRs) and candidate genes, which are potentially involved in plant development, metabolic and hormone pathways, were identified. A total of 6429 sequences containing 8796 SSRs were characterized from 15,457 unigenes and 1763 unigenes contained more than one SSR. The expression profiles of key genes involved in anthocyanin biosynthesis were also studied. In addition, a comparison between our dataset and other published results was carried out. Conclusions Our high quality reads produced in this study are an important advancement and provide a new resource for the interpretation of high-throughput data for blueberry species whether regarding sequencing data depth or species extension. The use of this transcriptome data will serve as a valuable public information database for the studies of blueberry genome and would greatly boost the research of fruit and color development, flavonoid metabolisms and regulation and breeding of more healthful blueberries. Electronic supplementary material The online version of this article (doi:10.1186/s12870-016-0866-5) contains supplementary material, which is available to authorized users.


Background
Blueberry (Vaccinium corymbosum) is an economically important small fruit crop and a member of Ericaceae family which includes many species, such as blueberry, cranberry (V. macrocarpon), lingonberry (V. vitis-idaea), rhododendron, and more than 400 other species [1,2]. Three major types of blueberry are harvested commercially including lowbush (Vaccinium. angustifolium), highbush (V. corymbosum), and rabbiteye bluberry (V. ashei or V. virgatum). Although mostly originated in North America, many blueberry species are widely grown in Asia, Europe, South America, Africa, Australia and New Zealand owing in part to their high level of vitro antioxidant capacities [3,4]. Blueberry is becoming a major crop in China, cultivated widely from temperate area to subtropical region. There are currently three major areas for blueberry cultivation in China, the Jilin and Liaoning provinces, the Shandong provinces and the areas of the Yangtze River [5].
Demand and consumption worldwide of blueberry has greatly increased in recent years for its beneficial influence on human health. These positive effects are generally due to the high levels of flavonoid [6], which have been linked to improve night vision, prevent macular degeneration, and decrease the heart disease [7,8]. Therefore, it is crucial to elucidate the molecular mechanisms that trigger biosynthesis and accumulation of anthocyanin metabolites during fruit and color development. The blueberry genome is large (600 Mb/haploid genome) and genomic information is limited compared to some plants like grape, for example [9], which restrains the dissection of blueberry. Over the past decades, more attention has been focused on the analysis of plant cold resistance, cultivation, and effects on human health [10,11]. Since massive amounts of information can be obtained from genome-scale expression data, RNA sequencing has become a powerful technology to profile the transcriptome [12]. To date, RNA sequencing has been reported in bilberry (Vaccinium myrtillus) and cranberry [13,14]. Recently, transcriptome sequences of blueberry were analyzed during cold acclimation and at different development stages of fruit by ESTs sequencing or RNA sequencing [2,15]. So far, transcriptome sequences have been generated using next generation sequencing so far from northern highbush [2], half-high [16], and southern highbush blueberry [17]. However, the information is still limited regarding the control of horticultural traits such as the molecular regulation mechanisms of blueberry maturation and flavonoid metabolism.
In order to gain new insights into molecular mechanism at transcriptome level, we performed transcriptome sequencing and gene expression profiling for the northern highbush blueberry variety 'Sierra' over berry development with Illumina sequencing technique. A total of more than 13.67 Gbp of data were generated and assembled into 186,962 transcripts and 80,836 unigenes. Large numbers of simple sequence repeats (SSRs) and candidate genes, which are potentially involved in plant growth, metabolic and hormone pathways, were identified. In addition, RNA-Seq expression profiles and functional annotations have been made publicly available (accession number: SRR2910056). We believe that this study provides a new and more powerful resource for interpretation of high-throughput gene expression data for blueberry species.

Results and discussion
Sequence analysis and de novo assembly The high quality Illumina sequencing data produced from Vaccinium corymbosum 'Sierra' during fruit and color development has been deposited in NCBI SRA database under accession number: SRR2910056. Three independent cDNA libraries were constructed from RNA samples of green fruit (G, 30d), pink fruit (P, 50d) and blue fruit (B, 75d), respectively. In total, after removing adaptor sequences, ambiguous nucleotides and low-quality sequences, 67,689,734 independent reads (13.67 Gbp) with more than 92 % Q30 bases were selected using Illumina HiSeq™ 2000 from the three samples ( Table 1). The GC content of the three samples was approximately 46 %.
Using the Trinity de novo assembly program [18], we assembled the short read sequences from the three samples into 186,962 transcripts. An overview of the assembled transcripts and unigenes are exhibited in Tables 2  and 3. The N50 value of the three sample assemblies was 1688 bp, and 66, 855 (36.95 %) transcripts were longer than 1 kb. The transcripts were subjected to cluster and assembly analyses, resulting in 80,836 unigenes (N50 value = 1204 bp), among which 15,457 (19.13 %) genes were greater than 1 kb. This is about 10 times greater than that in a previous report where a 17,134 ESTs resulted in a total of 8500 unigenes from two samples by Sanger sequencing [15]. Li et al. reported 57,331 unigenes in half-high blueberry from their two fruit samples (pulp and skin) [16]. Gupta et al. produced around 60,000 gene models from five stages of berry fruit development and ripening [17]. Therefore, we believe that the 13.67 Gbp of data generated and 186,962 transcripts and 80,836 unigenes assembled in this study are an important advancement and new resource for interpretation of high-throughput data for blueberry species whether regarding sequencing data depth or species extension. It is noteworthy that we have compared our dataset with the previously published Illumina dataset [17]. Compared with the previously published dataset, our dataset has a higher amount of data (Nucleotide), better data quality (Q20% or Q30%), and better assembly results (Average length and N50) (Additional file 1: File S1). There is no doubt that we now have a more powerful information database with higher capacity for gene expression analysis in Vaccinium, especially for fruit development and maturation, flavonoid metabolism and regulation and breeding for more healthful blueberries.

Similarity analysis
In order to better evaluate the similarity between blueberry and other organisms, blueberry unigenes were submitted to the non-redundant (Nr) NCBI database for BLASTx similarity analysis. The results showed that Vaccinium corymbosum unigenes hit a wide range of plant species including Vitis vinifera, Populus trichocarpa, Prunus avium, Ricinus communis, Fragaria vesca, Jatropha curcas, Solanum lycopersicum, Solanum lycopersicum, Camellia sinensis, Gossypium hirsutum, Brassica rapa, Vaccinium macrocarpon, etc. (Fig. 1). Among them, interestingly, 10,280 (12.72 %) unigenes showed a significant homology with sequences of Vitis vinifera, and 1654 (2.05 %) and 1183 (1.46 %) unigene sequences had a high similarity with those of Populus trichocarpa and Prunus avium, respectively. The high similarity of unigenes between Vaccinium corymbosum and Vitis vinifera suggests the possibility of using Vitis vinifera transcriptomes and genomes as a reference sequence for Vaccinium corymbosum. In contrast, only 242 (0.29 %) unigenes had a high similarity with sequences of Vaccinium macrocarpon, probably owing to the insufficient number of sequences in Genebank.

Sequence annotation
To annotate the assembled sequences, several complementary approaches were utilized. The unigenes were annotated by aligning with the deposited ones in diverse protein databases (Nr, Nt, Swiss-Prot, KEGG, COG, GO and TrEMBL) and the best one was selected from the matches with an E-value of less than 10 −5 . The overall functional annotation is described in Table 4

GO, COG, KEGG annotation
Gene Ontology (GO) enrichment analysis was carried out to classify gene functions of the unigenes identified. The majority of the GO terms (68,355, 49.58 %) were assigned to biological process, and 35,683 (25.89 %) and 33,840 (24.54 %) were assigned to the molecular function and the cellular component, respectively (Fig. 2). It was noteworthy that cells and organelles are highly represented in category of cellular components, while binding and catalytic activity are highly typical in molecular functions category. For biological processes, genes implicated in metabolic processes and cellular processes are the most represented category followed by the response to stimulus and biological regulation (Fig. 3). These data suggest that the active synthesis of substance and energy and metabolic process are present in  Fruits at developmental stages of green, pink, and blue were sampled for transcriptome sequencing a wide variety of organelles and associated with fruit development and in response to stress during fruit ripening. Furthermore, all blueberry unigenes were searched against the COG database for functional prediction and classification. Overall, 80,836 unigenes were assigned to the COG classification (Fig. 3). The largest group was the cluster for general function prediction (2502, 17.84 %), followed by translation, ribosomal structure and biogenesis (1297, 9.25 %), replication, recombination and repair (1262, 9.00 %), transcription (1160, 8.27 %), posttranslational modification, protein turnover and chaperones (1071, 7.64 %), signal transduction mechanisms (942, 6.72 %), and carbohydrate transport and metabolism (868, 6.19 %). However, small clusters were for cell motility and nuclear structure (13 and 2 unigenes, respectively). In addition, no unigene was assigned to extracellular structures.
To characterize the active biological pathways in blueberry, KEGG pathway tool was used as an alternative approach to analyze the pathway annotations of unigene sequences. The 7616 unigenes were assigned to 119 biological pathways (Additional file 2: File S2). These predicted pathways are responsible for growth and development probably via compound biosynthesis, degradation, utilization, and assimilation. These results suggest that a large number of metabolic activities are occurring during fruit development and coloring of blueberry.

EST-SSR discovery
As highly informative markers, SSRs have developed into powerful molecular markers for comparative genetic mapping and genotyping among species within genera [19,20]. To date, SSRs are most widely applied in genetics, evolution and breeding.
To explore EST-SSR profiles in the unigenes of blueberry, 15,457 unigene sequences were submitted to search for SSRs. A total of 6429 sequences containing 8796 SSRs were identified from 15,457 unigenes, with 1763 unigene sequences containing more than one SSR (Table 5 and Additional file 3: File S3). Di-nucleotide motifs and tri-nucleotide motifs were the most abundant with 75.11 % (4426) and 23.81 % (1403), respectively. The most abundant repeat type was AG/CT (4069), followed by AAG/CTT (388), AGG/CCT (227), and ACC/GGT (225). Because SSRs within genes are likely to be subjected to stronger selective pressure than other  genomic regions, these SSRs probably represent different putative functions [21]. Therefore, the unigenes yielded from blueberry are a larger resource for SSR mining than ever and the SSR profiles which we explored are a powerful platform for research in genetics, evolution and molecular marker-assistant breeding, etc.

Analysis of gene regulation in anthocyanin biosynthesis pathway using the assembled unigenes
In this study, we analyzed all unigene annotation in blueberry. Some genes responsible for anthocyanin biosynthesis were screened for further analysis. Anthocyanins are a large class of flavonoids which are responsible for the colors of flowers, fruits and other tissues [22]. A simple anthocyanin biosynthesis pathway is shown in Fig. 4. In most of the pathway, more than one unique sequence was annotated as encoding the same enzyme, especially the final step involving 3-glycoside formation by UFGT (UDP Glc-favonoid 3-O-glucosyl transferase) (Additional file 4: File S4).

The relative expression levels of candidate unigenes and the content of anthocyanins during blueberry fruit and color development
To validate the results from the bioinformatics analysis, the expression profile of 9 differentially expressed genes of anthocyanin biosynthesis pathway such as PAL, C4H, 4CL, CHS, CHI, F3H, DFR, ANS, and UFGT were tested by RT-qPCR using cDNA templates from different fruit growth stages (Fig. 5). The number of unigenes assumed to be involved in anthocyanin biosynthesis pathway is shown in Additional file 4: File S4 and the specific primers used for RT-qPCR reactions are listed in Table 6. Among them, five candidate genes covered over 10 unigenes. The transcripts of some genes such as CHS, CHI, F3H, DFR, DFR, ANS, and UFGT increased with fruit coloring (stage 4) and reached the highest level at red fruit stage (stage 5), then slightly declined with fruit maturation (stage 5 and 6). CHS showed significant increase in expression at the stage of fruit coloring (stage 4), about a 19-fold increase compared with green fruit    [16]. This is consistent with our results. Zifkin et al. characterized the expression level of DFR, ANS, ANR, and LAR in blueberry and found that the expression of DFR, ANS increased at the S6 stage (red fruit) [16]. The anthocyanin content was next quantified at different developmental stages to test if anthocyanins accumulate in accordance with the pattern of gene expression. As expected, at green berry stage, anthocyanins were detected at low levels, only 0.5 mg/g FW. In accordance with the deep coloration, in mature fruits (blue fruit) the levels of anthocyanin peaked dramatically. Analysis of expression and anthocyanin accumulation indicated that there was a significant correlation between the expression profile of candidate genes and the accumulation of anthocyanins.

Conclusions
In the present study, the transcriptome and gene expression profiling over fruit development and coloring of blueberry were initiated by means of Illumina sequencing. A total of approximately 13.67 Gbp of data were obtained and assembled into 186,962 transcripts and 80,836 unigenes from three stages of blueberry fruit. A large number of simple sequence repeats (SSRs) and candidate genes were identified. A total of 6429 sequences containing 8796 SSRs were identified from 15,457 unigenes, and 1763 unigenes containing more than one SSR. Compared with the previously published data, our dataset has a higher amount of data (Nucleotide), better data quality (Q20 or Q30%), and better assembly results (Average length and N50). Therefore, our high quality reads produced in this study are an important advancement and a new resource for interpretation of high-throughput data for blueberry species regarding from sequencing data depth or species extension.

Plant materials and RNA extraction
Blueberry (Vaccinium corymbosum 'Sierra') fruits were harvested from an organic blueberry farm (Tianshuo Farm in Hebei Province, China) during the 2013 to 2015 growing season. The tissues at three different developmental stages including green fruit (G, 30d), pink fruit (P, 50d) and blue fruit (B, 75d), were randomly sampled in the field from 4-or 5-year-old healthy blueberry plants. The plants had been propagated by tissue culture, thus all came from the same mother plant. All samples intended for RNA extraction were flash-frozen in liquid nitrogen immediately after collection and stored at −80°C until use. Fifty fruits were collected at each time point and combined for RNA extraction. RNA was extracted using the Plant total RNA Kit (TIANGEN, Beijing, China). The purified RNA quality and quantity was evaluated using a spectrophotometer (Thermo Scientific, Waltham, MA, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA).

mRNA-seq library construction for Illumina sequencing
The mRNA-seq library was constructed following the Illumina's mRNA-seq Sample Preparation Kit (Illumina Inc., San Diego, CA, USA) from three stages of fruit development, separately. Briefly, the mRNA was purified from total RNA samples using oligo (dT) magnetic beads, and then broken into small pieces by divalent cation under an elevated temperature. Taking these RNA fragments as templates, the first strand cDNA was synthesized by reverse transcriptase and random primers, while the second strand cDNA was synthesized using DNA Polymerase I and RNase H. The cDNA fragments were blunt-ended and ligated to sequencing adaptors. The fragments (200 ± 25 bp) were then separated by agarose gel electrophoresis and selected for PCR amplification. Finally, the mRNA-seq libraries were sequenced by the Illumina HiSeq™ 2000 sequencing.

Sequence data analysis and assembly
The high-quality clean read data for assembly was separated from adapters and low-quality. Reads with more than 10 % Q30 bases were removed. De novo assembly of transcriptome was done using the Trinity method with an optimized k-mer length of 25 [18]. The contigs were clustered and further assembled according to paired-end information. The longest transcripts in the cluster units were defined as unigenes. EMBOSS Getorf Software was used to predict coding regions (http:// emboss.bioinformatics.nl/cgi-bin/emboss/getorf ).