Evolution of gene structure in the conifer Picea glauca: a comparative analysis of the impact of intron size
© Stival Sena et al.; licensee BioMed Central Ltd. 2014
Received: 11 September 2013
Accepted: 9 April 2014
Published: 16 April 2014
A positive relationship between genome size and intron length is observed across eukaryotes including Angiosperms plants, indicating a co-evolution of genome size and gene structure. Conifers have very large genomes and longer introns on average than most plants, but impacts of their large genome and longer introns on gene structure has not be described.
Gene structure was analyzed for 35 genes of Picea glauca obtained from BAC sequencing and genome assembly, including comparisons with A. thaliana, P. trichocarpa and Z. mays. We aimed to develop an understanding of impact of long introns on the structure of individual genes. The number and length of exons was well conserved among the species compared but on average, P. glauca introns were longer and genes had four times more intronic sequence than Arabidopsis, and 2 times more than poplar and maize. However, pairwise comparisons of individual genes gave variable results and not all contrasts were statistically significant. Genes generally accumulated one or a few longer introns in species with larger genomes but the position of long introns was variable between plant lineages. In P. glauca, highly expressed genes generally had more intronic sequence than tissue preferential genes. Comparisons with the Pinus taeda BACs and genome scaffolds showed a high conservation for position of long introns and for sequence of short introns. A survey of 1836 P. glauca genes obtained by sequence capture mostly containing introns <1 Kbp showed that repeated sequences were 10× more abundant in introns than in exons.
Conifers have large amounts of intronic sequence per gene for seed plants due to the presence of few long introns and repetitive element sequences are ubiquitous in their introns. Results indicate a complex landscape of intron sizes and distribution across taxa and between genes with different expression profiles.
KeywordsGenome size Pinus taeda BAC Repeat elements Gymnosperms Gene expression
Many factors related to genome size, recombination rate, expression level, and effective population size, among others, have been proposed to affect the evolution of gene structure [1–4]. At the molecular level, genome size variations may result from mobile or transposable elements (TEs), whole genome duplication events, and polyploidization events, among others. Comparative studies have shown that intron lengths and the abundance of mobile elements directly correlate with genome size, such that large genomes have longer introns and a higher proportion of mobile elements . Mobile elements also impact gene structure and function as they can insert into genes, including introns and exons, and thus contribute to the evolution of genes.
Conifer trees have very large genomes ranging from 18 to 35 Gbp  that are composed of a large fraction of repetitive sequences [6, 7]. New insight into plant genome evolution are expected from the unique structure and history of conifer genomes , which may contribute to a broader understanding of the relationships between gene structure and genome architecture. Draft genome assemblies were recently reported for the European Picea abies (Norway spruce)  as well as the North American species Picea glauca (white spruce)  and Pinus taeda (loblolly pine) [11, 12]. Nystedt et al.  reported that Norway spruce and other conifers accumulate long introns and showed that some introns can be very long (>10 Kbp) compared to other plant species.
A positive relationship between genome size and intron length has been observed in broad phylogenetic studies [2, 13, 14] including between recently diverged Drosophila species harboring considerable difference in genome size, where D. viliris had longer introns than D. melanogaster. In plants, a few studies investigated this question within angiosperms, indicating that genome size is not necessarily a good predictor of intron length [16, 17] although a general trend is observed. For instance, Arabidopsis thaliana, Populus trichocarpa, Zea mays have well characterized genomes that range in size from 125 Mbp to 2.3 Gbp; their average exons sizes are between 250 and 259, whereas their introns sizes are 168 bp, 380 bp and 607 bp on average, respectively [18–20]. The length of introns may depend upon gene function and expression level; however, there is considerable debate surrounding this issue when it comes to plant genomes. In Oryza sativa and A. thaliana it was found that highly expressed genes contained more and longer introns than genes expressed at a low level , which is in contrast to findings in Caenorhabditis elegans and Homo sapiens.
Transposable elements are among the factors that may influence the evolution of intron size, as they represent the major component of plant genomes . In Vitis vinifera, transposable elements comprise 80% of long introns . In many plants, LTR-RT represent a large fraction of the genome but are more abundant in gene poor regions of the genome; therefore, their impact on the evolution of gene structure may actually be lesser than other classes of transposable elements such as MITEs  and helitrons, both of which are known to insert into or close to genes .
To date, studies related to genome size and the evolution of plant introns have primarily involved angiosperms (flowering plants), many of which have genomes under 1 Gbp. More recently, the Picea abies and Pinus taeda genomes were shown to have among the largest average introns size [9, 12]. We aimed to develop an understanding of the gene structure in conifers through a detailed analysis of individual genes with a particular emphasis on the potential impact of long introns on gene structure trough comparative analyses. An underlying question relates to potential impacts on gene expression; therefore, our analyses took into account their expression profiles. Gene structure was analyzed in two conifers (P. glauca and P. taeda) and three angiosperms. We explored three main hypothesis: (1) Intron length is the major type of variation affecting gene structure in conifers compared to other plant species; (2) there is a positive relationship between genome size and intron length in P. glauca compared to A. thaliana, Z. mays and P. trichocarpa; (3) P. glauca and P. taeda present a conserved gene structure despite the fact that they diverged over 100 MYA in keeping with their low rate of genome evolution .
We present a detailed analysis of gene structure for 35 genes from the conifer Picea glauca obtained from BAC sequencing and genome assembly and comparative analyses with A. thaliana, P. trichocarpa and Z. mays. Our study also included the analysis of nearly 6000 gene sequences obtained from sequence capture aiming to explore the potential impact of repetitive sequences on intron size in P. glauca. Our findings show that intron size and the position of long introns within genes is variable between plant lineages but highly conserved in conifers.
Genomic sequences were analyzed for several P. glauca genes. The sequences were obtained either by targeted BAC isolations, from an early assembly of the P. glauca genome , or from a sequence capture experiment (for details, see Methods).
A total of 21 BAC clones were isolated each containing a different single copy gene associated with secondary cell-wall formation or with nitrogen metabolism. Following shotgun sequencing by GS-FLX and assembly with the Newbler software, the integrity and identity of each gene was verified. Estimated size of BAC clones was 131 Kb on average and coverage was 144× (for Summary statistics, see Additional file 1: Table S6). Twenty of the 21 targeted genes were complete as determined by sequence alignment indicating full coverage of FL cDNA sequences from spruces and pines (P. glauca, P. sitchensis, P. taeda and Pinus sylvestris) [25–28] (Additional file 1: Table S7). Nearly all genes were contained within a single contig, except the LIM gene which lacked one exon, and the Susy gene which was complete cDNA sequence but spanned two contigs. None of BACs contained other genes as determined by BLAST searches against the P. glauca gene catalog  and the Swiss Prot database.
Sequences were also isolated from a whole genome shotgun assembly of P. glauca. Sequences with ubiquitous expression were targeted in order to complement the set of more specialized genes which had been selected for BAC isolation. The P. glauca genome shotgun assembly was screened with the complete CDS derived from cDNA sequences (according to Rigault et al. ) that were highly expressed in most tissues (according to Raherison et al. ). A total of 18 genomic sequences were randomly selected among those that spanned the entire coding region of the targeted gene.
Gene expression profiles
Gene structures and comparative analysis with angiosperms
The gene structure (exon and introns regions) of P. glauca genes was determined by mapping the complete cDNA onto the genomic sequence (BACs or shotgun contigs) for 35 genes. Homologs were retrieved from three well-characterized angiosperm genomes, Arabidopsis thaliana, Populus trichocarpa and Zea mays. The comparative analyses considered all of the genes together and also as two separate groups, i.e. genes highly expressed and genes related to secondary cell-wall formation and nitrogen metabolism. On average, the protein coding sequence similarity between P. glauca and A. thaliana was 76%, 78% with P. trichocarpa and 75% with Z. mays.
Average number and length of exons in genes used for comparative analyses
Highly expressed genes1
Secondary cell-wall formation and nitrogen metabolism genes2
Comparative analysis of gene structures between Picea glauca and Pinus taeda
A total of 23 different genes were submitted to pairwise comparisons between Picea glauca and Pinus taeda, which are both of the Pinaceae (for details, see Methods). A high level of similarity was observed for coding sequences (91% on average) indicating that they were likely orthologous genes (Additional file 1: Table S4), and gene structure was conserved between the two conifers, with almost identical numbers of exons. The total intronic sequences per gene did not vary significantly at 3.13 and 3.17 Kbp for P. glauca and P. taeda, respectively (Additional file 1: Table S1). Pairwise comparison of introns indicated that the majority of individual introns were similar in length in the two species, despite the fact that the two genera diverged ca. 140 million years ago [32, 33] (Figure 5). Although these observations are based on a set of only 23 genes, they provide an indication that intron length is mostly conserved between these two conifer genera.
Repeat elements in Picea glaucagenes
The possible origin of long introns as observed in conifer genomes was investigated by searching for the presence of repeated sequences including transposable elements.
First, the repetitive element content of the BACs was estimated based on a repetitive library constructed with P. glauca data (see Methods) as a baseline. It was 55% on average, but it varied considerably among the BAC clones, ranging from 18% to 83%. Additional file 4: Figure S1 shows that around half of repetitive sequences were classified as LTR-RT elements and the other half as unknown elements (without significant hits in Repbase and nr genbank).
We then analyzed the sequences of the 35 P. glauca genes described above including those identified in BACs, representing a total of 238 introns. The gene structures of these genes were screened for repeat elements using a P. glauca repeat library (see Methods). We found repetitive elements in 10 of the genes for a total of 24 unclassified fragments with no significant hits in RepBase; 22 of the fragments produced no hits in genbank and were 179 bp on average and only two had significant hits in nr genbank (Additional file 1: Table S8).
Abundance of repetitive elements in P. glauca genes obtained from sequence capture
This study reports on the detailed gene structure analysis of 35 genes from the conifer Picea glauca obtained from BAC sequencing and genome assembly. Recent analyses of the Picea abies and Pinus taeda genomes have analyzed individual introns and reported among the highest average intron lengths, the longest introns and highest average among long introns [9, 12]. We aimed to develop an understanding of the gene structure in conifers through a detailed analysis of entire genes taking into account gene expression profiles, with a particular emphasis on the potential impact of longer introns on gene structure trough comparative analyses. Our findings were also derived from the analysis of nearly 6000 gene sequences obtained from sequence capture sequencing. We present an interpretation of our findings in regard to the evolution of gene structure.
Evolution of gene structure in plants
Analyses over a broad phylogenetic spectrum in eukaryotes showed that increases in genome size correlate with increases in the average intron length [2, 13]. A strong relationship between intron length and genome size was observed from studies in humans and pufferfish , species of Drosophilla, and from studies of plants with small genomes [2, 13].
Our study compared the gene structure (introns and exons) of 35 homologous genes between four seed plant species with very different genome sizes. The conifer P. glauca has the largest genome with 19.8 Gbp ; among angiosperms, the monocot Z. mays has a genome of 2.3 Gbp , and dicots represent smaller plant genomes in this set, i.e. P. trichocarpa with genome of 484 Mbp  and A. thaliana with the smallest genome of 125 Mbp . In the present study, the average exon length was similar between the four species, but the overall length of genes varied owing to longer introns in P. glauca, P. trichocarpa and Z. mays. For the set of sequences analyzed, P. glauca had 4.1 times more intron sequence per gene than Arabidopsis, 2.2 times more than poplar and 1.8 times more than maize (Figures 3 and 4); however, the statistical significance of these differences was variable.
Even though our study was based on 35 genes, our results are consistent with variations of intron size reported for A. thaliana, P. trichocarpa and Z. mays genomes [9, 12, 18–20]. We concluded that the increased intron length in P. glauca, P. trichocarpa and Z. mays was heterogeneous compared to A. thaliana. Even in genes with many introns, only a few introns were very long, whereas in Arabidopsis, genes exhibited a more uniform intron length, suggesting that introns expansion or contraction within a gene may be independent across species.
Comparisons between the A. thaliana (125 Mbp) and A. lyrata (~200 Mbp) genomes, which diverged about 10 million years ago, showed that most of the difference in genome size was due to hundreds of thousands of small deletions, mostly in noncoding DNA . The authors concluded that evolution toward genome compaction is occurring in Arabidopsis. Conifers such as species of Picea and Pinus have large amounts of repetitive elements in intergenic regions and apparently more intronic sequence per gene in comparison to many angiosperms. Our results do not reveal whether the P. glauca genome and introns are expanding, or alternatively evolving at slower pace, than other plant genomes which are contracting. Some evidence like the presence of very ancient retrotransposon elements [9, 36] and the lack of gene rearrangements since before their split from extant angiosperms  lend credence to the paradigm that conifer genomes are slowly evolving.
Repetitive sequences in gene evolution
Transposable elements play a role in plant genes as was shown by the abundance of TE- gene chimeras in Arabidopsis which was reported as 7.8% of expressed genes . The abundance of TEs may be especially high in long introns as recently shown in Picea abies where most of the introns were longer than 5 Kbp, representing 5% of the total intron count . This trend was also observed in other repeat rich genomes as V. vinifera and Z. mays[20, 21, 38].
We isolated P. glauca BAC clones each containing a different complete transcription unit for 21 target genes. In each the BACs (average 131 Kb), only one intact gene sequence was identified, which is indicative of large intergenic regions as reported for other conifers [39–41]. Previous studies on conifer trees have considered only two targeted genes (from terpenoid biosynthesis) isolated from P. glauca BAC clones  and only a few other intact genes with complete coding sequence isolated from BACs in pines [7, 39, 41].
Complete sequencing of the P. glauca BACs showed that the repetitive element content is not distributed uniformly in proximal intergenic regions, as indicated by the variable proportion of repetitive elements among the different BACs. A study in 10 P. taeda BACs, sequences similar to eukaryote repeat elements (according to Repbase) represented 23% of the sequence on average, and ranged from 19% to 33% . In P. glauca, 26% of BAC sequences were classified as LTR-RT repetitive elements on average and ranged from 8% to 47%, while P. taeda had an average of 18.8% of LTR-RT . Furthermore, an average 26% of the P. glauca BAC sequences were unknown repeat elements. Results in spruce and pine indicate a relatively low abundance of TEs in gene proximal sequences compared to whole genomes at 70% in the Picea abies genome  and around 80% in Pinus taeda.
Picea and Pinus genomes are reported to have among the highest average for the longest intron per gene, when compared to angiosperms of diverse genome sizes . We verified whether insertions of repetitive elements could be responsible for the length of introns in P. glauca in a set of more than 1800 genes sequences, and found that more genes harboring repetitive elements in introns were 10 times more frequent than genes harboring repetitive elements in exons, i.e. 29.8% vs 3.2%. The vast majority of the repetitive elements were short fragments, suggesting that they were remnants or fragments of TE insertions that have not persisted and could represent ancient insertion events. Importantly, interpretation of our findings in P. glauca must take into account the fact that the sequences were derived from a sequence capture study and that nearly all of the introns in the data set were <1 Kbp. Thus we show that TE sequences are ubiquitous even in genes that do not harbor long introns, suggesting that their presence has been very widespread during the evolution of conifer genes. An analysis of intact LTR TE in Picea genomic sequences showed that most insertions date back to 10 MYA or more, with a maximum around 20–25 MYA . The TE remnants that we detected in P. glauca indicate that many genes introns contained TE in a more or less distant past. In this report and in recent analyses of conifer genomes, an emphasis has been place on long introns; however the median intron length in conifers is very similar to other plant species, most of which have a median between 100 bp and 200 bp. Therefore our findings on intron are relevant for a large majority of introns rather than a small fraction represented by large or very large introns.
Slow evolution of conifer genes
Analyses of the gene structure of 23 orthologous genes between P. glauca and P. taeda clearly showed the conservation of gene structure and the distribution of intron sizes in spite a divergence time of 100 to 140 MYA [32, 33]. The conservation of long introns was also observed across gymnosperm taxa, where a group of long introns in P. abies was identified as orthologous to long introns in P. sylvestris and Gnetum gnemon. We suggest that the long introns observed in P. glauca likely date back to a period predating the divergence of major conifer groups. As more conifers genomes become available [9–11] and assembly contiguities are improved it will be possible to extend this analyses of orthologous gene structures among conifers.
We also observed that the sequence of many introns was highly similar between spruce and pine, and that shorter introns were more conserved on average. Between humans and chimpanzee, a strong positive correlation was found between intron length and divergence . The pattern found in conifers as well as observations in primates lead to the hypothesis that shorter introns could be under stronger selection pressure than longer introns, which could be explained by factors such as the maintenance of functional regulatory elements in shorter introns or impacts on RNA transcript processing and stability. In our analysis of sequence similarity between Picea and Pinus, 20 of the introns were longer than 1 Kbp and only two of them had high sequence similarity. Future studies with more long introns are required to confirm the hypothesis that shorter introns are more conserved in conifers. Despite the fact that introns are assumed to be non-coding, conserved introns may play a functional role related to gene expression.
Costs and benefits associated with intron size
There is also considerable debate about other factors that may impact the evolution of introns, aside from transposable elements. Lynch  stated that the reduced efficiency of selection in regions of low recombination may lead to an increase in intron size if small introns provide a slightly improved transcription efficiency or splicing accuracy. On the other hand, Comeron and Kreitman  proposed that there might be situations in which a longer intron is selectively advantageous as an explanation for intron persistence and increased lengths. If so, there would be indirect selection for large introns in regions of low recombination because they can reduce the load caused by deleterious mutations by increasing the recombination rate. It was proposed that conifers have low recombination rates at both the genome and within-gene scales . Their low recombination rates may explain at least in part, the accumulation of longer introns.
The high degree of sequence conservation that we observed in short introns between spruce and pine may also depend on the recombination rate within genes, where small introns would be under stronger selection because of efficiency in transcription and splicing, and long introns in regions of low recombination diverge because of reduced selection pressure. Another factor underlying the evolution of intron size is that intron length would be constrained by energy use during transcription, given that large introns represent a higher cost of transcription, the so-called “economy” or low-cost transcription hypothesis . In the present study, the 35 P. glauca genes analyzed were divided in two groups based on their expression profiles, i.e. 17 genes associated with secondary cell-wall formation or nitrogen metabolism, many of which had tissue preferential expression, and 18 genes that were highly transcribed in a large range of tissues (based on Raherison et al. ). The highly expressed genes had more intronic sequences per gene on average than the more specialized subset of genes (4,182 bp versus 3,013 bp). We also observed a large variation among genes in each group, i.e. from 446 to 12,009 bp in highly transcribed genes and 440 to 9,847 bp in the set of more specialized genes. These observations do not support the “economy” hypothesis in P. glauca as there appears to be no clear rule governing the relationship between intron size and expression levels or profiles. In humans, genes contained total intronic sequences are ~5,500 bp per gene on average , which more than any plant described so far. It was observed that intron length declines steadily as the expression level increases in humans, in agreement with the low-cost transcription hypothesis . Considering the smaller amount of intron sequences in plant genes including conifers compared to humans, it may be that the economy rule does not impact their introns as strongly as in vertebrates and that other evolutionary forces are main drivers of intron size evolution. This interpretation is consistent with the findings reported for the P. abies genome . Future studies with more genes are needed to confirm this hypothesis.
Our results indicate that P. glauca has longer introns than Arabidopsis, P. trichocarpa and Z. mays on average due the presence of few long introns. Intron size and the position of long introns within genes were variable between plant lineages but well conserved in conifers. Our findings are consistent with recent reports indicating that conifers accumulate very long introns but we point out that long introns represent a relatively small fraction of the overall intronic content, which is reflected by the median length of similar size to other plants. We show that RE sequences are detected at a high frequency (32%) even in introns <1 Kbp, indicating their ubiquitous presence in conifer genes over the course of evolution.
Taken together, our observations and the recent literature suggest that the evolution of plant gene structure is determined by more interacting forces than classically expected. The pattern is reminiscent of the heterogeneity of rates of evolution at the genetic, genomic and morphological levels seen among seed plants including angiosperms, conifers, annual and perennial taxa. It stands to reason that the distinctive features of the conifer genome, such as its large size and relatively small occupancy of the gene space, its conserved macro-structure, the large numbers of repetitive elements, and long introns, represent the product of the intricate evolutionary history of conifers.
Picea glaucaBAC isolation and validation
A BAC library developed from the single Picea glauca (Moench) Voss individual PG29 from BC Ministry of Forests was utilized. The non-arrayed library consisted of approximately 1.1 million BAC clones with an average insert size of 140 kbp, representing approximately 3× coverage of P. glauca genome . The library was screened by quantitative PCR (qPCR) through successive steps using BAC super-pools and pools, serial dilutions and clone identify verification by amplicon sequencing. The BAC isolation and sequencing are reported here for the first time.
We isolated 21 BAC clones each containing a different single copy gene from P. glauca (see list of genes and accession numbers in Additional file 1: Table S2). Each of the genes screened was represented by a unique FL-cDNA clone in P. glauca as described in Rigault et al. . The selected genes encoded enzymes and transcriptional regulators involved in secondary cell-wall formation and nitrogen metabolism and were subject to manual curation. They were chosen as to facilitate comparison with BAC isolation studies conducted in other conifers species (e.g. ). Two sets of gene specific primers were designed for each gene based on the cDNA sequence available in P. glauca gene catalogue . The genomic sequence obtained was used to design two additional primers such that two small amplicons of 120–200 bp could be amplified by quantitative PCR (qPCR) (Additional file 1: Table S3). All of the primer sets were verified by PCR and qPCR using the genomic DNA from P. glauca, genotype PG-653, and then they were used to screen the BAC library in three steps. See PCR conditions in Additional file 5.
The BAC library was subdivided into pools with a titer 1000 BACs on average, which were arrayed into ten 96 deep-well plates. Each plate was inoculated in 96-well culture plates with 1 ml of terrific broth (TB) and 20 μg/mL of chloramphenicol and grown in a 37°C shaker at 300 rpm overnight. The same TB medium and growing conditions were utilized to culture bacteria throughout the screening steps. Bacterial cultures from each of the columns and rows within a plate were combined in a total of 200 super-pools for DNA isolation as described in Osoegawa et al. .
The first step followed Jeukens et al. . Briefly, the super-pool DNA was amplified by the two small amplicons for each target gene by qPCR using QuantiTect SYBR Green master mix as described in Boyle et al. . The intersection of a positive row and a positive column was indicative of positive wells on the original plate. The presence of target genes in the positive super-pools was verified by qPCR in 30 μL reactions using QuantiTect SYBR Green master . We performed PCR of the long amplicon and its purification for gene sequence validation by Sanger sequencing (Additional file 1: Table S3). The second step of the screening relied on serial dilutions of the super-pools to inoculate 50 bacteria from the positive per well in a 96 deep-well plate. DNA super-pools were extracted and screened by qPCR using the same conditions as in the first step. Then, we extracted DNA from 1 μL of bacterial culture from each well of a positive column to test it by qPCR and determine the positive well in the column. From the positive well of the same bacterial culture plate, we proceeded with serial dilutions and we inoculated a 96 deep-well plate with one isolated colony per well. The third step of the screening consisted to pool columns and rows of bacterial cultures. We identified positive wells by qPCR and plated the culture of each positive well on a different Petri dish and one colony per dish was inoculated in 5 mL TB with chloramphenicol. DNA was extracted from 2 mL of each culture. Positive isolated clones were validated by PCR, qPCR and resequencing of the long amplicon. The validation steps to confirm gene identity and integrity proved essential as conifers such as conifers contain many pseudogenes that reduce the efficiency of targeted BAC isolation .
The 21 isolated BAC clones, each identified by screening for a different gene (for accessions numbers, see Additional file 1: Table S2) were sequenced by Roche 454 FLX pyrosequencing at McGill University and Genome Québec Innovation Centre, Montreal, Canada. Sequences were assembled de novo into contigs using the GS De novo Assembler module of Newbler version 2.3 (Roche) . In this analysis, the BAC vector and E. coli genome were trimmed and the assembly parameters were a minimum overlap of 200 bp length, minimum overlap identity of 98% and minimum contig length of 500 bp. In general, more than one contig per BAC was obtained; therefore, the order of the contigs within each BAC was tested by PCR.
To determine gene structure (introns and exons), cDNAs were mapped onto the BAC contigs containing the respective gene using est2genome incorporated in the annotation software MAKER . Four of the genes were eliminated from the comparative gene structure analyses because they were either incomplete, lacked introns or identifiable homologs in the species targeted for comparative analyses.
Pinus taedaorthologous sequences
Seven BAC clones of Pinus taeda containing orthologs of P. glauca genes were identified by BLAST  using an e-value threshold of 1e-20 and sequence identity > 90% (Additional file 1: Table S4). An additional 16 sequences were identified by BLAST  using an e-value threshold of 1e-20 in the whole genome shotgun assembly of Pinus taeda. Their gene structures were defined using est2genome  and P. taeda cDNA or P. glauca cDNA when P. taeda complete cDNA was not available. Accessions numbers available in Additional file 1: Table S4. A pairwise alignment of all corresponding intron and exon sequences of orthologous genes between P. glauca and P. taeda was conducted, followed by the estimation of their similarity with the software Needle, part of the analysis package EMBOSS . The BAC clones containing the LIM gene in P. glauca and the Korrigan, Peptidase_C, Thiolase, Gp_dh_C and eRF1_2 genes in P. taeda lacked an intron and exon; these missing exons and introns were excluded from the comparison between P. glauca and P. taeda.
Screening for highly expressed genes in whole genome shotgun assembly
Based on transcript profiles (PiceaGenExpress database ) a set of 500 gene sequences each representing a unique FL-cDNA clone that was highly expressed in all tissues was identified from the P. glauca gene catalog . A preliminary assembly of the P. glauca genome assembly described by Birol et al.  was screened with each of these sequences. The screening was performed by exonerate and est2genome model , which considers intron/exon boundaries. The cDNA/genome alignments were further filtered based on the identity and length coverage to retain only complete alignments with entire cDNAs; i.e., genes with complete genomic sequence. We randomly selected 18 the genomic sequences thus identified as containing complete structures of highly expressed genes. As for genes contained into the BACs, genes annotations were generate automatically and curated manually individual reciprocal BLASTs and sequence alignments.
Identification of closest homologs in angiosperms
Homologous sequences to P. glauca genes were identified in Arabidopsis thaliana and P. trichocarpa by BLASTX  with threshold e-value of 1e-10. Reciprocal analysis (BLASTX) between the A. thaliana, P. trichocarpa sequence and the P. glauca gene catalogue was used to verify that the genes were the closest homologs among known sequences. In Zea mays, the closest homolog was identified based on A. thaliana sequences by BLASTX, with a threshold e-value of 1e-10, and orthology was verified in the Maize Genome Project Sequencing database (Additional file 1: Table S5). We also performed a BLASTX of P. glauca against Z. mays sequences and we verified that the closest homologs identified between Z.mays and A. thaliana were among the best hits. Gene structures were recovered from the databases of TAIR 10 , Phytozome (JGI v3.0 gene annotation of assembly v3 of P. trichocarpa) [18, 55] and the Maize Genome Sequencing Project . Accession numbers available in Additional file 1: Table S5.
From the 21 genes contained into the P. glauca BAC clones, four genes were eliminated from the gene structure comparative analyzes between P. glauca, A. thaliana, P. trichocarpa and Z. mays: (1) Dof5 because a clear A. thaliana homolog could not be identified, (2) asparaginase because a clear Z. mays homolog could not be identified, (3) PAL because it lacked introns and (4) LIM because it presented an incomplete sequence (cDNA). A total of 35 genes had their gene structure compared with closest homologs in angiosperms: 17 genes related to secondary cell wall formation and nitrogen metabolism and 18 highly transcribed genes with little tissue-specific expression.
Statistical analyses of introns
Intron lengths were compared between P. glauca, A. thaliana, P. trichocarpa and Z. mays by nonparametric Kruskal-Wallis tests with post-test analysis by Dunn’s multiple comparisons, because intron length did not follow a Gaussian distribution. Comparisons of two groups of genes used by Wilcoxon rank sum test with continuity correction; this test was used to compare total intron sequences in P. glauca genes belonging to the two expression groups and to compare P. glauca with P. taeda. Data analyses were performed using the R packages coin and multcomp [57–59].
Gene space obtained from sequence capture technology
Sequences were obtained by using genomic DNA hybridizations on a custom P. glauca chip containing oligonucleotide baits for 23,864 genes. The method development, the DNA sequence isolation and analysis procedure and the resulting sequence data are reported in this manuscript for the first time.
DNA was extracted from needles of the P. glauca individual 77111 from the Canadian Forest Service as described in Pelgas et al.  using the DNeasy Plant mini kit according to the manufacturer’s instructions (QIAGEN). One microgram of high quality DNA was used to prepare a GS-FLX rapid library according to the manufacturer instructions (Roche). The library was amplified by ligation-mediated PCR using 454 A and B primers as described in the NimbleGen SeqCap EZ Library LR User’s guide.
Custom probes were designed by Nimblegen based on the cDNA sequences and ESTs from the P. glauca gene catalogue . We used a Newbler (gsAssembler module v2.5.3) assembly of sequences from random genomic sequencing (0.15× of coverage) from P. glauca to identify highly repetitive elements and to filter out probes representing such elements as they were expected to reduce the efficiency of the sequence capture. Next, a comparative genomic hybridization (Array CGH) experiment was conducted in collaboration with Nimblegen (Madison, WI, USA) to eliminate probes with abnormally high level of hybridization that could not be identified with in silico approaches. Throughout the process, probes within genes harboring abnormally high capture levels were eliminated. The final design covered 23,864 genes. The target enrichment procedures including quantitative PCR assessments are described in Additional file 5.
Emulsion PCR and GS-FLX Titanium sequencing was performed according to manufacturer’s instructions at the Plateforme d’Analyses Génomiques of the Institut de Biologie Intégrative et des Systèmes (Université Laval, Quebec, Canada). Raw sequencing reads were de novo assembled using the gsAssembler module of Newbler v2.5.3. Contigs were screened for complete gene structures based on the P. glauca gene catalogue . Technical details are available in Additional file 5.
Picea glaucarepetitive library and identification of repeat elements
For repeat identification, a random sample of 100,000 P. glauca 454 reads from randomly sheared DNA was searched de novo for repeats using the software RepeatScout . The results were filtered by removing low complexity sequences and sequences shorter than 100 nt, and retaining only repeats having at least 10 matches when mapped onto the original 454 set using RepeatMasker . Since RepeatScout is tailored to analyze complete genomes or at least large scaffolds, its output is usually fragmented when the program is run on random sheared reads. In order to reduce fragmentation, we merged the repeats belonging to the same element running cap3  under relaxed settings (-o 30, -p 80, -s 500) on the RepeatScout output. Finally, the entire set of repeated sequences was clustered using the software cd-hit-est  by collapsing all the repeats sharing at least 80% similarity in order to remove redundancies.
Repeat characterization proceeded by similarity searches were used to associate candidate repeats to known TE families and to remove repeats showing similarity to gene sequences and being possibly part of gene families. In particular, the repeat candidates from each species were searched against RepBase  using TBLASTX  and setting as significance threshold an e-value of 1e-5. Repeats that did not provide significant hits were used as queries in BLASTX searches against the non-redundant (nr) division of Genbank. Those having significant hits with genes were removed from the library while those having significant hits with TEs were labeled accordingly and the remaining repeats were considered as unclassified.
The search for repeat elements in all BAC contigs and in the gene space obtained from sequence capture was conducted using RepeatMasker  using the Picea glauca repetitive library and default parameters.
The authors thank D. Peterson (Mississippi State Univ., USA) and G. Claros and F. Canovas (Univ. de Málaga, Spain) for sharing information on gene targets and strategies for BAC isolation in pines. Technical assistance of S. Caron, É. Fortin, G. Tessier (Univ. Laval, Canada) is acknowledged for BAC screening. F. Belzile, R. Lévesque, L. Bernatchez (Univ. Laval, Canada) are acknowledged for valuable discussions and suggestions at the project planning stage. Funding for the project was received from Génome Québec for a Genome exploration grant (JM, JBou, PR, KR), from Genome Canada, Génome Québec and Genome British Columbia for the SmarTForests project (JM, JBoh, JBou, IB, KR, SJ) and NSERC of Canada (JM). JS received partial funding from Univ. Laval.
- Lynch M, Conery JS: The origins of genome complexity. Science. 2003, 302: 1401-1404. 10.1126/science.1089370.View ArticlePubMedGoogle Scholar
- Deutsch M, Long M: Intron-exon structures of eukaryotic model organisms. Nucleic Acids Res. 1999, 27: 3219-3228. 10.1093/nar/27.15.3219.PubMed CentralView ArticlePubMedGoogle Scholar
- Comeron JM, Kreitman M: The correlation between intron length and recombination in Drosophila: dynamic equilibrium between mutational and selective forces. Genetics. 2000, 156: 1175-1190.PubMed CentralPubMedGoogle Scholar
- Castillo-Davis CI, Mekhedov SL, Hartl DL, Koonin EV, Kondrashov FA: Selection for short introns in highly expressed genes. Nat Genet. 2002, 31: 415-418.PubMedGoogle Scholar
- Murray BG, Leitch IJ, Bennett MD: Gymnosperm DNA C-values Database. 2004, http://www.kew.org/cvalues/,Google Scholar
- Morse AM, Peterson DG, Islam-Faridi MN, Smith KE, Magbanua Z, Garcia SA, Kubisiak TL, Amerson HV, Carlson JE, Nelson CD, Davis JM: Evolution of genome size and complexity in Pinus. PLoS One. 2009, 4: e4332-10.1371/journal.pone.0004332.PubMed CentralView ArticlePubMedGoogle Scholar
- Magbanua ZV, Ozkan S, Bartlett BD, Chouvarine P, Saski CA, Liston A, Cronn RC, Nelson CD, Peterson DG: Adventures in the enormous: A 1.8 million clone BAC library for the 21.7 Gb genome of loblolly pine. PLoS One. 2011, 6: e16214-10.1371/journal.pone.0016214.PubMed CentralView ArticlePubMedGoogle Scholar
- Pavy N, Pelgas B, Laroche J, Rigault P, Isabel N, Bousquet J: A spruce gene map infers ancient plant genome reshuffling and subsequent slow evolution in the gymnosperm lineage leading to extant conifers. BMC Biol. 2012, 10: 84-10.1186/1741-7007-10-84.PubMed CentralView ArticlePubMedGoogle Scholar
- Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin Y-C, Scofield DG, Vezzi F, Delhomme N, Giacomello S, Alexeyenko A, Vicedomini R, Sahlin K, Sherwood E, Elfstrand M, Gramzow L, Holmberg K, Hällman J, Keech O, Klasson L, Koriabine M, Kucukoglu M, Käller M, Luthman J, Lysholm F, Niittylä T, Olson Å, Rilakovic N, Ritland C, Rosselló JA, Sena J, et al: The Norway spruce genome sequence and conifer genome evolution. Nature. 2013, 497: 579-584. 10.1038/nature12211.View ArticlePubMedGoogle Scholar
- Birol I, Raymond A, Jackman SD, Pleasance S, Coope R, Taylor GA, Yuen MMS, Keeling CI, Brand D, Vandervalk BP, Kirk H, Pandoh P, Moore RA, Zhao Y, Mungall AJ, Jaquish B, Yanchuk A, Ritland C, Boyle B, Bousquet J, Ritland K, Mackay J, Bohlmann J, Jones SJM: Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data. Bioinformatics. 2013, 29: 1492-1497. 10.1093/bioinformatics/btt178.PubMed CentralView ArticlePubMedGoogle Scholar
- Neale DB, Wegrzyn JL, Stevens KA, Zimin AV, Puiu D, Crepeau MW, Cardeno C, Koriabine M, Holtz-Morris AE, Liechty JD, Martínez-García PJ, Vasquez-Gross HA, Lin BY, Zieve JJ, Dougherty WM, Fuentes-Soriano S, Wu L-S, Gilbert D, Marçais G, Roberts M, Holt C, Yandell M, Davis JM, Smith KE, Dean JF, Lorenz WW, Whetten RW, Sederoff R, Wheeler N, McGuire PE, et al: Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies. Genome Biol. 2014, 15: R59-10.1186/gb-2014-15-3-r59.PubMed CentralView ArticlePubMedGoogle Scholar
- Wegrzyn JL, Liechty JD, Stevens KA, Wu L-S, Loopstra CA, Vasquez-Gross HA, Dougherty WM, Lin BY, Zieve JJ, Martínez-García PJ, Holt C, Yandell M, Zimin AV, Yorke JA, Crepeau MW, Puiu D, Salzberg SL, Dejong PJ, Mockaitis K, Main D, Langley CH, Neale DB: Unique features of the Loblolly Pine (Pinus taeda L.) Megagenome revealed through sequence annotation. Genetics. 2014, 196: 891-909. 10.1534/genetics.113.159996.PubMed CentralView ArticlePubMedGoogle Scholar
- Vinogradov AE: Intron–genome size relationship on a large evolutionary scale. J Mol Evol. 1999, 49: 376-384. 10.1007/PL00006561.View ArticlePubMedGoogle Scholar
- McLysaght A, Enright AJ, Skrabanek L, Wolfe KH: Estimation of synteny conservation and genome compaction between pufferfish (Fugu) and human. Yeast. 2000, 17: 22-36. 10.1002/(SICI)1097-0061(200004)17:1<22::AID-YEA5>3.3.CO;2-J.PubMed CentralView ArticlePubMedGoogle Scholar
- Moriyama EN, Petrov DA, Hartl DL: Genome size and intron size in Drosophila. Mol Biol Evol. 1998, 15: 770-773. 10.1093/oxfordjournals.molbev.a025980.View ArticlePubMedGoogle Scholar
- Wendel JF, Cronn RC, Alvarez I, Liu B, Small RL, Senchina DS: Intron size and genome size in plants. Mol Biol Evol. 2002, 19: 2346-2352. 10.1093/oxfordjournals.molbev.a004062.View ArticlePubMedGoogle Scholar
- Jiang K, Goertzen LR: Spliceosomal intron size expansion in domesticated grapevine (Vitis vinifera). BMC Res Notes. 2011, 4: 52-10.1186/1756-0500-4-52.PubMed CentralView ArticlePubMedGoogle Scholar
- Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A, Schein J, Sterck L, Aerts A, Bhalerao RR, Bhalerao RP, Blaudez D, Boerjan W, Brun A, Brunner A, Busov V, Campbell M, Carlson J, Chalot M, Chapman J, Chen G-L, Cooper D, Coutinho PM, Couturier J, Covert S, Cronk Q, et al: The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science. 2006, 313: 1596-1604. 10.1126/science.1128691.View ArticlePubMedGoogle Scholar
- Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.View ArticleGoogle Scholar
- Haberer G, Young S, Bharti AK, Gundlach H, Raymond C, Fuks G, Butler E, Wing RA, Rounsley S, Birren B, Nusbaum C, Mayer KFX, Messing J: Structure and architecture of the maize genome. Plant Physiol. 2005, 139: 1612-1624. 10.1104/pp.105.068718.PubMed CentralView ArticlePubMedGoogle Scholar
- Ren X-Y, Vorst O, Fiers MWEJ, Stiekema WJ, Nap J-P: In plants, highly expressed genes are the least compact. Trends Genet. 2006, 22: 528-532. 10.1016/j.tig.2006.08.008.View ArticlePubMedGoogle Scholar
- Kumar A, Bennetzen JL: Plant retrotransposons. Annu Rev Genet. 1999, 33: 479-532. 10.1146/annurev.genet.33.1.479.View ArticlePubMedGoogle Scholar
- Feschotte C, Jiang N, Wessler SR: Plant transposable elements: where genetics meets genomics. Nat Rev Genet. 2002, 3: 329-341.View ArticlePubMedGoogle Scholar
- Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, Minx P, Reily AD, Courtney L, Kruchowski SS, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock SM, Belter E, Du F, Kim K, Abbott RM, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson SM, Gillam B, et al: The B73 maize genome: complexity, diversity, and dynamics. Science. 2009, 326: 1112-1115. 10.1126/science.1178534.View ArticlePubMedGoogle Scholar
- Ralph SG, Chun HJ, Kolosova N, Cooper D, Oddy C, Ritland CE, Kirkpatrick R, Moore R, Barber S, Holt RA, Jones SJ, Marra MA, Douglas CJ, Ritland K, Bohlmann J: A conifer genomics resource of 200,000 spruce (Picea spp.) ESTs and 6,464 high-quality, sequence-finished full-length cDNAs for Sitka spruce (Picea sitchensis). BMC Genomics. 2008, 9: 484-10.1186/1471-2164-9-484.PubMed CentralView ArticlePubMedGoogle Scholar
- Bedon F, Grima-Pettenati J, Mackay J: Conifer R2R3-MYB transcription factors: sequence analyses and gene expression in wood-forming tissues of white spruce (Picea glauca). BMC Plant Biol. 2007, 7: 17-10.1186/1471-2229-7-17.PubMed CentralView ArticlePubMedGoogle Scholar
- Cañas RA, de la Torre F, Cánovas FM, Cantón FR: High levels of asparagine synthetase in hypocotyls of pine seedlings suggest a role of the enzyme in re-allocation of seed-stored nitrogen. Planta. 2006, 224: 83-95. 10.1007/s00425-005-0196-6.View ArticlePubMedGoogle Scholar
- Nairn CJ, Lennon DM, Wood-Jones A, Nairn AV, Dean JFD: Carbohydrate-related genes and cell wall biosynthesis in vascular tissues of loblolly pine (Pinus taeda). Tree Physiol. 2008, 28: 1099-1110. 10.1093/treephys/28.7.1099.View ArticlePubMedGoogle Scholar
- Rigault P, Boyle B, Lepage P, Cooke JEK, Bousquet J, MacKay JJ: A white spruce gene catalog for conifer genome analyses. Plant Physiol. 2011, 157: 14-28. 10.1104/pp.111.179663.PubMed CentralView ArticlePubMedGoogle Scholar
- Raherison E, Rigault P, Caron S, Poulin P-L, Boyle B, Verta J-P, Giguère I, Bomal C, Bohlmann J, MacKay J: Transcriptome profiling in conifers and the PiceaGenExpress database show patterns of diversification within gene families and interspecific conservation in vascular gene expression. BMC Genomics. 2012, 13: 434-10.1186/1471-2164-13-434.PubMed CentralView ArticlePubMedGoogle Scholar
- Bradnam KR, Korf I: Longer first introns are a general property of eukaryotic gene structure. PLoS One. 2008, 3: e3093-10.1371/journal.pone.0003093.PubMed CentralView ArticlePubMedGoogle Scholar
- Savard L, Li P, Strauss SH, Chase MW, Michaud M, Bousquet J: Chloroplast and nuclear gene sequences indicate late Pennsylvanian time for the last common ancestor of extant seed plants. Proc Natl Acad Sci U S A. 1994, 91: 5163-5167. 10.1073/pnas.91.11.5163.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang X-Q, Tank DC, Sang T: Phylogeny and divergence times in Pinaceae: evidence from three genomes. Mol Biol Evol. 2000, 17: 773-781. 10.1093/oxfordjournals.molbev.a026356.View ArticlePubMedGoogle Scholar
- Ohri D, Khoshoo TN: Genome size in gymnosperms. Plandt Syst Evol. 1986, 153: 119-132. 10.1007/BF00989421.View ArticleGoogle Scholar
- Hu TT, Pattyn P, Bakker EG, Cao J, Cheng J-F, Clark RM, Fahlgren N, Fawcett JA, Grimwood J, Gundlach H, Haberer G, Hollister JD, Ossowski S, Ottilar RP, Salamov AA, Schneeberger K, Spannagl M, Wang X, Yang L, Nasrallah ME, Bergelson J, Carrington JC, Gaut BS, Schmutz J, Mayer KFX, Van de Peer Y, Grigoriev IV, Nordborg M, Weigel D, Guo Y-L: The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat Genet. 2011, 43: 476-481. 10.1038/ng.807.PubMed CentralView ArticlePubMedGoogle Scholar
- Morgante M, De Poali E: Toward the conifer genome sequence. Genetics, Genomics and Breeding of Conifers Trees. Edited by: Plomion C, Bousquet J, Kole C. 2011, Enfield: Science Publishers, 389-403.Google Scholar
- Lockton S, Gaut BS: The contribution of transposable elements to expressed coding sequence in Arabidopsis thaliana. J Mol Evol. 2009, 68: 80-89. 10.1007/s00239-008-9190-5.View ArticlePubMedGoogle Scholar
- Jaillon O, Aury J-M, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, Vezzi A, Legeai F, Hugueney P, Dasilva C, Horner D, Mica E, Jublot D, Poulain J, Bruyère C, Billault A, Segurens B, Gouyvenoux M, Ugarte E, Cattonaro F, Anthouard V, Vico V, Fabbro CD, Alaux M, Gaspero GD, Dumas V, et al: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007, 449: 463-467. 10.1038/nature06148.View ArticlePubMedGoogle Scholar
- Kovach A, Wegrzyn JL, Parra G, Holt C, Bruening GE, Loopstra CA, Hartigan J, Yandell M, Langley CH, Korf I, Neale DB: The Pinus taeda genome is characterized by diverse and highly diverged repetitive sequences. BMC Genomics. 2010, 11: 420-10.1186/1471-2164-11-420.PubMed CentralView ArticlePubMedGoogle Scholar
- Hamberger B, Hall D, Yuen M, Oddy C, Hamberger B, Keeling CI, Ritland C, Ritland K, Bohlmann J: Targeted isolation, sequence assembly and characterization of two white spruce (Picea glauca) BAC clones for terpenoid synthase and cytochrome P450 genes involved in conifer defence reveal insights into a conifer genome. BMC Plant Biol. 2009, 9: 106-10.1186/1471-2229-9-106.PubMed CentralView ArticlePubMedGoogle Scholar
- Bautista R, Villalobos DP, Díaz-Moreno S, Cantón FR, Cánovas FM, Claros MG: Toward a Pinus pinaster bacterial artificial chromosome library. Ann For Sci. 2007, 64: 855-864. 10.1051/forest:2007060.View ArticleGoogle Scholar
- Gazave E, Marqués-Bonet T, Fernando O, Charlesworth B, Navarro A: Patterns and rates of intron divergence between humans and chimpanzees. Genome Biol. 2007, 8: R21-10.1186/gb-2007-8-2-r21.PubMed CentralView ArticlePubMedGoogle Scholar
- Lynch M: Intron evolution as a population-genetic process. Proc Natl Acad Sci U S A. 2002, 99: 6118-6123. 10.1073/pnas.092595699.PubMed CentralView ArticlePubMedGoogle Scholar
- Jaramillo-Correa JP, Verdú M, González-Martínez SC: The contribution of recombination to heterozygosity differs among plant evolutionary lineages and life-forms. BMC Evol Biol. 2010, 10: 22-10.1186/1471-2148-10-22.PubMed CentralView ArticlePubMedGoogle Scholar
- Sakharkar MK, Chow VTK, Kangueane P: Distributions of exons and introns in the human genome. In Silico Biol (Gedrukt). 2004, 4: 387-393.Google Scholar
- Osoegawa K, de Jong PJ, Frengen E, Ioannou PA: Construction of bacterial artificial chromosome (BAC/PAC) libraries. Current Protocols in Molecular Biology. Edited by: Ausubel FM, Brent R, Kingston RE, Moore DD, Seidman JG, Smith JA, Struhl K. 2001, Hoboken, NJ, USA: John Wiley & Sons IncGoogle Scholar
- Jeukens J, Boyle B, Kukavica-Ibrulj I, St-Cyr J, Lévesque RC, Bernatchez L: BAC library construction, screening and clone sequencing of lake whitefish (Coregonus clupeaformis, Salmonidae) towards the elucidation of adaptive species divergence. Mol Ecol Resour. 2011, 11: 541-549. 10.1111/j.1755-0998.2011.02982.x.View ArticlePubMedGoogle Scholar
- Boyle B, Dallaire N, MacKay J: Evaluation of the impact of single nucleotide polymorphisms and primer mismatches on quantitative PCR. BMC Biotech. 2009, 9: 75-10.1186/1472-6750-9-75.View ArticleGoogle Scholar
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Goodwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer MLI, Jarvie TP, Jirage KB, Kim J-B, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, et al: Genome sequencing in open microfabricated high density picoliter reactors. Nature. 2005, 437: 376-380.PubMed CentralPubMedGoogle Scholar
- Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Alvarado AS, Yandell M: MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008, 18: 188-196.PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1016/S0022-2836(05)80360-2.View ArticlePubMedGoogle Scholar
- Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000, 16: 276-277. 10.1016/S0168-9525(00)02024-2.View ArticlePubMedGoogle Scholar
- Slater GS, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinforma. 2005, 6: 31-10.1186/1471-2105-6-31.View ArticleGoogle Scholar
- The Arabidopsis Information Resource. http://arabidopsis.org.
- Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N, Rokhsar DS: Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012, 40 (D1): D1178-D1186. 10.1093/nar/gkr944.PubMed CentralView ArticlePubMedGoogle Scholar
- Maize Genome Sequencing Project. http://www.maizesequence.org,
- Hothorn T, Bretz F, Westfall P: Simultaneous inference in general parametric models. Biom J. 2008, 50: 346-363. 10.1002/bimj.200810425.View ArticlePubMedGoogle Scholar
- Hothorn T, Hornik K, van de Wiel MA, Zeileis A: Implementing a class of permutation tests: the coin package. J Stat Softw. 28 (8): 1-23. URL http://www.jstatsoft.org/v28/i08/
- R project. http://www.r-project.org,
- Pelgas B, Bousquet J, Meirmans PG, Ritland K, Isabel N: QTL mapping in white spruce: gene maps and genomic regions underlying adaptive traits across pedigrees, years and environments. BMC Genomics. 2011, 12: 145-10.1186/1471-2164-12-145.PubMed CentralView ArticlePubMedGoogle Scholar
- Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large genomes. Bioinformatics. 2005, 21 (Suppl 1): i351-i358. 10.1093/bioinformatics/bti1018.View ArticlePubMedGoogle Scholar
- Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0. 1996–2010 http://www.repeatmasker.orgGoogle Scholar
- Huang X, Madan A: CAP3: a DNA sequence assembly program. Genome Res. 1999, 9: 868-877. 10.1101/gr.9.9.868.PubMed CentralView ArticlePubMedGoogle Scholar
- Huang Y, Niu B, Gao Y, Fu L, Li W: CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010, 26: 680-682. 10.1093/bioinformatics/btq003.PubMed CentralView ArticlePubMedGoogle Scholar
- Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005, 110: 462-467. 10.1159/000084979.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.