Comparative BAC end sequence analysis of tomato and potato reveals overrepresentation of specific gene families in potato

Datema, Erwin; Mueller, Lukas A; Buels, Robert; Giovannoni, James J; Visser, Richard GF; Stiekema, Willem J; van Ham, Roeland CHJ

doi:10.1186/1471-2229-8-34

Research article
Open access
Published: 11 April 2008

Erwin Datema^1,2,
Lukas A Mueller^3,
Robert Buels^3,
James J Giovannoni^4,
Richard GF Visser^5,
Willem J Stiekema^2,6 &
…
Roeland CHJ van Ham^1,2

BMC Plant Biology volume 8, Article number: 34 (2008)

Abstract

Background

Tomato (Solanum lycopersicum) and potato (S. tuberosum) are two economically important crop species, the genomes of which are currently being sequenced. This study presents a first genome-wide analysis of these two species, based on two large collections of BAC end sequences representing approximately 19% of the tomato genome and 10% of the potato genome.

Results

The tomato genome has a higher repeat content than the potato genome, primarily due to a higher number of retrotransposon insertions in the tomato genome. On the other hand, simple sequence repeats are more abundant in potato than in tomato. The two genomes also differ in the frequency distribution of SSR motifs. Based on EST and protein alignments, potato appears to contain up to 6,400 more putative coding regions than tomato. Major gene families such as cytochrome P450 mono-oxygenases and serine-threonine protein kinases are significantly overrepresented in potato, compared to tomato. Moreover, the P450 superfamily appears to have expanded spectacularly in both species compared to Arabidopsis thaliana, suggesting an expanded network of secondary metabolic pathways in the Solanaceae. Both tomato and potato appear to have a low level of microsynteny with A. thaliana. A higher degree of synteny was observed with Populus trichocarpa, specifically in the region between 15.2 and 19.4 Mb on P. trichocarpa chromosome 10.

Conclusion

The findings in this paper present a first glimpse into the evolution of Solanaceous genomes, both within the family and relative to other plant species. When the complete genome sequences of these species become available, whole-genome comparisons and protein- or repeat-family specific studies may shed more light on the observations made here.

Background

The Solanaceae, or Nightshade family, is a dicot plant family that includes many economically important genera that are used in agriculture, horticulture, and other industries. Family members include the tuber bearing potato (Solanum tuberosum); a large number of fruit-bearing vegetables, such as peppers (Capsicum spp), tomatoes (S. lycopersicum), and eggplant (S. melongena); leafy tobacco (Nicotiana tabacum); and ornamental flowers from the Petunia and Solanum genera.

Tomato is generally considered to be a model crop plant species, for which many high-quality genetic and genomic resources are available, such as high-density molecular maps [1], many well-characterized near-isogenic lines (NILs), and rich collections of ESTs and full-length cDNAs [2, 3]. Potato is the most important crop within the Solanaceae, ranking fourth as a world food crop following wheat, maize and rice. Similar resources are available for potato, including an ultra-high density linkage map [4], a collection of phenotype data [5], and a large transcript database [6]. Like most other nightshades, tomato and potato both have a basic chromosome number of twelve, and there is genome-wide colinearity between their genomes [7].

Much effort is currently being invested to sequence the nuclear and organellar genomes of these organisms. The International Tomato Genome Sequencing Project [8] is sequencing the tomato (S. lycopersicum cv. Heinz 1706) genome in the context of the family-wide Solanaceae Project (SOL). Rather than sequencing the complete genome, which is approximately 950 Mb [9], only the gene-rich euchromatic regions (estimated at 240 Mb) are being sequenced using a BAC-by-BAC walking approach [10]. The Potato Genome Sequencing Consortium (PGSC) [11] aims to sequence the complete potato (S. tuberosum, genotype RH89-039-16) genome of approximately 840 Mb [4] using a similar marker-anchored BAC-by-BAC sequencing strategy.

Both sequencing projects rely heavily on BAC libraries, of which three exist for tomato (HindIII [12], MboI, and EcoRI) and two exist for potato (HindIII and EcoRI). The tomato libraries are available through the SOL Genomics Network (SGN) [13] and the potato libraries will soon be available through the PGSC [11]. All of these libraries have been end-sequenced to support BAC-by-BAC sequencing and extension, and to provide a base of genome-wide survey sequences to support studies such as the one presented here.

This paper describes the detailed sequence analysis of 310,580 tomato BAC End Sequences (BESs), representing 181.1 Mb (~19%) of the tomato genome, and 128,819 potato BESs, corresponding to 87.0 Mb (~10%) of the potato genome (for an overview of the tomato and potato BES data, see Table 1). This comparative genomics study aims to gain insight into the similarity between the tomato and potato genomes, both on the structural level through repeat and gene content analyses and on the functional level through gene function analyses. Furthermore, we investigate micro-syntenic relationships between these two Solanaceous genomes, and several other sequenced plant genomes. The sequence content of BESs from a particular library is biased by which restriction enzyme was used to make the library. To avoid comparing sequence sets with different biases, tomato-potato comparisons are made only between BESs from libraries made with the same enzyme.
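The coverage percentages quoted above follow from dividing the total BES length by the genome size estimates given earlier (approximately 950 Mb for tomato and 840 Mb for potato). A minimal Python sketch of that arithmetic, in which the dictionary and function names are illustrative and the calculation ignores redundancy among the BESs:

```python
# Sequence totals and genome size estimates quoted in the text (Mb).
BES_MB = {"tomato": 181.1, "potato": 87.0}
GENOME_MB = {"tomato": 950.0, "potato": 840.0}


def coverage_percent(species: str) -> float:
    """Return the BES total as a percentage of the estimated genome size."""
    return 100.0 * BES_MB[species] / GENOME_MB[species]


for species in BES_MB:
    print(f"{species}: {coverage_percent(species):.1f}% of the genome")
```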

Table 1 Overview of tomato and potato BES data

Results

Repeat density and categorization

Based on similarity searches of the repeat database, between 13.0% and 22.9% of the nucleotides in the tomato BESs were identified as belonging to a repeat (see Table 2, second through fourth columns). The most common repeat families in the tomato libraries were the Gypsy (5.0 – 11.6%) and Copia (4.2 – 5.3%) classes of retrotransposons. Another prominent class of repeats comprised the ribosomal RNA genes (<0.1 – 8.6%). The tomato Eco (EcoRI) library had the lowest repeat density at 13.0%, which can be attributed to a lower amount of Gypsy retrotransposons (5.0%). The highest repeat content was found in the tomato Mbo (MboI) library (22.9%), more than a third of which (8.6%) consisted of ribosomal RNA genes. Note that, since the repeat detection was based on sequence similarity, different segments in a BES could be assigned to more than one repeat family. As a result, the sum of the repeat content per repeat type can be slightly larger than the total repeat content.

Table 2 Classification and distribution of known plant repeats in the BAC end sequences

In contrast to the tomato BESs, only between 10.0% and 12.5% of the nucleotides in the potato BESs showed similarity to known Magnoliaphytae repeats (see Table 2, fifth and sixth columns). As in tomato, the majority of the repeats were found in the Gypsy (5.4 – 8.6%) and Copia (2.5 – 2.6%) retrotransposon families, whereas the fraction of ribosomal RNA genes was small (<0.1 – 0.5%). Potato appeared to contain approximately two times as many LINE and SINE elements as tomato (see Table 2), although the absolute percentages were low. Furthermore, a higher percentage of class II DNA transposons was observed in potato (1.0 – 1.2%, versus 0.5 – 0.7% in tomato), the majority of which could not be classified. In agreement with the differences observed between the tomato HBa (HindIII) and Eco libraries, the potato PPT (EcoRI) library had an overall lower repeat content than the POT (HindIII) library, and more specifically, a lower amount of Gypsy retrotransposons (5.4% versus 8.6% in the POT library). The PPT library was also enriched in ribosomal RNA genes in comparison to the POT library (0.5% versus less than 0.1%), just as was found comparing the Eco library to the HBa library in tomato.

Since similarity-based repeat detection can be limited by the size and diversity of the repeat database, a self-comparison of the BESs was performed in order to estimate the redundancy within the BESs. Even with the stringent requirement that at least 50% of a given query sequence match another BES with at least 90% identity, 52.0% of the nucleotides in the tomato BESs had a match to one or more other tomato BESs, and 19.0% matched five or more other BESs. The redundancy in the potato BESs was lower than in tomato; 39.0% of the nucleotides in the potato BESs had a hit to at least one other potato BES, and 12.9% had a hit to five or more BESs. This difference could not be attributed solely to the larger number of tomato BESs, compared to the number of potato BESs; a self-comparison of the tomato HBa library, which is of approximately the same size as the potato POT and PPT libraries combined, showed that 50.7% of the nucleotides in this library matched at least one other HBa BES, and 16.8% matched five or more other HBa BESs. The percentage of nucleotides in both species that matched five or more other BESs was only slightly higher than the findings from the RepeatMasker analysis (see Table 2), suggesting that the repeat database used in this study was sufficient to detect the majority of highly abundant repeats in these species. These findings also confirm the observation from the similarity-based repeat detection that the tomato BESs are more repetitive than the potato BESs.
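The self-comparison criterion (at least 50% of the query matched at 90% identity or more) can be illustrated with a short Python sketch. The input layout, the function names, and the choice to count the full length of a qualifying BES as redundant are simplifying assumptions made for illustration; they are not a description of the pipeline used in this study.

```python
from collections import defaultdict

MIN_QUERY_FRACTION = 0.5  # at least 50% of the query must be matched
MIN_IDENTITY = 90.0       # percent identity threshold from the text


def redundancy_partners(hits, lengths):
    """Map each query BES to the set of other BESs that it matches.

    hits: iterable of (query_id, subject_id, aligned_nt, percent_identity)
    lengths: dict mapping each BES id to its length in nt
    """
    partners = defaultdict(set)
    for query, subject, aligned_nt, identity in hits:
        if query == subject:
            continue  # ignore the trivial self-hit
        long_enough = aligned_nt >= MIN_QUERY_FRACTION * lengths[query]
        if long_enough and identity >= MIN_IDENTITY:
            partners[query].add(subject)
    return partners


def redundant_fraction(partners, lengths, min_partners):
    """Fraction of nucleotides in BESs with at least min_partners matches.

    Simplification: the whole BES is counted, not only the aligned part.
    """
    total = sum(lengths.values())
    redundant = sum(
        length
        for bes, length in lengths.items()
        if len(partners.get(bes, ())) >= min_partners
    )
    return redundant / total
```

Calling `redundant_fraction` with `min_partners` set to 1 and to 5 corresponds to the two thresholds reported above.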

Simple sequence repeats

A total of 28,423 SSRs with a motif length between one and five nt, and a total length of at least 15 nt were detected in the tomato BESs, representing one SSR per 6.4 kb of genomic sequence. The term 'motif length' is used here to describe the length of the motif that is repeated in the SSR; for example, an ATATAT repeat has a motif length of two (with AT being the motif). The most abundant motif length was five nucleotides (11,177 SSRs), followed by motif lengths of two (6,588 SSRs), four (4,596 SSRs), three (4,135 SSRs), and lastly one (1,927 SSRs).

In potato, 19,019 SSRs were found, out of which 3,964 (21%) belonged to class I (i.e., SSRs containing more than 10 motif repeats). Thus, the potato BESs had one SSR per 4.6 kb of genomic sequence, a higher SSR density than in tomato (one SSR per 6.4 kb). As in tomato, the most abundant motif length in the potato SSRs was five nucleotides (7,922 SSRs). However, the next most abundant length was three (3,941 SSRs), followed by motif lengths of two (3,270 SSRs), four (1,980 SSRs) and one (1,906 SSRs).
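The SSR criteria described in this section (motif length of one to five nt, total length of at least 15 nt, class I defined as more than 10 motif repeats) can be sketched in Python as follows. This is a regular-expression illustration that finds perfect repeats only; the tool actually used for SSR detection is not named in this section, and all function names here are hypothetical.

```python
import math
import re

MIN_TOTAL_LENGTH = 15  # nt, as stated in the text
MAX_MOTIF_LENGTH = 5   # nt, as stated in the text


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(sequence: str):
    """Yield (start, motif, repeat_count) for each perfect SSR found."""
    sequence = sequence.upper()
    for k in range(1, MAX_MOTIF_LENGTH + 1):
        min_repeats = math.ceil(MIN_TOTAL_LENGTH / k)
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip e.g. "AA" or "ATAT", which are reported at a shorter k.
            if is_primitive(motif):
                yield match.start(), motif, len(match.group(0)) // k


def is_class_one(repeat_count: int) -> bool:
    """Class I SSRs contain more than 10 motif repeats."""
    return repeat_count > 10


def kb_per_ssr(total_mb: float, ssr_count: int) -> float:
    """Average spacing between SSRs in kb, given total sequence in Mb."""
    return total_mb * 1000.0 / ssr_count
```

With the totals reported above, `kb_per_ssr(181.1, 28423)` and `kb_per_ssr(87.0, 19019)` reproduce the densities of roughly one SSR per 6.4 kb in tomato and one per 4.6 kb in potato.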

Figure 1 shows the distribution of the primary SSR motifs in the tomato and potato BESs, ordered by motif length and relative frequency within the motifs of the same length. The most abundant SSR motifs in both datasets were AT-rich, with the di-nucleotide repeat AT/TA being the most abundant (16.6% of all tomato and 14.7% of all potato SSRs, respectively). Several motifs, such as AG/CT, AC/GT, AATT/AATT and AAAG/CTTT were more frequent in tomato than in potato, whereas other motifs, such as AAG/CTT, AAC/GTT, AACTC/GAGTT and AAACC/GGTTT were found predominantly in potato.
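Motif labels such as AT/TA or AAG/CTT group a motif with its rotations and its reverse complement, since these describe the same repeat read from a different start position or strand. A possible way to compute such a grouping is sketched below in Python; choosing the alphabetically smallest form as the representative is an assumption made for illustration and may differ from the convention used for Figure 1.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif: str) -> str:
    """Return the reverse complement of a DNA motif."""
    return motif.translate(COMPLEMENT)[::-1]


def rotations(motif: str) -> set:
    """Return all cyclic rotations of a motif."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


def canonical_motif(motif: str) -> str:
    """Pick one representative for a motif and its equivalent forms."""
    motif = motif.upper()
    return min(rotations(motif) | rotations(reverse_complement(motif)))
```

Under this scheme, TA maps to AT, CTT maps to AAG, and GAGTT maps to AACTC, matching the first member of each pair named in the text.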

Considering only the class I SSRs, the most abundant SSR motifs in tomato and potato were AT/TA (50.8 and 39.1% of all class I SSRs, respectively) and A/T (25.8 and 42.1%). In tomato, the di-nucleotide motifs AC/GT (6.3%) and AG/CT (5.7%) were the most abundant after these two, whereas in potato the mononucleotide C/G (6.0%) and tri-nucleotide AAT/ATT (4.5%) and AAG/CTT (3.7%) occurred at the second, third and fourth highest frequency, respectively. This suggests that the differences in primary motif frequencies between tomato and potato also hold when considering only class I SSRs.

Gene content

In the tomato BESs, the percentage of nucleotides that matched at least one database sequence ranged from 21.3% for the Eco library, to 30.5% for the Mbo library. Figure 2 presents a breakdown of these BLAST hits into three main categories ('coding', 'repeats', and 'other'), based on the keyword filtering described in Materials and Methods. Each category was then subdivided into 'masked' and 'unmasked' subcategories, with 'masked' indicating an overlap with repetitive sequences identified by RepeatMasker, and 'unmasked' indicating a lack of such overlap. In this way, the BLAST and RepeatMasker results were combined in order to generate the best possible estimation of the percentage of putative protein-coding nucleotides in the BESs. The 'coding' category represents the percentage of nucleotides that matched one or more database sequences, and were not identified as repetitive by the keyword filtering. After removing the overlap with repeats identified by RepeatMasker, the percentage of coding nucleotides in the three libraries ranged from 3.5% for the Mbo library to 4.6% for the HBa library (the 'coding unmasked' category in Figure 2). The Mbo library had the highest percentage of the three libraries in the 'coding masked' category, which is likely the result of the high number of ribosomal repeat sequences in this library that have escaped the keyword filtering. The 'repeats' category contains the BLAST matches to transposon and other repeat related sequences. In all three libraries, there was a considerable fraction of nucleotides that the keyword filtering assigned to the 'repeats' category but that did not overlap with the repeats identified by RepeatMasker (i.e. the 'repeats unmasked' category). This fraction ranged from 6.9% in the Eco library to 8.4% in the HBa library and may represent a combination of repeats that were missed by RepeatMasker and true protein-coding genes that were misclassified by the keyword filtering. The final category in Figure 2, 'other', represents all non-transposon-related repetitive sequences that were identified by the keyword filtering (all keyword terms other than "Transposon terms" from Additional File 1).
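The split into 'masked' and 'unmasked' nucleotides amounts to intersecting the BLAST hit coordinates on a BES with the RepeatMasker coordinates on the same BES. A minimal Python sketch of that bookkeeping is given below; it assumes half-open (start, end) intervals and counts overlap per nucleotide, both of which are illustrative choices and not necessarily how the figures in this paper were computed.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def split_masked(hit_intervals, repeat_intervals):
    """Return (masked_nt, unmasked_nt) for the nucleotides covered by hits."""
    hits = merge_intervals(hit_intervals)
    repeats = merge_intervals(repeat_intervals)
    total = sum(end - start for start, end in hits)
    masked = 0
    # After merging, intervals within each list are disjoint, so the
    # pairwise overlaps can simply be summed.
    for h_start, h_end in hits:
        for r_start, r_end in repeats:
            masked += max(0, min(h_end, r_end) - max(h_start, r_start))
    return masked, total - masked
```

For example, hits at (0, 100), (50, 150) and (300, 400) with a single repeat at (120, 320) cover 250 nt, of which 50 nt are masked and 200 nt are unmasked.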

In the potato POT and PPT libraries, 24.3 and 20.5% of the nucleotides matched the protein database, respectively. While these numbers were slightly lower than those for the tomato HBa and Eco libraries (28.5 and 21.3%, respectively), the percentage of nucleotides assigned to the 'coding' category (6.8 and 6.3%) was larger than those of the corresponding tomato libraries (4.6 and 3.9%), suggesting that potato may have a larger gene repertoire than tomato. Furthermore, the number of transposon regions and other repeat-related regions that was found in this comparison to the protein database was more than 1.5-fold higher for tomato than for potato. This is consistent with the difference in transposon content that was found in the repeat analysis.

Figure 3 shows the results of the BLASTN comparison of the BESs to species-specific EST databases. The matches were divided into two categories, 'masked' and 'unmasked'. The 'masked' category contains the nucleotides that had a match in the EST database, but were found to be repetitive in the RepeatMasker analysis; the 'unmasked' category contains the nucleotides that did not overlap with repeats. In the tomato libraries, between 10.2 and 19.1% of the nucleotides matched one or more tomato EST sequences. The Mbo library had the highest EST coverage (19.1%), but more than half of these matches (10.3%) were 'masked'. The percentage of nucleotides in the 'unmasked' category ranged from 6.8% in the Eco library to 8.8% in the Mbo library.

For the potato BESs, 11.1% (POT) and 11.5% (PPT) of the nucleotides had a match in the potato EST database, which is in fairly good agreement with the tomato HBa and Eco comparisons versus the tomato database (11.3 and 10.2%, respectively; see also Figure 3). Fewer matches in the potato BESs were 'masked' than in tomato, confirming the observation from the BLASTX comparison to the protein database that the potato BESs have more protein coding nucleotides and lower repeat content.

Functional annotation

A total of 30,335 GO terms, of which 585 were unique, were assigned to the tomato HBa BESs based on matches in the Pfam database (see Additional Files 2, 3, 4, 5 for an overview of all GO terms and their corresponding frequencies in the tomato and potato BESs). Although there were more than half as many Eco BESs as HBa BESs, only 7,647 GO terms (403 unique terms) were assigned to them. In potato, 17,060 terms (544 unique terms) were assigned to the POT library, whereas only 9,312 terms (419 unique terms) were assigned to the PPT library. Comparing the GO annotations of tomato to those of potato (for libraries generated with the same restriction enzyme) resulted in 18 significantly overrepresented terms between the HindIII digested libraries (seven in tomato HBa, and eleven in potato POT; P values are found in Additional File 3) and nine significantly overrepresented terms between the EcoRI digested libraries (seven in tomato Eco, and two in potato PPT; P values are found in Additional File 2).
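To make the notion of an overrepresented term concrete, the Python sketch below compares the frequency of one term between two libraries with a 2x2 chi-square test. The statistical test applied in this study is described in Materials and Methods and is not reproduced in this section, so the chi-square test is a stand-in assumption. The library totals are the HBa and POT figures given above, whereas the per-term counts are hypothetical values chosen only for illustration. Multiple-testing correction is not shown.

```python
import math


def chi_square_2x2(count_a, total_a, count_b, total_b):
    """Return (statistic, p_value) for a 2x2 chi-square test.

    No continuity correction is applied; the p-value uses the identity
    P(X > x) = erfc(sqrt(x / 2)) for one degree of freedom.
    """
    a, b = count_a, total_a - count_a
    c, d = count_b, total_b - count_b
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0
    statistic = n * (a * d - b * c) ** 2 / denominator
    return statistic, math.erfc(math.sqrt(statistic / 2.0))


HBA_TOTAL = 30_335  # GO terms assigned to tomato HBa (from the text)
POT_TOTAL = 17_060  # GO terms assigned to potato POT (from the text)
HYPOTHETICAL_HBA_COUNT = 300  # illustrative count for one term, not real data
HYPOTHETICAL_POT_COUNT = 400  # illustrative count for one term, not real data

statistic, p_value = chi_square_2x2(
    HYPOTHETICAL_HBA_COUNT, HBA_TOTAL, HYPOTHETICAL_POT_COUNT, POT_TOTAL
)
print(f"chi-square = {statistic:.1f}, P = {p_value:.2e}")
```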

In both species, many of the terms that were overrepresented in the HindIII libraries compared to their EcoRI counterparts were related to retrotransposon activity, such as DNA binding (GO:0003677), DNA integration (GO:0015074), RNA-directed DNA polymerase activity (GO:0005634), and chromatin-related terms (GO:0000785, GO:0003682, GO:0006333). Furthermore, many of these transposon-related terms were significantly overrepresented in tomato, compared to potato (P value < 10^-4; individual P values are found in Additional Files 2 and 3). This is consistent with the findings from the RepeatMasker and BLAST analyses discussed above. Surprisingly, some terms that were overrepresented in both the EcoRI digested libraries could be linked to transcription factor genes. In tomato, zinc ion binding (GO:0008270), DNA-dependent regulation of transcription (GO:0006355), and transcription factor activity (GO:0003700) were overrepresented in the Eco library. The potato PPT library was enriched for zinc ion binding (GO:0008270), nucleic acid binding (GO:0003676), and transcription factor activity (GO:0003700).

Analysis of the protein families identified by PANTHER revealed similar trends for the number of matches, both within and between the tomato and potato libraries (see Additional Files 6, 7, 8, 9 for an overview of all PANTHER terms and their corresponding frequencies in the tomato and potato BESs). In tomato, 1,064 distinct families were found in the HBa BESs for a total of 28,984 hits, and 8,226 hits representing 654 families were found in the Eco BESs. Analysis of the potato POT library revealed 951 distinct PANTHER families for a total of 13,821 hits; however, only 6,926 hits to 716 families were found in the PPT BESs. Two and three PANTHER families were found to be overrepresented in the tomato HBa and Eco libraries, compared to eleven and five overrepresented families in the potato POT and PPT libraries, respectively.

Consistent with the greater abundance of Gypsy retrotransposons in the HindIII libraries of both tomato and potato, the GAG/POL/ENV polyprotein (PTHR10178) PANTHER family was found to be overrepresented in both HindIII libraries, compared to the corresponding EcoRI libraries. Furthermore, the GAG-POL-related retrotransposon (PTHR11439) PANTHER family was relatively more abundant in the EcoRI libraries, which also agrees with the difference in the Gypsy:Copia ratio between the HindIII and EcoRI libraries (see also Table 2). Both of these retrotransposon-related terms were found to be significantly (P value < 10^-4; individual P values are found in Additional Files 6 and 7) overrepresented in tomato when compared to potato. In the tomato Eco library, transcription-factor related terms such as zinc finger CCHC domain containing protein (PTHR23002), zinc finger protein (PTHR11389) and MADS box protein (PTHR11945) were significantly overrepresented (P values 4.0*10^-13, 7.8*10^-7, and 1.5*10^-6, respectively), confirming the results from the GO analysis. No transcription-factor related PANTHER families were significantly overrepresented in the potato PPT library.

Between tomato and potato, the majority of the overrepresented terms in potato corresponded to important biological and biochemical processes. For example, zinc finger CCHC domain containing proteins (PTHR23002) and general transcription factor 2-related zinc finger proteins (PTHR11697) occurred with a significantly (P value 2.2*10^-16 for both) higher frequency in potato POT than in tomato HBa; the latter was also overrepresented in the potato PPT library. This was also reflected in the GO annotation through terms such as nucleic acid binding (GO:0003676) and zinc ion binding (GO:0008270). The overrepresentation of these terms relative to tomato suggests an expansion of transcription factors or other genes for DNA binding proteins in the potato genome.

Another example is the cytochrome P450 superfamily (PTHR19383), which was also found in the GO analysis through terms such as iron ion binding (GO:0005506) and mono-oxygenase activity (GO:0004497). Cytochrome P450 proteins play important roles in the biosynthesis of secondary metabolites, and the overrepresentation of these proteins in potato could indicate an expanded network of pathways that synthesize secondary metabolites in potato.

A final example involves the large family of plant-type serine-threonine protein kinases (PTHR23258), which are known to play important roles in disease resistance in various plant species (for example, the Pto gene in tomato [14]). In the PANTHER database, this family consists of 104 different subfamilies, 71 of which were found in the tomato and potato BESs. Out of these 71 subfamilies, 15 were found only in tomato, and five were unique to potato. Most of the subfamilies that were found in both species were overrepresented in potato, such as LRR receptor-like kinases (PTHR23258:SF462) and LRR transmembrane kinases (PTHR23258:SF474). Several subfamilies occurred at a higher frequency in tomato, including serine/threonine specific receptor-like protein kinases (PTHR23258:SF416) and Pto-like kinases (PTHR23258:SF418). Thus, while the complement of serine-threonine protein kinases in potato exceeds that of tomato, several of the subfamilies have expanded specifically in tomato. This may reflect an adaptation for resistance to different pathogens, or a difference in the dominant mechanism of pathogen resistance between these species.

Comparative genome mapping

Out of the 135,842 pairs of tomato BESs that were compared to the A. thaliana genome, 15,283 pairs had one or more matches. These matches were divided into five categories, as is shown in the last five columns of Table 3. The 'single end' category represents the BAC end pairs from which only one of the two sequences had a match to the A. thaliana genome, and contained the majority of the matches (10,191). Paired end matches, in which the BESs from the same BAC each had a match to a different chromosome, were assigned to the 'non-linear' category. The 'gapped' category contained 4,836 BAC end pairs that matched the same A. thaliana chromosome with a distance between the paired matches that was either smaller than 50 kb or larger than 500 kb. The final two categories represented the BACs from which both end sequences were matched to the genome within a distance of 50 to 500 kb of each other, either in the correct orientation with respect to each other ('colinear'), or rearranged with respect to each other ('rearranged'). Out of the 4,840 tomato BES pairs that matched the same A. thaliana chromosome, three pairs fell into the 'colinear' category, and one pair fell into the 'rearranged' category, suggesting the presence of four putative micro-syntenic regions between tomato and A. thaliana.
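The five categories can be summarized as a small decision procedure, sketched here in Python. The 50 kb and 500 kb bounds come from the text. The `Hit` record and the orientation rule (the upstream end on the forward strand and the downstream end on the reverse strand, as expected for two reads pointing into the same insert) are assumptions introduced for illustration, since the exact orientation test is not spelled out in this section.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    chromosome: str
    position: int  # bp coordinate of the match on the target genome
    strand: str    # "+" or "-"


MIN_SPAN = 50_000   # bp, lower bound from the text
MAX_SPAN = 500_000  # bp, upper bound from the text


def classify_pair(left: Optional[Hit], right: Optional[Hit]) -> str:
    """Assign a BAC end pair to one of the categories used in Table 3."""
    if left is None and right is None:
        return "no hit"
    if left is None or right is None:
        return "single end"
    if left.chromosome != right.chromosome:
        return "non-linear"
    distance = abs(left.position - right.position)
    if distance < MIN_SPAN or distance > MAX_SPAN:
        return "gapped"
    # Assumed orientation rule: the two end reads face each other.
    first, second = sorted((left, right), key=lambda hit: hit.position)
    if first.strand == "+" and second.strand == "-":
        return "colinear"
    return "rearranged"
```

For instance, `classify_pair(Hit("chr1", 100_000, "+"), Hit("chr1", 250_000, "-"))` returns `"colinear"` under these assumptions, whereas the same pair with both hits on the forward strand would be reported as `"rearranged"`.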

Table 3 BLASTN hits between the tomato and potato BESs, and the A. thaliana genome

Potato had 55,662 pairs of BESs, out of which 117 pairs were mapped to the A. thaliana genome, with both BESs of the pair matching the same chromosome. Two potato BACs displayed putative microsynteny based on the end sequence matches, one of which was colinear, whereas the other represented a possible rearrangement. In comparison to tomato, potato had very few BACs that fell into the 'gapped' category, although the smaller PPT library had more than five times as many sequences in this category as the POT library. Interestingly, the large majority of the tomato BACs that fell into this category was from the Eco and Mbo libraries (1,279 and 3,507, respectively). The EcoRI and MboI digested libraries were found to contain a high fraction of ribosomal RNA genes in the RepeatMasker analysis, and indeed more than 80% of the sequences from these libraries that fell into the 'gapped' category contained ribosomal RNA genes.

Repeating the same analysis against the P. trichocarpa genome, only 708 of the tomato BES pairs matched with both ends to the same chromosome (the sum of the last three columns in Table 4). It should be noted here that P. trichocarpa has both a larger number of chromosomes than A. thaliana (19 versus 5) and approximately twenty-two thousand additional contig sequences that have not yet been integrated into the chromosome pseudomolecules. Based on these numbers alone, one would expect a smaller number of paired BESs to map to the same chromosome or contig sequence. Even so, P. trichocarpa displayed more regions of micro-synteny with tomato than A. thaliana: 73 pairs of BESs mapped within a distance between 50 and 500 kb of the other BES in the pair. More than two-thirds of these matches (51, the 'colinear' category in Table 4) showed colinearity between tomato and P. trichocarpa, whereas the remaining 22 hits represented rearrangements in their respective regions of micro-synteny.

Table 4 BLASTN hits between the tomato and potato BESs, and the P. trichocarpa genome

Consistent with the difference between the tomato – A. thaliana and tomato – P. trichocarpa mappings, fewer potato BES pairs could be mapped with both ends to the same chromosome in P. trichocarpa (75) than in A. thaliana (117). Of these, there were 41 regions of potential microsynteny, out of which 24 were colinear. Compared to tomato, the 'non-linear' and to a lesser extent the 'gapped' categories were underrepresented in potato. Again these differences seem to originate from the fact that many of the BESs in the Eco and Mbo libraries contain ribosomal RNA genes. The majority of these sequences fell into the 'non-linear' category in the P. trichocarpa comparison, rather than the 'gapped' category as was the case with A. thaliana, due to the ribosomal RNA genes being contained in some of the unassembled contig sequences rather than in the chromosomal pseudomolecules.

Discussion

Sequence properties

Based on the differences between the libraries in both tomato and potato, it seems unlikely that any of these partial digestion-based libraries represents an unbiased cross section of the genome. For example, in tomato the Mbo library has a higher GC percentage than the HBa and Eco libraries. This difference is likely caused by the length and GC content of the restriction sites that were targeted in the digestion of the genome: both the HindIII and EcoRI sites (AAGCTT and GAATTC, respectively) have a length of six nucleotides and a GC content of 33.3%, whereas the MboI site (GATC) has a length of four nucleotides and a GC content of 50%. The consequences of this are clearly visible in the results of the gene and repeat content analyses presented in this paper: results differ markedly among libraries made with different enzymes. However, we think it reasonable to assume that tomato and potato libraries derived from digestion with the same restriction enzyme would have similar sequence bias. Using this assumption, we strive to minimize any effect of sequence bias on our results by maintaining logical separation of BESs from different libraries, and only directly comparing data for BESs from libraries constructed with the same restriction enzymes.
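The site lengths and GC percentages quoted above can be checked with a few lines of Python. The expected spacing shown alongside is an idealized figure that assumes equal and independent base frequencies, which real genomes do not satisfy, and it does not account for the partial digestion used to build the libraries.

```python
# Recognition sites as given in the text.
SITES = {"HindIII": "AAGCTT", "EcoRI": "GAATTC", "MboI": "GATC"}


def gc_percent(site: str) -> float:
    """Percentage of G and C bases in a recognition site."""
    site = site.upper()
    return 100.0 * sum(base in "GC" for base in site) / len(site)


def expected_spacing(site: str) -> int:
    """Expected distance (bp) between sites in uniformly random sequence."""
    return 4 ** len(site)


for enzyme, site in SITES.items():
    print(
        f"{enzyme}: {site}, {len(site)} nt, "
        f"{gc_percent(site):.1f}% GC, ~1 site per {expected_spacing(site)} bp"
    )
```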

The tomato BESs (and specifically the Mbo BESs) are shorter than the potato BESs on average. The difference in average sequence length between the tomato HindIII and EcoRI libraries and their potato counterparts is approximately 60 nt for both libraries and is most likely the result of a difference in sequencing quality and equipment. However, we think it reasonable to assume that a difference in sequence length on this scale would not influence the results of the similarity-based analyses that have been performed in this study.