Our evaluation of targeted gene space in the strawberry diploid species, Fragaria vesca, unveils new knowledge about 20 important genomic neighborhoods: information that can guide a diversity of gene- or trait-specific investigations, and facilitate site-specific molecular marker development. Moreover, the cumulative generation of over 1.75 Mb of genomic sequence by the present investigation of 20 gene-targeted sites and its companion study of 31 randomly selected sites , provides an invaluable baseline of robustly assembled and carefully annotated Sanger sequence data to which future Next Generation data sets and high throughput bioinformatic analyses can be compared and assessed. We anticipate that the experience gained through this effort will contribute valuable perspective, precedent, and impetus to whole genome sequencing efforts in Fragaria.
Although our assessment of gene content ultimately relied on homology-based methods, ab initio predictions provided an illuminating framework within which to organize and interpret homology-based determinations. In undertaking a comparison of the six higher plant ab initio training models (Arabidopsis, Medicago, monocot, Nicotiana, Lycopersicon, Vitis) accessible on Softberry's FGENESH website, we hypothesized that the taxa most closely related phylogenetically to Fragaria would provide the best training models for our analysis. According to the most recent release of the Angiosperm Phylogeny Group, the ordered phylogenetic distances of the six training model taxa from Fragaria (order Rosales) are (closest to most distant) Medicago (Fabales) < Arabidopsis (Brassicales) < Vitis (Vitales) < Nicotiana and Lycopersicon (Solanales) < monocots. By one measure, the "accurate" prediction of start and stop codons, the Medicago (Mt) model was marginally better than the Arabidopsis (At) model, and both of these surpassed the remaining four predictive models. It would be of considerable interest to know whether and to what extent an FGENESH model trained on Rosaceae sequence data would outperform the Mt and At models for predictive analysis of Fragaria sequence; however, such a model was not available for the present study.
Our analysis indicated that the six FGENESH models were variably prone to over-prediction, under-prediction, gene-merging, and/or gene-splitting. However, with knowledge of these tendencies in hand, the overall perspective provided by comparisons among these disparate models provided a useful backdrop to the interpretation of homology-based analyses, helping to draw attention to structural anomalies worthy of further exploration. On balance, our experience suggests that integrated consideration of all six FGENESH model outputs provided maximal insight into the genetic content of the studied Fragaria sequences. As easily visualized by viewing a broad sampling of the Fosmid Figures, the FGENESH models were in substantial agreement in some genomic regions, but at considerable variance in others. Yet such disagreements are themselves informative, potentially drawing attention to sites of unconventional functionality.
The genomic frequency of gene sites identified in our study of 20 gene-targeted fosmid clones is similar to that found in 31 randomly selected clones from the same genomic library. Discounting 11 TE-related gene sites, the present study identified 120 protein encoding genes and pseudogenes within a total of ~708 kb, or an average of one gene site per 5.9 kb. Similarly, the companion study identified 182 gene sites in 1,035 kb, or one gene per 5.7 kb. That an intentional focus on gene-rich, as opposed to randomly selected, genomic sites yielded similar protein-encoding gene densities is consistent with the finding that TE and other repetitive sequence content in the F. vesca genome is quite low, and that most of this genome is, in fact, gene rich.
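The two density figures quoted above follow directly from the reported totals. The short Python sketch below reproduces that arithmetic; the function and variable names are illustrative only and are not part of the original analysis.

```python
def kb_per_gene_site(total_kb, gene_sites):
    """Average spacing of gene sites, in kb per gene site."""
    return total_kb / gene_sites

# Totals as reported in the text (TE-related sites already excluded for the targeted set).
targeted_kb_per_gene = kb_per_gene_site(708, 120)   # 20 gene-targeted fosmids
random_kb_per_gene = kb_per_gene_site(1035, 182)    # 31 randomly selected fosmids

print(f"targeted: one gene site per {targeted_kb_per_gene:.1f} kb")
print(f"random:   one gene site per {random_kb_per_gene:.1f} kb")
```

Rounded to one decimal place, the two values match the 5.9 kb and 5.7 kb figures given in the text.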
A surprising finding in the present study was the number of targeted genes, as well as non-targeted genes, that were tandemly duplicated. Full length tandem gene duplications of targeted genes were seen in fosmids 14K06 (ADH), 73I22 (CHS), 41O22 (TPS), 76K13 (PISTILLATA), 19M24 (NBS-LRR resistance-like), and 34E24 (NBS-LRR resistance-like). Tandem or near-tandem duplications of genes or gene fragments not targeted by probes were also seen. Such duplications involved apparently truncated pseudogenes on fosmids 13I24 (CIPK20 KINASE), 32L07 (SMC2), and 41O22 (pentatricopeptide containing protein). On fosmid 08G19, two apparently full-length copies of a small basic intrinsic protein gene flanked a retroelement-like sequence. Although the clustering and neighboring duplication of disease resistance-like genes is a well-known phenomenon in plants , it is noteworthy that, excluding the resistance-like genes, none of the homologues to the tandemly duplicated Fragaria genes enumerated above were themselves tandemly duplicated in Arabidopsis.
EST support varied with respect to the members of tandem gene duplicates. Substantial EST support existed for both CHS copies. In contrast, EST support was lacking for ADH-1, but was sufficient to allow definition of a gene model for ADH-2. Only one of the two PISTILLATA copies had top-tier EST support. Similarly, only one of the two tandemly duplicated NBS-LRR resistance-like genes on fosmid 19M24 had any top-tier EST support, while no such EST support existed for either TPS copy on fosmid 41O22 or for the single copy on fosmid 53J04. The absence of EST support for one or both members of a tandemly duplicated gene pair might be the consequence of differential expression patterns, or might simply be attributable to sampling bias in the existing Fragaria EST database, wherein most of the currently available sequences are from whole seedlings of Fragaria vesca subjected to a handful of stressors. Alternatively, absence of EST support might be indicative of mutational gene silencing, which is one of several possible evolutionary fates of duplicated genes. Resolution of these possibilities awaits the much-needed expansion of the Fragaria EST database to include a comprehensive diversity of tissue types and to represent the influence of a broad spectrum of environmental variables.
The complete elucidation of candidate gene sequences from strawberry opens many opportunities to test functional predictions as they relate to plant productivity. The information obtained from analysis of LEAFY, SOC, PHYA, HY5, and CO may provide a means to translate knowledge about flowering from Arabidopsis and other species to strawberry. Strawberry species exhibit a wide range of photoperiodic behaviors. These are of intense interest to breeders, as photoperiod sensitivity strongly dictates the utility of a given cultivar.
Anthocyanin pigmentation is an important aspect of fruit color and quality, but also can be a factor in stress resistance and other physiological functions and environmental interactions throughout the plant. The identified CHS, CHI, DFR, and RAN genes are likely to be factors in many aspects of anthocyanin pigment composition and spatiotemporal distribution. Along with the anthocyanin pathway gene products, terpene synthases play a demonstrated role in flavor and fragrance as aspects of fruit quality, also making them of interest to strawberry breeders.
Two other metabolic genes, ADH and GBSSI, were of interest because of their widespread usage in plants, and their specific recent usage in Fragaria [5, 19, 29], for phylogenetic analysis. The finding that the ADH gene is tandemly duplicated in F. vesca, and the differential EST support for its two gene copies, further extends the potential interest in ADH as a focal point for comparative evolutionary studies in Fragaria. The GBSSI gene sequence described herein is that of GBSSI-1, as distinct from the GBSSI-2 gene used in the phylogenetic analysis of Fragaria by Rousseau-Gueutin et al. The presence of at least two copies of the GBSSI gene is a general feature of the Rosaceae family.
Disease resistance genes are of central interest to plant breeders. Conserved segments of NBS-LRR resistance-like genes have been isolated from genomic DNA in many plant species, including strawberry, using degenerate primers targeted to conserved sites. The NBS-LRR and LRR resistance-like gene sequences we present here are the first complete genomic disease resistance-like gene sequences to be reported in strawberry.
As the number of sequenced genomes grows, various studies have examined gene arrangement between sequenced genomes in the interest of inferring evolutionary relationships. One recent study defined microsyntenic relationships by examining colinearity of Prunus (a close taxonomic neighbor of Fragaria), Populus, Medicago and Arabidopsis. A positive relationship was defined as a distance not less than 200 kb that contained four gene pairs . Comparisons using this approach relating Prunus and Arabidopsis genomes indicated that microsynteny is not well-conserved between these species. In the present study gene-pair relationships were examined between the genes ordered in the fosmids and the known gene order in Arabidopsis. Not surprisingly, similar results were obtained to those in the Prunus-Arabidopsis comparisons. The data in Table 3 indicate that out of the set of 20 fosmid clones, only nine shared evidence of potential gene-pair relationships with Arabidopsis.
The data agree well with the conclusions of Jung et al. Some clear special cases should nevertheless be considered carefully. The two adjacent genes on fosmid 34E24 are NBS-LRR genes. These are typically found as proximally located members of a multigene family, so it is not surprising that they would be detected as colinear in these analyses. Two fosmids contain strawberry terpene synthase genes, whereas Arabidopsis has only one. In both cases an immediate neighbor has an Arabidopsis counterpart, yet these counterparts are found on different linkage groups. This finding indicates the possibility that the terpene synthase gene may have been a site for duplication in strawberry relative to a common ancestor, or perhaps a site of duplication within strawberry.
EST support and coverage
The genomic sequence analyzed provides a means to test gene prediction against actual gene-coding sequence, best estimated by analysis of EST relationships. Of the total predicted genes on all fosmids, approximately half (78/148) maintain >85% identity with an EST in the Viridiplantae database. When compared against ESTs from the Rosaceae even fewer matches were obtained, and those were typically from Malus, Rosa and Prunus where significant EST resources exist. Of all of the sequences featuring EST cognates, only fourteen genes have sufficient EST support to provide complete delineation of exon/intron boundaries as a basis for gene modeling, while 76 gene sites had no top-tier Fragaria EST support. Exemplifying the latter case, support was lacking for fosmid 08G19 gene 3 (Figure 12) and fosmid 10B08 gene 1 (Figure 11). The first is annotated only as an embryo defective transcript and the second is Leafy. Both of these are examples where transcripts may be expected to be found in specialized tissues and/or developmental contexts. Therefore, it is not surprising that representative cDNA sequences do not appear in the public databases, wherein over 90% of sequences represent seedling transcripts in response to abiotic stress.
Taken together, these findings indicate the need for more Fragaria EST coverage, especially from specific tissues and developmental states. EST coverage from other diploid species, such as Fragaria iinumae, will be helpful in the development of subgenome-specific markers in the cultivated strawberry Fragaria ×ananassa. The reciprocal condition also exists, where fosmid-based sequences have EST coverage, but it is either confined to the Rosaceae (no match in Viridiplantae) or possibly strawberry specific (no match beyond Fragaria). These uncharacterized expressed sequences are abundant in EST collections but were not identified in this study.
The identification of SSR loci for use as potential molecular markers for linkage mapping, marker assisted selection, and diversity studies has received considerable attention in Fragaria [33, 34]. A total of 158 SSRs, each consisting of five or more uninterrupted repeats of a motif, were identified. Of the di-nucleotide repeat motif types, AG and AT were by far the most common, as has also been reported in species as diverse as Arabidopsis and rice. Among tri-nucleotide repeat types, AAG was the most common, exceeding the frequency of any other tri-nucleotide type by a factor of at least 2.8. AAG is also the most common tri-nucleotide repeat motif in Arabidopsis, while CCG is the most common type in rice.
The utilized SSRIT program counts only uninterrupted repeat tracts as SSRs. Thus, a continuous sequence such as (TCC)6TCT(TCC)5 (as in fosmid 14K06, SSRs D and E) would be counted as two SSRs by SSRIT because the two TCC tracts are interrupted by a TCT. From the perspective of PCR primer pair design for SSR marker genotyping, this and several other instances of close-proximity SSR tracts would have to be treated as a single SSR locus, amplified by a single primer pair. Thus, the total number of operationally defined SSR loci detected in the fosmid inserts is somewhat less than the total number of 158 counted SSRs. If any pair of SSR tracts separated by less than 100 bp is counted as constituting a single operational SSR locus for purposes of molecular marker development, there are 144 discrete SSR loci, with a frequency of 1 SSR locus per 4.9 kb, or about 200 SSR loci per Mb.
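The distinction between counted SSR tracts and operational SSR loci can be illustrated with a minimal Python sketch. It is not the SSRIT implementation: the regular-expression search, the function names, the restriction to di- and tri-nucleotide motifs, and the demonstration sequence are all assumptions made for illustration. Only the five-repeat minimum and the 100 bp merging threshold come from the text.

```python
import re

def find_perfect_ssrs(seq, min_repeats=5, motif_sizes=(2, 3)):
    """Return sorted (start, end, motif) tuples for uninterrupted repeat tracts.

    Hypothetical stand-in for an SSR finder; coordinates are 0-based, end-exclusive.
    """
    hits = []
    for k in motif_sizes:
        # A k-base motif followed by at least (min_repeats - 1) exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            if len(set(m.group(1))) > 1:  # ignore homopolymer runs such as AAAA...
                hits.append((m.start(), m.end(), m.group(1)))
    return sorted(hits)

def merge_into_loci(ssrs, max_gap=100):
    """Merge tracts separated by less than max_gap bp into single operational loci."""
    loci = []
    for start, end, _motif in ssrs:  # ssrs must be sorted by start position
        if loci and start - loci[-1][1] < max_gap:
            loci[-1] = (loci[-1][0], max(end, loci[-1][1]))
        else:
            loci.append((start, end))
    return loci

# Made-up flanking sequence around the (TCC)6TCT(TCC)5 pattern described in the text.
demo = "GATTACA" + "TCC" * 6 + "TCT" + "TCC" * 5 + "GATTACA"
tracts = find_perfect_ssrs(demo)
loci = merge_into_loci(tracts)
print(len(tracts), "tracts;", len(loci), "operational locus/loci")
```

On the demonstration sequence the search reports two TCC tracts, because the intervening TCT breaks the repeat, and the merging step collapses them into one locus that a single primer pair could amplify.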
The current F. vesca linkage map has a total length of 424 cM. Given the 206 Mb size of the F. vesca genome, there is an average ratio of 486 kb/cM. Extrapolating from these data and the observed frequency of one SSR locus per 4.9 kb, SSR loci are distributed in the F. vesca genome with a density of about 99 SSR loci per 1 cM of map distance, thus indicating that sufficient SSR loci exist to support the construction of SSR-based linkage maps to a resolution of well under 1 cM.
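The extrapolation from physical to genetic distance can be reproduced from the rounded figures quoted in the text (206 Mb genome, 424 cM map, one operational SSR locus per 4.9 kb). The sketch below is only a restatement of that arithmetic; the variable names are illustrative, and the result shifts slightly depending on how the inputs are rounded.

```python
GENOME_MB = 206          # F. vesca genome size, as stated in the text
MAP_CM = 424             # total linkage map length, as stated in the text
KB_PER_SSR_LOCUS = 4.9   # one operational SSR locus per 4.9 kb, as stated in the text

# Average physical distance per unit of genetic distance.
kb_per_cm = GENOME_MB * 1000 / MAP_CM

# Expected number of SSR loci within one centimorgan.
ssr_loci_per_cm = kb_per_cm / KB_PER_SSR_LOCUS

print(f"{kb_per_cm:.0f} kb/cM")
print(f"{ssr_loci_per_cm:.0f} SSR loci per cM")
```

With these inputs the estimate is on the order of 100 SSR loci per cM, which supports the conclusion that marker density is not the limiting factor for sub-centimorgan map resolution.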
In this study, thirteen TE-related elements were detected on the basis of Blastx homology and structural analysis. A thorough analysis of TE-related and other repetitive element content in 31 random sequence samples comprising ~1 Mbp in F. vesca was presented in the companion study, while Ma et al. reported the isolation of retroelement sequences from Fragaria ×ananassa. No top-tier EST support was found for any of the TE-related sequences identified in the present study or that of Ma et al., and no evidence of contemporary TE transpositional activity has been reported to date in Fragaria.