Identification of genic moss SSR markers and a comparative analysis of twenty-four algal and plant gene indices reveal species-specific rather than group-specific characteristics of microsatellites

Background The moss Physcomitrella patens is an emerging model in comparative plant science. At present, the Physcomitrella genome is being sequenced at the Joint Genome Institute (USA). In this study we present our results on the development of expressed sequence tag-derived microsatellite markers for Physcomitrella patens, their classification and their applicability as genetic markers on the intra- as well as on the interspecies level. Comparing our results on Physcomitrella with earlier studies of other plant species was severely restricted by varying microsatellite search criteria and a limited selection of analysed species. As a consequence, we performed a side by side analysis of expressed sequence tag-derived microsatellites among 24 plant species covering a broad phylogenetic range and present our results on the observed frequencies. Results We identified 3,723 microsatellites using the software MISA in a non-redundant Physcomitrella expressed sequence tag database comprising more than 37 megabases of nucleotide information. For 2,951 microsatellites, corresponding primer sequences were derived. PCR of 376 microsatellites yielded 88 % successful amplicons and over 30 % polymorphisms between two Physcomitrella accessions. The polymorphism information content of 64 microsatellites based on 21 different Physcomitrella accessions was comparably high with a mean of 0.47 +/- 0.17. Of the 64 Physcomitrella microsatellite markers, 34 % and 79.7 %, respectively, revealed cross-species applicability in two closely related moss species. In our survey of two green algae, two mosses, a fern, a fern palm, the ginkgo tree, two conifers, ten dicots and five monocots we detected an up to sevenfold variation in the overall frequency, from a minimum of 37 to a maximum of 258 microsatellites per megabase, and a high variability among the different microsatellite class and motif frequencies. Numerous species-specific microsatellite frequencies became evident and several deviations from earlier reports were ascertained. Conclusion With the Physcomitrella microsatellite marker set a valuable tool has been made available for further genetic and genomic applications on the intra- as well as on the interspecies level. The comparative survey of expressed sequence tag-derived microsatellites among the plant kingdom is well suited for a classification of future studies on plant microsatellites.



Background
The moss Physcomitrella patens (Hedw.) B. S. G. is an important model organism for comparative studies in plant science [1]. The ancestors of mosses and seed plants separated shortly after the transition from water to land at least 500 million years ago [2,3]. The moss Physcomitrella is therefore placed in a phylogenetic key position between the green algae and the seed plants. Physcomitrella displays an exceptionally high rate of homologous recombination [4], which is a unique characteristic among plants. This facilitates direct replacement of genomic loci to knock out or knock in genes, enabling their fast and straightforward functional characterisation [5]. Functional mutations are furthermore facilitated by the dominating haploid gametophyte of the moss. Besides, Physcomitrella is easy to handle in vitro and to transfect, and is regarded as a rich source of novel genes [6]. More than 200,000 sequenced cDNA fragments, so-called expressed sequence tags (ESTs), derived from the worldwide lab strain 'Gransden' have been assembled and annotated in a non-redundant database, a Physcomitrella gene index [7][8][9]. At present, the Physcomitrella patens genome is being sequenced by a whole genome shotgun approach at the Joint Genome Institute (USA), and the associated international moss genome consortium collaborates in processing and assembling the genome data. Little is known about the genome organisation yet. The Physcomitrella genome is of intermediate size with about 511 megabases [10] and cytogenetic analyses indicate a chromosome number of n = 27 [11]. Neither molecular markers nor genetic linkage maps have been established so far. Thus our objective was to establish EST-derived microsatellites in order to be able to create a genetic map for Physcomitrella patens.
Microsatellites or simple sequence repeats (SSRs) denote a DNA class of mono- up to hexanucleotide sequence repeats dispersed over the whole genome with an accumulation in nonrepetitive DNA and untranslated 3'- and 5'-regions of genes [12,13]. SSRs are currently preferentially applied as molecular markers in numerous organisms, particularly because of their unique hypervariability combined with co-dominance, specificity and reproducibility [14,15]. The main disadvantage of SSRs as markers has been their time-consuming development in the laboratory [16]. However, with the fast-paced increase of nucleic acid sequences during the last decade it became practicable to screen in silico for microsatellites in sequence databases for a growing number of organisms. Several tools have been made available for the computational database mining of SSRs, reviewed in [17]. Apart from genomic sequences, especially the large number of available ESTs and the respective databases have been used extensively to derive SSRs, for example [18][19][20][21][22][23]. A big advantage of EST-derived markers is their non-anonymity. Each marker is absolutely linked to a distinct gene and therefore to its known or putative function. Moreover, each marker sequence can be extended by the underlying EST. This in particular can be of great benefit in the implementation of genetic markers and linkage maps as a scaffold for physical mapping.
A clear-cut comparison of the first large scale EST-based SSR search for a moss with results obtained for other plant species was virtually impossible due to the large differences in the applied SSR search strategies concerning the redundancy and the chosen parameters for the minimal SSR motif length or the inclusion/exclusion of imperfect SSRs [17]. In wheat, for example, these differences led to reported SSR frequencies ranging from 57 [24] up to 1,350 [12] SSRs per megabase. Furthermore, comparative analyses of SSR frequencies based on large scale computational database searches have been limited to only a few groups of mono- and dicots [12,13,18].
In this study we present our results on the development of SSR markers for Physcomitrella patens based on clustered EST data, their classification and characterisation and their applicability as genetic markers on the intra- as well as on the interspecies level. For a better classification of our results on Physcomitrella SSRs, we performed a comparative side by side EST-based SSR search in 24 phylogenetically well distributed plant species including two green algae, two mosses, a fern, a fern palm, the ginkgo tree, two conifers, ten dicots and five monocots and present our results on the observed SSR frequencies.

Results and discussion
Within the Physcomitrella patens gene index dimer SSRs are most frequent
We analysed 48,961 virtual Physcomitrella transcripts with the MISA software [25,26]. In 3,108 (6.3 %) of the EST sequences one or more microsatellites were found; 3,723 microsatellite motifs were detected in total. This corresponds to a frequency of 98.8 SSR counts per megabase pair (counts/Mbp), or one SSR per 10.1 kilobase pairs (kbp). Interestingly, the majority of the SSRs were dimer repeats with 2,095 individual SSRs (56.3 %), whereas only 1,315 (35.3 %) were trinucleotide repeats. This is an unexpected result because trimer repeats are reported to be the most frequent SSRs in plant ESTs [17]. Other SSRs were 273 (7.3 %) tetranucleotides, 39 (1.1 %) pentanucleotides and one hexanucleotide SSR (Table 1). This corresponds to SSR counts/Mbp of 55.6 for dimers, 34.9 for trimers, 7.2 for tetramers, 1.0 for pentamers and 0.03 for hexamers, respectively. Within each of the SSR classes the different possible repeat motifs were not evenly distributed (Table 1). The average motif length including compound SSRs was 21.3 base pairs (bp), with 26 % of the motifs being longer than 20 bp. In total, 472 EST sequences contained more than one SSR: 381 ESTs contained two SSRs (80.7 %), 59 contained three (12.5 %), 23 contained four (4.9 %), six contained five (1.3 %) and one EST each contained six, seven and 13 SSRs (0.2 % each). For the generation of PCR markers, multiple SSRs in one EST separated by less than 100 bp were defined as being in 'compound formation' and subsequently handled as one single potential marker. Multiple SSRs separated by more than 100 bp were treated as separate single marker loci. In total, 3,171 SSRs were available as potential marker loci, comprising 2,924 SSRs in single and 247 SSRs in compound formation.
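The search and grouping logic described above can be illustrated with a minimal Python sketch. It is not the MISA implementation: it finds perfect dimer to hexamer repeats with regular expressions, groups SSRs separated by less than 100 bp into one potential marker locus, and converts a count into counts/Mbp. The minimum repeat numbers in `MIN_REPEATS`, the function names and the toy sequence are assumptions made for illustration; the thresholds used in the study are not stated in this section.

```python
import re

# ASSUMED minimum repeat numbers per motif length (bp); the thresholds
# actually used with MISA in the study are not stated in this section.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_reducible(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'ACAC')."""
    k = len(motif)
    return any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0)


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeats) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, n in min_repeats.items():
        # one motif of length k followed by at least n - 1 further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            if not is_reducible(m.group(1)):
                repeats = (m.end() - m.start()) // k
                hits.append((m.start(), m.end(), m.group(1), repeats))
    return sorted(hits)


def group_loci(hits, max_gap=100):
    """Group SSRs separated by less than max_gap bp into one marker locus."""
    loci = []
    for hit in hits:
        if loci and hit[0] - loci[-1][-1][1] < max_gap:
            loci[-1].append(hit)      # compound formation
        else:
            loci.append([hit])        # new single locus
    return loci


def counts_per_mbp(n_ssrs, total_bp):
    """SSR density in counts per megabase pair."""
    return n_ssrs * 1_000_000 / total_bp


# Hypothetical toy sequence with one dimer and one trimer SSR, 6 bp apart.
est = "TTGC" + "AG" * 8 + "CCATTC" + "AAG" * 5 + "GTC"
hits = find_ssrs(est)
print(hits)
print(len(group_loci(hits)), "potential marker locus")
```

In the toy sequence both SSRs fall within 100 bp of each other and therefore end up in a single compound locus, mirroring the rule used for marker generation.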

A high rate of Physcomitrella SSRs are PCR applicable
Oligonucleotide primer sequences were successfully derived for 2,951 (93 %) of the 3,171 single and compound microsatellites using the Primer3 [27] software in batch mode with MISA. Primers could not be designed for SSR motifs comprising too short or inappropriately composed flanking sequences.
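As a rough illustration of such a batch primer design, the following Python sketch writes one Primer3 input record per SSR. The numeric values are those given in the Methods of this paper (amplicon length, primer length, GC content, melting temperature, at most three primer pairs). The tag names follow the Boulder-IO format of Primer3 2.x and are an assumption here; the Primer3 release used in the study may have used different tag names. The function name and the example arguments are hypothetical.

```python
def primer3_record(seq_id, template, ssr_start, ssr_length):
    """Build one Boulder-IO record asking Primer3 for primers flanking an SSR."""
    tags = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        # region that the amplicon must span: "start,length" of the SSR
        ("SEQUENCE_TARGET", f"{ssr_start},{ssr_length}"),
        ("PRIMER_PRODUCT_SIZE_RANGE", "100-280"),
        ("PRIMER_MIN_SIZE", 18),
        ("PRIMER_OPT_SIZE", 20),
        ("PRIMER_MAX_SIZE", 27),
        ("PRIMER_MIN_GC", 20),
        ("PRIMER_OPT_GC_PERCENT", 50),
        ("PRIMER_MAX_GC", 80),
        ("PRIMER_MIN_TM", 57),
        ("PRIMER_OPT_TM", 60),
        ("PRIMER_MAX_TM", 63),
        ("PRIMER_PAIR_MAX_DIFF_TM", 1),
        ("PRIMER_NUM_RETURN", 3),
    ]
    # each record is terminated by a line containing only "="
    return "".join(f"{key}={value}\n" for key, value in tags) + "=\n"


# Hypothetical EST with an SSR of 16 bp starting at position 80.
print(primer3_record("est_0001", "ACGT" * 50, 80, 16))
```

Records of this kind can be concatenated into one input file, so that primer design for several thousand SSRs runs unattended.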
For 376 SSRs, primer pairs were synthesized and PCR was performed using the standard Physcomitrella lab strain 'Gransden'. PCR led to successful product amplifications in 329 of 376 SSRs (87.5 %), of which 27 SSR amplicons (7.2 %) yielded longer sequences than expected (size difference from 0.1 up to 2 kbp), most likely due to the presence of introns. The frequency of successfully amplified SSRs was more than 20 % higher than described for Hordeum vulgare, where SSR markers were also derived with MISA/Primer3 [25]. This may be due to the high quality of the clustered Physcomitrella EST database and our specific PCR conditions with a touchdown PCR and a decreased final primer annealing temperature. Moreover, differing intron frequencies with 22 % in Hordeum [25] and 8.4 % in Physcomitrella, as detected in SSR amplicons, may also contribute to the higher rate of successful PCRs in Physcomitrella.

Table 1 (fragment; column headings not recoverable, second row incomplete)
NN   868  395  210  160  116  76  59  50  53  108  2095
NNN  738  283  137  81  48  12  10  1  2  2  1

The Physcomitrella SSR markers display a high degree of polymorphism
The applicability of the EST-derived SSRs as molecular markers could be proven with side by side amplification of the 376 SSRs in the lab strain 'Gransden' and the French accession 'Villersexel-K3', both of which we are using as parental lines in a genetic mapping approach. 110 markers (33.1 %) were polymorphic between the two accessions (Figure 1).
The informative properties of the EST-derived microsatellites were further evaluated with 64 SSR markers selected from the 376 markers and a collection of 21 worldwide Physcomitrella accessions (Table 2, Figure 2). In total, 238 alleles, including 30 null alleles (13.2 %), were detected with a maximum of 7 alleles and an average of 3. […] from 0.0 to 0.78 with a mean PIC value of 0.47 ± 0.17. This PIC is comparably high for EST-derived SSRs, which are generally known to be less informative than SSRs derived from genomic sequences [28,29]. An explanation for the high amount of SSR polymorphism, including numerous null alleles, could be a high degree of genetic diversity in the studied accessions.

Datasets of 24 plant species covering a broad phylogenetic spectrum
We aimed to achieve a clear-cut comparison of our results for Physcomitrella with those of other plant species. Unfortunately, classifying surveys on EST-derived SSRs was rather complicated due to large differences in the applied SSR search strategies and the limited number of analysed species. As a consequence, we identified microsatellites in a comparative side by side search using the MISA software upon the gene indices of 24 plant species. Accounting for the present availability of EST sequences, the species selection encompasses as many major clades of the plant kingdom as possible. In total, two green algae, two true mosses, a fern, a fern palm, the ginkgo tree, two conifers, ten dicots and five monocots were analysed (see Additional file 5). However, for some classes among the plants no appropriate EST datasets were available, e.g. liverworts, hornworts and magnoliidae.
The chosen plant sequence datasets consisted of assembled and non redundant EST sequences ('tentative consensus sequences') and of single non redundant ESTs

The overall SSR frequency varies up to sevenfold among plants
The MISA search statistics for all 24 analysed plant species have been made available at COSMOSS [41]. The overall frequency of SSRs varied 6.9-fold among the 24 gene indices (Figure 3, see also Additional file 1). The average SSR frequency was 114.7 counts/Mbp with a standard deviation of +/- 60.1 counts/Mbp. The green alga Mesostigma, the fern palm Cycas, both conifers (Pinus and Picea) as well as the monocot Allium cepa deviated significantly from the average with an extremely reduced SSR frequency of 37.3, 52.9, 41.5, 47.3 and 38.9 counts/Mbp, respectively. The monocot Oryza, however, as well as the dicots Aquilegia and Mesembryanthemum revealed a significantly increased frequency with 258, 240 and 239 counts/Mbp, respectively. The high SSR frequency of rice has been reported in earlier results [18]. The hitherto unnoted significantly biased frequencies of the other seven species indicate a much higher variability of EST-derived SSR frequencies among plants than previously reported. Our data for Cycas, Ginkgo, Picea and Pinus seem to indicate that a low SSR frequency is an intrinsic characteristic of gymnosperm gene indices. Among the monocots we ascertained the highest variability in the SSR frequencies with Allium (38.9 counts/Mbp) and Oryza (258.0 counts/Mbp).
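The fold variation and the size of the deviations quoted above can be retraced with a few lines of Python. The frequencies, mean and standard deviation are the values stated in the text; expressing each deviation in units of the standard deviation is only a descriptive aid assumed here, since the significance test applied in the study is not named in this section.

```python
# SSR frequencies (counts/Mbp) as quoted in the text.
freq = {
    "Mesostigma": 37.3, "Cycas": 52.9, "Pinus": 41.5, "Picea": 47.3,
    "Allium": 38.9, "Oryza": 258.0, "Aquilegia": 240.0,
    "Mesembryanthemum": 239.0,
}
MEAN, SD = 114.7, 60.1  # average and standard deviation over all 24 gene indices

# ratio of the highest to the lowest overall SSR frequency
fold = max(freq.values()) / min(freq.values())
print(f"fold variation: {fold:.1f}")

# deviation from the mean in SD units (descriptive only, not the study's test)
for species, f in sorted(freq.items(), key=lambda kv: kv[1]):
    print(f"{species:18s}{f:7.1f}{(f - MEAN) / SD:+7.2f} SD")
```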
Pearson's correlation coefficients (R) of the SSR frequencies with the average GC-content of the analysed species were not significant with -0.12 for dimer SSRs, 0.23 for trimer, 0.24 for other SSRs and 0.04 for the total SSRs. Thus correlations of SSR frequencies with the average GC-content of the analysed species could not be concluded based on our data.
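For readers who wish to repeat this kind of check on the per-species values in the additional files, Pearson's R can be computed as in the following sketch. The function is a plain textbook implementation; the two short input lists are invented placeholder numbers, not data from this study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# HYPOTHETICAL placeholder values: GC-content (%) and SSR counts/Mbp per species.
gc_percent = [42.0, 45.5, 48.0, 51.0, 58.0]
ssr_per_mbp = [110.0, 60.0, 240.0, 40.0, 130.0]
print(round(pearson_r(gc_percent, ssr_per_mbp), 2))
```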

Trimer and dimer SSRs are most frequent except for the alga Mesostigma
In all datasets, more than 91 % of the detected SSRs were dimer and trimer repeats with the exception of the green alga Mesostigma where they accounted for only 58 % of the total SSRs. Among tetra-, penta- and hexamers, the tetramers are the most abundant in 20 of 24 plants, whereas in the remaining four plant species hexamer SSRs are most abundant (Figure 3, see also Additional file 1).
The ratio of the two most common SSR classes, the dimer and trimer repeats, varied among the datasets. In 15 plants trimer SSRs were more frequent than dimer SSRs. Interestingly, the higher dimer than trimer SSR frequency of Physcomitrella was also true for eight more species among the analysed sample. This dominance of dimer SSRs in nine of 24 plants contradicts the general assumption that trimer SSRs are most frequent in plant ESTs [17]. […] counts/Mbp (7.9 % of the total SSRs). The increased tetramer frequency was mainly due to an increased AATT/TTAA motif frequency (9.9 counts/Mbp) and the increased pentamer frequency due to an increased AAATT/AATTT frequency (2.4 counts/Mbp).

Dimer SSRs are on average longer and more variable in length than trimer SSRs
Dimer SSRs are much more variable in length throughout the analysed species than trimer SSRs with the exception of Mesostigma (Figure 4, see also Additional file 2). The average standard deviation of the dimer SSR length with +/- 11.9 bp is more than two times larger than that of trimer SSRs with only +/- 5.2 bp. On average, dimer SSRs (18.7 bp +/- 2.9) were significantly longer (p = 0.03) than trimer SSRs (17.5 bp +/- 0.8). The reduced average dimer length in Mesostigma of only 13.0 bp with a small standard deviation (+/- 1.4 bp) might be correlated with the correspondingly reduced dimer counts/Mbp. The average dimer length was longest in the two conifers with 24.5 bp and 23.4 bp, respectively. Further significant deviations from the average SSR length (see Additional file 2) could be correlated neither with SSR counts/Mbp nor with certain taxonomic clades.

The SSR dimer motif AG/CT is exceptionally abundant in Mesembryanthemum
The abundancies of the four canonical non-redundant SSR dimer motifs varied strongly among the analysed plants (Figure 5, see also Additional file 3). The average counts/Mbp for the SSR motifs AG/CT, AC/GT, AT/TA and CG/GC were 27.4 (+/- 27.6), 7.8 (+/- 7.5), 9.8 (+/- 6.8) and 1.3 (+/- 1.9), respectively. The large standard deviations reflect the extremely varying motif abundancies in the analysed species. As an amendment to earlier studies, where AG/CT was found to be the most abundant dimer motif in several plants [12], in our analysis this was true for only 16 of the 24 datasets. In Chlamydomonas, AC/GT, and in Mesostigma, Ginkgo, Picea, Pinus, Gossypium, Solanum and Allium, AT/TA was the most abundant dimer SSR motif. For Picea, the high abundance of AT/TA repeats has been noted in earlier reports [42]. In Mesembryanthemum an exceptionally high frequency of the AG/CT motif with 140 counts/Mbp is the main contribution to the species' increased overall SSR frequency.

Average length of dimer as well as trimer microsatellites among the gene indices of 24 plants

Observed abundancies of SSR trimer motifs amend earlier reports
The abundancies of the ten canonical non-redundant SSR trimer motifs varied as well ([…]). In the four grasses, CCG/CGG was the most abundant trimer motif and the same was true for Chlamydomonas. The increased CCG/CGG frequency has been described earlier for grasses and has been related to a high general GC-content [12]. In this context the CCG/CGG increase in Chlamydomonas, which has the highest GC-content (58.3 %) of all 24 gene indices, was consistent, whereas the low CCG/CGG frequency in the alga Mesostigma and the moss Tortula, which also had significantly increased GC-contents of 51.7 and 53.0 %, respectively, did not follow the earlier assumed rule.
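The numbers of four canonical dimer and ten canonical trimer motifs follow from treating all cyclic shifts of a motif and of its reverse complement as one class. The Python sketch below enumerates these classes. Using the alphabetically smallest member as the class label is a convention assumed here for illustration; the text names the classes by pairs such as AG/CT or CCG/CGG.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(motif):
    """Alphabetically smallest cyclic shift of the motif or its reverse complement."""
    revcomp = motif.translate(COMPLEMENT)[::-1]
    forms = [s[i:] + s[:i] for s in (motif, revcomp) for i in range(len(s))]
    return min(forms)


def canonical_classes(k):
    """All canonical classes of motif length k, skipping repeats of shorter units."""
    classes = set()
    for letters in product("ACGT", repeat=k):
        motif = "".join(letters)
        if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
            continue  # e.g. 'AA' or 'ACAC' belong to a shorter motif class
        classes.add(canonical(motif))
    return sorted(classes)


print(canonical_classes(2))       # the four dimer classes
print(len(canonical_classes(3)))  # number of trimer classes
print(canonical("CGG"), canonical("TGA"))
```

Under this labelling CCG/CGG appears as `CCG`, and the three stop codons TAA, TGA and TAG fall into the classes `AAT`, `ATC` and `ACT`, which correspond to the AAT, ATG and AGT motifs discussed in the text.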

Frequencies of dimer microsatellite motifs among the gene indices of 24 plants
The detected low CCG/CGG frequency in Allium contradicts the earlier reported generalisation that an increased CCG/CGG frequency is specific for monocots [12,18,43]. An exceptionally high frequency of the SSR motif AAG/CTT contributes to the increased overall SSR frequency in Aquilegia. The canonical motifs encompassing the three stop codons AAT, ATG and AGT were not exceptionally reduced, most likely due to the presence of the untranslated regions in the EST databases.

Conclusion
In this study we present the first comprehensive overview of EST-derived microsatellites in a moss, the model plant Physcomitrella patens. We detected 3,723 SSRs in silico using clustered and assembled EST data. All 2,951 derived primer sets are publicly available, making the markers accessible for PCR analysis. 88 % of the SSRs can be successfully applied to PCR. With the EST-derived SSR marker set a valuable tool has been made available for numerous further genetic and genomic applications on the intra- as well as on the interspecies level. Using the SSR markers, a genetic map can be established, and the available sequence extensions by the underlying ESTs will greatly facilitate the implementation of the SSR marker loci during the pending iterative assembly process of the Physcomitrella genome sequence data.

Frequencies of trimer microsatellite motifs among the gene indices of 24 plants
In our comparative survey microsatellites were found with highly variable abundancies within the EST datasets of 24 phylogenetically well distributed plant species. Interestingly, species belonging to the same phylogenetic group, such as the two mosses, the dicots as well as the monocots did not reveal consistent class-specific SSR characteristics. Rather, species-specificities seem to contribute to the high overall variation. Two exceptions to this rule are the low overall SSR frequency of gymnosperm ESTs, which may be an intrinsic characteristic, and the abundance of the SSR trimer motif CGG/CCG, which seems to be a distinct feature of grasses.
The earlier assumed general dominance of trimer SSRs in ESTs was true for only 15 of the 24 analysed plant species. Dimer SSRs on average were longer and revealed a much higher variability in length than trimer SSRs. Correlations of SSR frequencies with the average GC-content could not be concluded. While the previously reported connection of a high general GC-content to an increased frequency of the trimer SSR motif CGG/CCG was valid for four grasses and Chlamydomonas, this was not true for the alga Mesostigma and the moss Tortula. The earlier assumed general high abundance of the SSR motif CGG/CCG for monocots was not true for Allium. Our survey is well suited for a classification of future studies on EST-derived SSRs.

Plant material
A collection of 21 worldwide Physcomitrella accessions including the standard laboratory accession 'Gransden' plus two related Funariaceae, Physcomitrium sphaericum and Funaria hygrometrica, were selected for analysis (Table 1). The collection includes Japanese, Australian, African, European and American accessions [44]. Details of the collection will be presented elsewhere (Mark von Stackelberg, Gabriele Schween, Stefan A. Rensing, Ralf Reski, manuscript in preparation). Plants were grown axenically on solid media and in liquid culture according to the small scale cultivation protocol as described [45]. Prior to DNA extraction, plant material was harvested, shock frozen in liquid nitrogen and stored at -80°C.

DNA extraction
The DNA extraction protocol is based on Doyle and Doyle [46]. Approximately 0.3 g plant material was ground in liquid nitrogen to a fine powder and incubated in 5 […]

[…] [35,36] and from NYPG [38][39][40]. For the moss Physcomitrella patens we used a non-redundant gene index of the standard lab strain 'Gransden' consisting of high quality, vector-clipped, clustered, assembled and annotated ESTs available via COSMOSS [9,37]. The same filtering and assembly procedure as for Physcomitrella was applied to derive a gene index for all available EST sequences of the moss Tortula ruralis, also available via COSMOSS. All applied databases are summarized in Additional file 5. The percentage of the nucleotides guanine and cytosine (GC-content) was calculated for the gene indices with the EMBOSS [47] program GEECEE.
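The GC-content of a gene index is the share of G and C among all nucleotides, pooled over all sequences. The short Python sketch below shows the calculation. It is not the GEECEE program itself, and excluding ambiguous bases such as N from the denominator is an assumption made here.

```python
def gc_content(sequences):
    """Pooled GC-content (%) over an iterable of nucleotide sequences."""
    gc = total = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        # ASSUMPTION: only unambiguous bases count towards the total
        total += sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("no unambiguous nucleotides found")
    return 100.0 * gc / total


# Hypothetical toy 'gene index' of three short sequences.
print(gc_content(["GGCCAT", "aattgc", "GCNNAT"]))
```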

Analysis of Physcomitrella EST-derived microsatellites
For Physcomitrella patens, MISA was used in direct conjunction with the Primer3 software [27] to derive appropriate PCR oligonucleotides. The results of the microsatellite search for Physcomitrella patens are available at COSMOSS [41]. A maximum of three different primer pairs per SSR have been provided. Furthermore PCR results for 376 microsatellites, polymorphic SSRs between 'Gransden' and 'Villersexel-K3' and PIC values of 64 SSRs for 21 Physcomitrella accessions have been provided at COSMOSS [41].
The length of the SSR PCR amplicons was set to 100-280 bp. Oligonucleotide parameters for Primer3 were set to a length of 18-27 bp with an optimum of 20 bp, a GC content of 20-80 % with an optimum of 50 %, a melting temperature (Tm) of 57-63°C with an optimum of 60°C, and a maximum primer Tm difference of 1°C. SSR markers were amplified in a 20 µL PCR mix containing 2 µL of 10 × RED-Taq-PCR buffer, 0.1 mM dATP, dCTP, dGTP and dTTP, 5 pmol each of two primers, 0.5 Units RED-Taq-Polymerase (SIGMA-Aldrich) and 4 ng plant DNA. Cycling was carried out in a Biometra thermal cycler T1 starting with an initial DNA denaturation at 95°C for 2 min. The first cycle consisted of 30 sec denaturation at 92°C, primer annealing for 30 sec at 60°C and elongation for 30 sec at 72°C. In each of the 10 subsequent cycles the annealing temperature was decreased by 0.7°C. The final 25 cycles consisted of 15 sec denaturation at 92°C, 15 sec primer annealing at 52°C and 30 sec elongation at 72°C.
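The touchdown cycling described above can be written out as an explicit list of cycles, which makes the annealing temperature of every cycle easy to check. In the following Python sketch the temperatures, the step of 0.7°C and the cycle numbers are taken from the text. That the ten touchdown cycles keep the 30 sec step times of the first cycle is an assumption, as the text does not state their durations.

```python
def touchdown_program(start=60.0, step=0.7, n_touchdown=10,
                      final_anneal=52.0, n_final=25):
    """Return cycles as (denature, anneal, extend), each a (temp_C, seconds) pair."""
    cycles = [((92, 30), (start, 30), (72, 30))]  # first cycle
    for i in range(1, n_touchdown + 1):
        # ASSUMPTION: touchdown cycles use the same step times as the first cycle
        anneal_temp = round(start - i * step, 1)
        cycles.append(((92, 30), (anneal_temp, 30), (72, 30)))
    # final cycles at the fixed, decreased annealing temperature
    cycles += [((92, 15), (final_anneal, 15), (72, 30))] * n_final
    return cycles


program = touchdown_program()
anneal = [cycle[1][0] for cycle in program]
print(len(program), "cycles")
print(anneal[:12])
```

Under this reading the annealing temperature falls from 60°C to 53°C over the touchdown phase before the final 25 cycles run at 52°C, giving 36 cycles in total.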
Electrophoretic size separation of SSR PCR products was performed in 3 % MetaPhor (Cambrex) high resolution agarose gels in 0.5 fold TBE (45 mM Tris-borate, 1 mM EDTA, pH 8.0). We thereby circumvented the lengthy and elaborate use of denaturing polyacrylamide gels. As described in [48] MetaPhor agarose is capable of resolving microsatellite size differences of only two basepairs. The comparably high costs of this agarose can be reduced by reusing it several times.

SSR-PCR evaluation
SSRs were scored visually in the gel according to their amplified fragment size. Different sizes of one marker were scored as different alleles. Same sizes of one SSR and different sizes which could not be visually distinguished were scored as the same allele. Repeated absence of PCR products in 'Gransden' was scored as PCR failure. Repeated absence of PCR products in the 20 additional accessions was scored as a null allele if PCR worked in 'Gransden'. The polymorphism information content (PIC value) was calculated according to Botstein et al. [49] as follows:
$PIC_i = 1 - \sum_{j=1}^{n} P_{ij}^2$
where $P_{ij}$ is the frequency of the jth allele for marker i and summation extends over n alleles.
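A PIC value of this kind can be computed from the allele calls of one marker as in the following Python sketch. It assumes the single-summation form, PIC = 1 minus the sum of squared allele frequencies, which matches the definition of the symbols given above; treating a null allele as one more allele category is likewise an assumption. The allele calls in the example are invented.

```python
from collections import Counter


def pic(allele_calls):
    """PIC of one marker as 1 - sum of squared allele frequencies (assumed form)."""
    counts = Counter(allele_calls)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no allele calls given")
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# HYPOTHETICAL calls for one marker in 20 accessions: fragment sizes in bp,
# with "null" standing for a repeatedly absent PCR product.
calls = [152] * 10 + [156] * 5 + [160] * 3 + ["null"] * 2
print(round(pic(calls), 3))
```

A marker with a single allele in all accessions yields a PIC of 0, and the value rises as alleles become more numerous and more evenly distributed.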