Frequency, type, and distribution of EST-SSRs from three genotypes of Lolium perenne, and their conservation across orthologous sequences of Festuca arundinacea, Brachypodium distachyon, and Oryza sativa

Background Simple sequence repeat (SSR) markers are highly informative and widely used for genetic and breeding studies in several plant species. They are used for cultivar identification, variety protection, as anchor markers in genetic mapping, and in marker-assisted breeding. Currently, a limited number of SSR markers are publicly available for perennial ryegrass (Lolium perenne). We report on the exploitation of a comprehensive EST collection in L. perenne for SSR identification. The objectives of this study were 1) to analyse the frequency, type, and distribution of SSR motifs in ESTs derived from three genotypes of L. perenne, 2) to perform a comparative analysis of SSR motif polymorphisms between allelic sequences, 3) to conduct a comparative analysis of SSR motif polymorphisms between orthologous sequences of L. perenne, Festuca arundinacea, Brachypodium distachyon, and O. sativa, and 4) to identify functionally associated EST-SSR markers for application in comparative genomics and breeding.

Results From 25,744 ESTs, representing 8.53 megabases of nucleotide information from three genotypes of L. perenne, 1,458 ESTs (5.7%) contained one or more SSRs. Of these SSRs, 955 (3.7%) were non-redundant. Tri-nucleotide repeats were the most abundant type of repeats, followed by di- and tetra-nucleotide repeats. The EST-SSRs from the three genotypes were analysed for allelic and/or genotypic SSR motif polymorphisms. Most of the SSR motifs (97.7%) showed no polymorphisms, whereas 22 EST-SSRs showed allelic and/or genotypic polymorphisms. All polymorphisms identified were changes in the number of repeat units. Comparative analysis of the L. perenne EST-SSRs with sequences of Festuca arundinacea, Brachypodium distachyon, and Oryza sativa identified 19 clusters of orthologous sequences between these four species. Analysis of the clusters showed that the SSR motif is generally conserved in the closely related species F. arundinacea, but often differs in the length of the SSR motif. In contrast, SSR motifs are often lost in the more distantly related species B. distachyon and O. sativa.

Conclusion The results indicate that the L. perenne EST-SSR markers are a valuable resource for genetic mapping, as well as for evaluation of co-location between QTLs and functionally associated markers.


Background
Lolium perenne is one of the major grass species used for turf and forage in the temperate regions of the world. It belongs to the grass family Poaceae. L. perenne (2n = 2x = 14) is taxonomically related to many important plant species in the Poaceae family, including rice (Oryza sativa), wheat (Triticum aestivum L.), barley (Hordeum vulgare L.), maize (Zea mays L.), and sorghum (Sorghum bicolor L.) [1].
Several anonymous molecular markers have been developed for L. perenne, including restriction fragment length polymorphism and random amplified polymorphic DNA [2,3], amplified fragment length polymorphism [4], as well as SSR markers [5,6]. More recently, gene-tagged markers [7] have been developed and used to construct genetic linkage maps [8][9][10]. Although there have been several reports on L. perenne SSR marker development, most of these markers are currently not publicly available [8,9]. Furthermore, synteny to other Poaceae species is based on a limited number of anchor markers [11], reinforcing the need for more publicly available gene-derived EST-SSR markers for L. perenne.
Simple sequence repeats (SSRs) have become one of the most widely used molecular marker systems in plant genetics and breeding. They are widely used for genetic diversity assessment, variety protection, molecular mapping, and marker-assisted selection, providing an efficient tool to link phenotypic and genotypic variation [12][13][14].
SSRs are tandemly repeated sequences composed of mono-, di-, tri-, tetra-, penta-, or hexa-nucleotide units [15,16]. SSRs are ubiquitous in prokaryotes and eukaryotes and can be found in both coding and non-coding regions. They are ideal as molecular markers because of their codominant inheritance, relative abundance, multi-allelic nature, extensive genome coverage, high reproducibility, and simple detection [12].
The number of SSR motifs at a locus is variable, because SSRs experience a high rate of reversible length-altering mutations by unequal crossing over and replication slippage, where the transient dissociation of the replicating DNA strand is followed by misaligned re-association [17,18]. SSRs are among the most variable DNA sequences in the genome [19], and the mutation rate and type depend mainly on the number of repeat motifs [20]. However, the mutation rates differ among loci and among alleles, and also between species [21]. The resulting mutations, which typically add or subtract one or a few repeat motifs, can be reversed by a subsequent mutation at the same or any other point in the repeat motif [22]. In addition, point mutations in a repeat motif may result in an imperfect repeat motif, which in turn can be converted back to a perfect motif by replication slippage, which tends to eliminate imperfect repeats [22].
Whereas earlier studies on SSR marker development primarily utilized anonymous DNA fragments containing SSRs isolated from genomic libraries, more recent studies have used computational methods to detect SSRs in sequence data generated from large-scale EST sequencing projects. About 1 to 5% of ESTs from different plant species have been found to contain SSRs suitable for marker development [23]. EST-SSR markers have been developed for a number of plant species, including grape [24], rice [25], durum wheat [26], rye [27], barley [28], barrel medic [29], ryegrass [8], wheat [30], and cotton [31]. EST-SSR markers are gene-tagged markers directly associated with an expressed gene and, thus, completely linked with putative qualitative or quantitative trait locus alleles. EST-SSR markers are, therefore, superior and more informative compared to anonymous markers [7].
The conservation of grass genomes has been comprehensively documented, and comparative genomics has become an important strategy to extend genetic information from model species to species with a more complex genome, as well as between related species with complex genomes [11,32]. As EST-SSR markers are derived from expressed genes, they are more conserved and have a higher level of transferability to related species than anonymous DNA markers. They are, therefore, useful as anchor markers for comparative mapping across species, comparative genomics, and evolutionary studies [23,24,28,29,33,34]. However, the conserved nature of EST-SSRs may also limit their degree of polymorphism. The transferability of SSR loci across species within a genus has in several studies been above 50% [28,29,35,36,37], whereas the transferability of SSR loci across genera was poor [28,35,38,39].
We report on the exploitation of a comprehensive EST collection in L. perenne for SSR identification. The objectives of this study were 1) to analyse the frequency, type, and distribution of SSR motifs in ESTs derived from three genotypes of L. perenne, 2) to perform a comparative analysis of SSR motif polymorphisms between allelic sequences, 3) to conduct a comparative analysis of SSR motif polymorphisms between orthologous sequences of L. perenne, Festuca arundinacea, Brachypodium distachyon, and O. sativa, and 4) to identify functionally associated EST-SSR markers for application in comparative genomics and breeding.
The 25,744 ESTs from the three genotypes of L. perenne were screened for SSRs using the MISA software [28]. As shown in Table 2, a total of 1,458 redundant ESTs containing an SSR were identified from the 25,744 ESTs. Thus, 5.66% of the ESTs contain at least one SSR. Cluster analysis of the EST-SSRs yielded a final number of 955 (3.71%) non-redundant EST-SSRs. The percentage of redundant ESTs containing an SSR in the two genotypes NV#20F1-30 and NV#20F1-39 was 3.56% and 3.66%, respectively, whereas it was 9.97% in the genotype F6. On average, approximately one SSR was found per 10 kb in the genotypes NV#20F1-30 and NV#20F1-39, whereas one SSR was found per 2.7 kb in the genotype F6, corresponding to approximately 26 ESTs per SSR for the two genotypes NV#20F1-30 and NV#20F1-39, and 11 ESTs per SSR for the genotype F6. A total of 133 ESTs had more than one SSR motif, 96 of which were considered the compound type according to the predefined criteria (Table 2).
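The overall percentages reported above follow directly from the EST counts. As a quick check, the following minimal Python sketch recomputes them; the variable and function names are illustrative and are not part of the original analysis.

```python
# Counts taken from the text above (Table 2 of the paper).
TOTAL_ESTS = 25_744
ESTS_WITH_SSR = 1_458           # redundant ESTs containing at least one SSR
NON_REDUNDANT_EST_SSRS = 955    # after cluster analysis


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


redundant_pct = percent(ESTS_WITH_SSR, TOTAL_ESTS)
non_redundant_pct = percent(NON_REDUNDANT_EST_SSRS, TOTAL_ESTS)

print(f"ESTs containing an SSR: {redundant_pct}%")
print(f"Non-redundant EST-SSRs: {non_redundant_pct}%")
```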
In the datasets from the genotypes NV#20F1-30 and F6, there were significantly (χ²; p < 0.05) more tri-repeat than di- and tetra-repeat SSRs, while in the dataset from the genotype NV#20F1-39, there were significantly (χ²; p < 0.05) more di- and tri- than tetra-repeat SSRs (Figure 1). No significant differences (χ²; p < 0.05) were observed between genotypes with respect to tri- and tetra-repeat SSRs, while the EST-SSRs derived from the genotype NV#20F1-39 contained significantly (χ²; p < 0.05) more di-repeat SSRs than the EST-SSRs derived from the other two genotypes. The frequencies of the SSR motifs (any two complementary sequences considered one motif) are listed in Table 3 for the EST-SSRs from NV#20F1-30, NV#20F1-39, and F6, and in Table 4 for the combined dataset.
In some cases, the frequency of SSR motifs for EST-SSRs varied significantly (χ²; p < 0.05) between the three genotypes (Table 3). In the genotype F6, the SSR motif CCG/CGG was identified in 41.8% of the EST-SSRs but only in 1.4% and 1.2% of the respective EST-SSRs in the genotypes NV#20F1-30 and NV#20F1-39.
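Counting motif frequencies with "any two complementary sequences considered one motif" requires mapping each observed repeat unit to a single representative. The sketch below shows one possible way to do this in Python; it is an assumption about how such grouping can be implemented, not the procedure used in the study. It treats all cyclic rotations of a unit and of its reverse complement as the same motif and picks the alphabetically smallest as the representative. The function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotations(motif: str) -> list[str]:
    """Return all cyclic rotations of a repeat unit (CCG -> CCG, CGC, GCC)."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def canonical_motif(motif: str) -> str:
    """Pick one representative for a unit, its rotations, and their reverse complements."""
    motif = motif.upper()
    return min(rotations(motif) + rotations(reverse_complement(motif)))


def motif_label(motif: str) -> str:
    """Label a motif in the 'CCG/CGG' style used in the text."""
    canon = canonical_motif(motif)
    return f"{canon}/{reverse_complement(canon)}"


# Hypothetical repeat units, for illustration only (not data from the paper).
example_units = ["CCG", "CGG", "GCG", "AG", "TC"]
frequencies = Counter(motif_label(unit) for unit in example_units)
print(frequencies)
```

With this grouping, CCG, CGG, and GCG are all counted under the single label CCG/CGG.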
In silico analysis of allelic and genotypic SSR motif polymorphisms
A total of 521 contigs containing an SSR motif were identified from the 3,195 L. perenne contigs. The individual sequences within each contig were analysed for SSRs, and the results of the SSR searches were subsequently compared within each contig, to identify allelic and/or genotypic polymorphisms at the SSR motif. A total of 22 contigs containing EST sequences with allelic and/or genotypic SSR polymorphisms were identified, corresponding to 2.3% of the non-redundant EST-SSR contigs (Table 5).
In all 22 contigs, the SSR motif polymorphisms identified were changes in the number of repeat units, while no contigs were identified with changes in the repeat type. Most of the SSR motif polymorphisms were one to two repeat unit changes, and the maximum number of repeat unit changes observed was three (Table 5). Two and one allelic SSR polymorphisms were identified in contigs containing EST sequences derived from the genotypes NV#20F1-30 and NV#20F1-39, respectively, while fifteen allelic SSR polymorphisms were identified in contigs containing EST sequences derived from the genotype F6 (Table 5). Comparing SSR motif polymorphisms between NV#20F1-30 and NV#20F1-39 identified two contigs containing genotypic SSR motif polymorphisms. Contig 1520 contains both genotypic and allelic SSR motif polymorphisms, with genotypic SSR motif polymorphism between the genotypes NV#20F1-30 and NV#20F1-39, as well as allelic SSR motif polymorphism between alleles derived from the genotype NV#20F1-39. Contig 0700 contains one allele from each of the three genotypes, with a genotypic SSR motif polymorphism in the allele derived from the genotype NV#20F1-39, while no genotypic SSR motif polymorphisms were identified in alleles derived from the other two genotypes (Table 5).
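The within-contig comparison described above can be illustrated with a small Python sketch. It is not the authors' implementation and rests on two labelled assumptions: a contig is called allelic-polymorphic for a genotype when ESTs from that genotype show more than one repeat count, and genotypic-polymorphic for a pair of genotypes when they share no repeat count at all. The function name and the example data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def classify_contig(observations):
    """Classify SSR length polymorphism within one contig.

    observations: iterable of (genotype, repeat_count) pairs, one per EST.
    Assumed definitions (not taken from the paper):
      allelic   = a genotype whose ESTs show more than one repeat count
      genotypic = a pair of genotypes with no repeat count in common
    """
    counts_by_genotype = defaultdict(set)
    for genotype, repeat_count in observations:
        counts_by_genotype[genotype].add(repeat_count)

    allelic = sorted(g for g, counts in counts_by_genotype.items() if len(counts) > 1)
    genotypic = [
        (a, b)
        for a, b in combinations(sorted(counts_by_genotype), 2)
        if counts_by_genotype[a].isdisjoint(counts_by_genotype[b])
    ]
    all_counts = set().union(*counts_by_genotype.values())
    max_change = max(all_counts) - min(all_counts) if all_counts else 0
    return {"allelic": allelic, "genotypic": genotypic, "max_change": max_change}


# Hypothetical contig: one EST from NV#20F1-30, two differing ESTs from NV#20F1-39.
example = [("NV#20F1-30", 5), ("NV#20F1-39", 6), ("NV#20F1-39", 7)]
result = classify_contig(example)
print(result)
```

Other definitions of a genotypic polymorphism are possible, for example requiring only that the sets of repeat counts differ; the disjointness rule is simply the more conservative choice for a sketch.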

In silico analysis of the conservation of SSR motifs between four species of the Poaceae family
Molecular markers designed to the transcribed region of the genome are often transferable among related species, because gene sequences remain highly conserved during evolution. Molecular markers designed to the transcribed region of the genome can thus be used to construct comparative genetic maps, facilitating the study of synteny conservation, and co-linearity among related genomes.  [40]. All alignments were analysed for SSR motif polymorphisms between the four species (Table 6).
In six of the 19 clusters (31%), there were no polymorphisms at the SSR motif between the sequences of the two closely related species L. perenne and F. arundinacea. The most frequent SSR motif polymorphisms between these two species were changes in the number of repeat units, corresponding to 21% of the clusters. However, nucleotide substitutions, additions, and complete loss of SSR motifs were also observed (Table 6). None of the SSR motifs identified in L. perenne was completely conserved in B. distachyon. In six clusters (31%), the SSR motif was completely lost in B. distachyon, and in four clusters (21%) the B. distachyon SSR motif had fewer repeat units. In these four clusters, the B. distachyon SSR motif contained two to three fewer repeat units than the corresponding L. perenne SSR motif. Nucleotide substitutions and additions were observed in five (26%) of the nineteen compared orthologous sequences (Table 6).

Discussion
The present study was designed to create an SSR database of the transcribed region of the L. perenne genome by identification of SSRs in a dataset consisting of 25,744 ESTs.
However, the differences observed in the frequencies of SSR motifs might not only be genotypic differences, but could also be due to the different cDNA libraries established for the three genotypes, because the composition of expressed genes likely differs between the thirteen cDNA libraries selected for EST development. NV#20F1-30 and NV#20F1-39 are full-sibs [6], and most of the differences in SSR motif frequencies between these two genotypes can, therefore, be attributed to differentially expressed genes in the different cDNA libraries selected for EST development. Comparing the frequencies of SSR motifs in ESTs developed from four cDNA libraries of NV#20F1-30 with three libraries of NV#20F1-39 revealed no significant differences in frequencies of SSR motifs between these two genotypes. Thus, the variation in the frequency of SSR motifs can most likely be attributed to genotypic differences between F6 on the one hand and NV#20F1-30 and NV#20F1-39 on the other. However, because most of the NV#20F1-30 and NV#20F1-39 ESTs are from leaf cDNA libraries, whereas the majority of ESTs from F6 come from a root cDNA library, the possibility cannot be completely ruled out that the root cDNA library and other cDNA libraries prepared from the genotype F6 contain more SSRs.
The average frequency of 3.71% non-redundant SSRs in the transcribed region of the L. perenne genome is within the same range as previously reported for other plant species [14,23,41,42,43]. However, caution should be exercised when SSR frequencies are compared between different plant species, because of differences in the SSR search parameters.
Approximately 96% of all SSRs analysed were shorter than 21 bp, indicating that the length of SSR motifs in the transcribed region of the L. perenne genome is size-restricted. In addition, 6 bp di-repeats comprise 40 to 64% of the di-repeats in the three genotypes, indicating that di-repeats which do not perturb the open reading frame are preferred over others. The expansion of SSR repeats in transcribed regions of the genome is limited by functional and evolutionary constraints [44,45], because longer repeats have higher mutation rates and are, thus, less stable [20,46]. Short SSRs are probably generated by random mutations and then expanded by DNA polymerase slippage. Thus, the base composition of a sequence that precedes the evolution of SSRs is expected to influence SSR density [47,48]. The higher frequency of SSRs in the transcribed region of the genotype F6 could indicate that the genome of this genotype is more prone to mutations and/or DNA polymerase slippage than the genomes of the other two genotypes. This suggests that there might be genotype-specific cellular factors that interact with SSR motifs and play an important role in generating short tandem repeats [49].
Previous studies have shown that tri-nucleotide repeats predominate in coding regions of plant genomes [12,50], as well as in other genomes of higher eukaryotic organisms [45,51,52], because expansions or deletions in coding regions can be tolerated for tri- and hexa-nucleotide unit repeats, which do not perturb reading frames [53]. In L. perenne, the most common SSR repeat units were also found to be tri-nucleotide repeats, constituting between 59 and 85% of the repeats in the three genotypes included in this study, while di- and tetra-nucleotide units constitute the majority of the remaining motifs. Only a few penta- and hexa-nucleotide repeat units were identified. A wide variety of tri-nucleotide repeat units were represented at high percentages; however, the abundance of the different types of repeat units differed, especially between the genotype F6 and the two other genotypes. The repeat motif (CCG/CGG)n was present in 42% of EST-SSRs from the genotype F6, while it was represented at a low frequency of approximately 1% in the other two genotypes.
In the two genotypes NV#20F1-30 and NV#20F1-39, the most abundant repeat encodes the amino acid threonine, while the most abundant repeat in the genotype F6 encodes the amino acid proline. Analysis of all protein sequences from the SWISS-PROT database for single amino acid repeats, tandem oligo-peptide repeats, and periodically conserved amino acids showed that repeats of glutamine, serine, glutamic acid, glycine, and alanine seem to be fairly well tolerated in many proteins [54]. Of these amino acids, only serine was found in the tri-nucleotide repeats of L. perenne, while the other amino acid residues were not represented. The presence of SSRs in transcripts of genes suggests that they may have a role in gene expression or function. In O. sativa, the length of a poly(CT) SSR in the 5'-untranslated region of the waxy gene is associated with amylose content [55], and in Z. mays, an SSR in the 5'-untranslated region of some ribosomal genes has been suggested to be involved in the regulation of fertilization [56].
A total of 22 contigs containing EST sequences with allelic and/or genotypic SSR polymorphisms were identified, corresponding to 2.3% of the non-redundant EST-SSR contigs. The remaining 499 contigs (97.7%) contained no SSR motif polymorphism, indicating a selection against length polymorphisms in the transcribed region of the L. perenne genome. In all contigs containing an SSR motif polymorphism, the polymorphisms identified were changes in the number of repeat units, while no contigs were identified with changes in the repeat type or complete loss of the SSR motif. The majority of the SSR polymorphisms were allelic polymorphisms, and most of the SSR motif polymorphisms were one to two repeat unit changes. All polymorphisms identified, except for polymorphisms in compound SSRs, were changes in the number of repeat units, while no single nucleotide additions or deletions, which would otherwise perturb the open reading frame, were identified.
Several studies have shown that SSRs developed for one species can be used in related plant species, and that the success of cross-species amplification depends on the evolutionary relatedness [57]. The availability of the O. sativa genome sequence provides a rich source of molecular information [58]. In contrast, this type of information is limited for most forage and turf grass species. Comparative mapping can make use of the genomic information available for O. sativa by applying this knowledge to less studied forage and turf species.
The transferability of the L. perenne SSR markers between species of the Poaceae family was evaluated in silico, to assess whether the SSRs can be used as anchor markers for comparative mapping and evolutionary studies. SSRs designed from EST sequences are especially valuable owing to their genome location, which implies constraints on length, motif, abundance, and flanking regions, the latter of particular interest in this context, because common primers can be designed to conserved flanking regions. However, before primers are designed, it is necessary to evaluate whether the SSR motif is conserved between related species, and therefore useful for SSR marker development.
[...] and G05_132_R1 (CTTGCTCTTGTCCGAATCGT). PCR and electrophoresis were performed as described previously [6].
[...] estimating how large the chance is of finding SSR motifs, as a prerequisite for a polymorphic marker, in closely as well as distantly related species.
With the L. perenne EST-SSRs presented in this paper, a valuable tool has been developed for further genetic, genomic, and plant breeding applications at the intra- as well as the inter-species level.

Conclusion
In this study, we present a comprehensive set of publicly available EST-derived SSRs from three genotypes of Lolium perenne, one of the major grass species used for turf and forage in the temperate regions.
A total of 955 non-redundant SSRs were detected in silico using clustered and assembled EST data. Tri-nucleotide repeats were the most abundant type of repeats, followed by di- and tetra-nucleotide repeats. Approximately 96% of all SSRs identified were shorter than 21 bp, indicating that the length of SSR motifs in the transcribed region of the L. perenne genome is size-restricted.
A large variation in the number of SSRs in transcribed regions of the three genotypes was observed, ranging from one SSR per 10.9 kb in genotype NV#20F1-30 to one SSR per 2.7 kb in the genotype F6. This result suggests that several genotypes should be screened to find the best genotype for SSR discovery in transcribed sequences.
All allelic SSR polymorphisms identified within L. perenne were changes in the number of repeat units. When comparing SSR motifs from L. perenne to SSR motifs in orthologous sequences from F. arundinacea, B. distachyon, and O. sativa, both changes in the number of repeats and complete loss of the SSR motifs were observed. Comparing orthologous sequences of L. perenne and F. arundinacea revealed that the most frequent SSR motif polymorphisms between these two species were changes in the number of repeat units, corresponding to 21% of the clusters, while there were no SSR polymorphisms in 31% of the analysed clusters. Thus, the EST-SSRs are suitable for synteny studies between these two species.
In contrast, none of the SSR motifs identified in L. perenne was completely conserved in the more distantly related species B. distachyon and O. sativa. In 31% of the clusters the SSR motif was completely lost in B. distachyon, and in 21% the SSR motif had fewer repeat units. This suggests that the EST-SSRs are less suitable for synteny studies outside the Lolium/Festuca complex.
With the EST-SSR set, a valuable tool has been made publicly available for numerous further genetic and genomic applications at the intra- and inter-species level.

Library construction and DNA sequencing
Thirteen directional cDNA libraries were constructed from a range of tissues and developmental stages (Table 1). Tissues were obtained from three different L. perenne genotypes: NV#20F1-30, NV#20F1-39 [6], and F6 (DLF-Trifolium Ltd.). The two genotypes NV#20F1-30 and NV#20F1-39 are F1 offspring (full-sibs) of a cross between two genotypes from the variety Veyo and the ecotype Falster, respectively, and thus have the same heterozygous parents [6].
RNA was isolated using TRI Reagent® (Sigma-Aldrich, St. Louis, MO, USA), and the cDNA libraries were constructed using the Creator™ SMART™ cDNA Library Construction Kit (BD Biosciences, Palo Alto, CA, USA), according to the manufacturer's instructions. The cDNAs were cloned directionally into the asymmetric SfiI sites of the pDNR-LIB vector, transformed into electrocompetent DH10B T1-phage-resistant Escherichia coli cells (Invitrogen, Carlsbad, CA, USA), and robotically arrayed into 384-well plates. A total of 31,379 random clones were subjected to single-pass sequencing reactions from the 5' end using BigDye® Terminator v3.1 sequencing chemistry and analyzed on an ABI Prism 3700 DNA Analyzer (Applied Biosystems, Foster City, CA, USA). Colony picking and sequencing were performed by MWG Biotech (Ebersberg, Germany). Base calling, vector trimming, removal of low-quality bases, and clustering and assembly of the ESTs were performed using the PHRED and PHRAP/CROSS_MATCH software packages [60][61][62]. Sequences with fewer than 100 bases of PHRED quality ≥ 20 after trimming were discarded. A complete description of the cDNA library construction methods will be reported elsewhere.
[...], and annotated in terms of the associated biological processes, cellular components, and molecular functions using the Gene Ontology vocabulary.
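The length filter described in this section (discarding sequences with fewer than 100 bases of PHRED quality ≥ 20 after trimming) can be sketched as follows. This is an illustrative Python fragment, not the PHRED/PHRAP pipeline that was actually used; the function names are hypothetical, and it assumes that per-base quality scores are already available as a list of integers. The conversion from a PHRED score Q to an error probability, P = 10^(-Q/10), is the standard definition.

```python
def phred_to_error_probability(quality: int) -> float:
    """Standard PHRED definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


def passes_quality_filter(qualities, min_quality=20, min_bases=100):
    """Keep a trimmed read only if it has at least `min_bases` bases
    whose PHRED score is `min_quality` or higher (thresholds from the text)."""
    high_quality_bases = sum(1 for q in qualities if q >= min_quality)
    return high_quality_bases >= min_bases


# Hypothetical reads, for illustration only.
good_read = [30] * 120
poor_read = [20] * 99 + [12] * 200

print(passes_quality_filter(good_read))
print(passes_quality_filter(poor_read))
```

A PHRED score of 20 corresponds to an estimated base-calling error probability of 1 in 100.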

EST database and identification of EST-SSRs
The Perl script MIcroSAtellite (MISA) [28] was used to identify SSRs in the L. perenne EST sequences. The parameters for the SSR search were defined as follows. The size of motifs was two to six nucleotides, and the minimum number of repeat units was defined as six for di-nucleotides and four for tri-, tetra-, penta-, and hexa-nucleotides. Compound SSRs were defined as ≥ 2 SSRs interrupted by ≤ 50 bases.
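To make the search parameters concrete, the following Python sketch applies the same thresholds with regular expressions. It is not MISA and does not reproduce its output format or all of its behaviour; it is a minimal illustration under the stated parameters (motif sizes two to six, at least six di-nucleotide repeats, at least four repeats for longer units, and compound SSRs separated by at most 50 bases). The names `SSR`, `find_ssrs`, and `group_compound`, and the example sequence, are hypothetical.

```python
import re
from typing import NamedTuple

# Minimum number of repeat units per motif length, as given in the text.
MIN_REPEATS = {2: 6, 3: 4, 4: 4, 5: 4, 6: 4}
MAX_COMPOUND_GAP = 50  # maximum number of bases between SSRs in a compound SSR


class SSR(NamedTuple):
    start: int    # 0-based start position
    end: int      # end position (exclusive)
    motif: str    # repeat unit
    repeats: int  # number of repeat units


def _is_redundant(motif: str) -> bool:
    """True if the unit is itself a repeat of a shorter unit (AAA, ATAT, ...)."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(sequence: str) -> list[SSR]:
    """Find perfect SSRs that satisfy the minimum repeat thresholds."""
    sequence = sequence.upper()
    hits = []
    for unit_len, min_repeats in MIN_REPEATS.items():
        # One unit captured, followed by at least (min_repeats - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if _is_redundant(motif):
                continue
            repeats = (match.end() - match.start()) // unit_len
            hits.append(SSR(match.start(), match.end(), motif, repeats))
    return sorted(hits)


def group_compound(hits: list[SSR], max_gap: int = MAX_COMPOUND_GAP) -> list[list[SSR]]:
    """Group position-sorted SSRs whose separation is at most `max_gap` bases."""
    groups: list[list[SSR]] = []
    for hit in hits:
        if groups and hit.start - groups[-1][-1].end <= max_gap:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


# Hypothetical sequence: (AG)6 and (CCG)4 separated by a 10-base spacer.
example = "TTGACA" + "AG" * 6 + "TCGATGCATC" + "CCG" * 4 + "TTA"
example_hits = find_ssrs(example)
example_groups = group_compound(example_hits)
print(example_hits)
```

In this example the two SSRs lie 10 bases apart, so they fall into one group and would be reported as a compound SSR under the ≤ 50 base criterion. Because results depend on these thresholds, SSR frequencies obtained with different search parameters are not directly comparable.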