Functionally relevant microsatellites in sugarcane unigenes

Background Unigene sequences constitute a rich source of functionally relevant microsatellites. The present study was undertaken to mine the microsatellites in the available unigene sequences of sugarcane for understanding their constitution in the expressed genic component of its complex polyploid/aneuploid genome, assessing their functional significance in silico, determining the extent of allelic diversity at the microsatellite loci and for evaluating their utility in large-scale genotyping applications in sugarcane. Results The average frequency of perfect microsatellite was 1/10.9 kb, while it was 1/44.3 kb for the long and hypervariable class I repeats. GC-rich trinucleotides coding for alanine and the GA-rich dinucleotides were the most abundant microsatellite classes. Out of 15,594 unigenes mined in the study, 767 contained microsatellite repeats and for 672 of these putative functions were determined in silico. The microsatellite repeats were found in the functional domains of proteins encoded by 364 unigenes. Its significance was assessed by establishing the structure-function relationship for the beta-amylase and protein kinase encoding unigenes having repeats in the catalytic domains. A total of 726 allelic variants (7.42 alleles per locus) with different repeat lengths were captured precisely for a set of 47 fluorescent dye labeled primers in 36 sugarcane genotypes and five cereal species using the automated fragment analysis system, which suggested the utility of designed primers for rapid, large-scale and high-throughput genotyping applications in sugarcane. Pair-wise similarity ranging from 0.33 to 0.84 with an average of 0.40 revealed a broad genetic base of the Indian varieties in respect of functionally relevant regions of the large and complex sugarcane genome. Conclusion Microsatellite repeats were present in 4.92% of sugarcane unigenes, for most (87.6%) of which functions were determined in silico. High level of allelic diversity in repeats including those present in the functional domains of proteins encoded by the unigenes demonstrated their use in assay of useful variation in the genic component of complex polyploid sugarcane genome.


Background
Sugarcane (Saccharum sp.) is a complex polyploid belonging to the family Poaceae of the tribe Andropogoneae. It is an important commercial sugar producing crop and a source of approximately 50% of the world's sugar and alcohol. The polyploid/aneuploid nature with variation in chromosome number has been largely responsible for its genetic and taxonomic complexity [1]. Characterization of such large genomes is greatly facilitated by the use of molecular markers.
Microsatellite or simple sequence repeat (SSR) markers are being preferred because of their co-dominant inheritance, reproducibility, multi-allelic nature, chromosomespecific location and wide genomic distribution. These markers are amenable to high throughput genotyping due to multiplexing and efficient resolution of amplicons by automated fragment analysis [2,3].
In sugarcane, Cordeiro et al. (2000) [2] and Parida et al. (2009a) [4] developed a large number of microsatellite markers from the genomic sequences. Pinto et al. (2004; [5,6] and more recently Oliveira et al. (2009) [7] also designed such markers from the sugarcane ESTs, which were used for constructing high resolution functional genetic linkage map of Saccharum spp. [8]. However, information on a limited number of these genic microsatellite markers is available in the public domain. Recently, the EST sequences have been assembled into unigenes [9], which is expected to provide non-redundant, locus specific and novel gene-based functional markers for sugarcane having a large genome not amenable to complete sequencing. The unigene sequences of sugarcane have not yet been analyzed for their microsatellite constitution and compared with the other small genome members of the grass family.
In India, systematic breeding of sugarcane has resulted in the development of a number of varieties with high productivity and stress tolerance by inter-specific hybridization [10]. However, the genetic base of modern Indian sugarcane cultivars is considered narrow due to use of a limited number of parental species clones in cross hybridization and repeated intercrossing of hybrids [11]. Understanding the extent of natural variation at molecular level is essential to develop new strategies for sugarcane improvement. Earlier, molecular markers such as RAPD (Randomly Amplified Polymorphic DNA), AFLP (Amplified Fragment Length Polymorphism), and maize and sugarcane genomic microsatellites have been used for this purpose [4,[12][13][14][15]. No effort has yet been made to understand the genetic diversity of Indian sugarcane cultivars based on functionally relevant genic regions of its complex genome.
The present study was undertaken to mine the available unigene sequences of sugarcane (Saccharum sp.) to understand the microsatellite structure and distribution in the expressed genic component of the genome, assess their functional significance in silico, design primers from the flanking regions of the identified microsatellites, assess the efficiency of a set of fluorescent dye labeled primers in genotyping using automated fragment analysis system and determine functional diversity among different species, related genera and Indian varieties of sugarcane.

Frequency, distribution and organization of microsatellites in sugarcane unigenes
The type, frequency and relative distribution of the microsatellites in the unigene sequences of sugarcane are given in Table 1. The perfect microsatellite (excluding the mononucleotides) frequency in the unigenes of sugarcane was one in every 10.9 kb and the proportion of microsatellite carrying unigenes was 3.7% (584 out of 15,582). When 1,871 (12%) mononucleotide microsatellites were included, the proportion of unigenes carrying microsatellites increased to 17.4%. The mononucleotides in sugarcane showed a strong bias (84.6%) towards A/T repeat, with the majority (89%) being 9 to 30 bases long and the remaining (11%) extending up to 69 bases (T 69 ).
Out of 841 perfect microsatellites identified in sugarcane, 587 (69.8%) were found in the ORFs and the remaining were present either in the 3'UTRs (102, 12.1%) or in the 5'UTRs (152, 18.1%). The trinucleotide repeat motifs were significantly more frequent (about 86%) in the ORFs. In contrast, the GA-rich dinucleotide repeat motifs were more in the 5' (49%) and 3' (32%) UTRs. The density of longer motif containing perfect class I microsatellites was one in every 44.3 kb sequences, which accounted for 24.6% (207) of the total 841 microsatellites identified (Table 1, Figure 1).

Development of unigene derived microsatellite (UGMS) markers and evaluation of their polymorphic potential
The primer pairs could be designed for 810 perfect microsatellites that was 96.3% of the total microsatellites (841) identified in the present investigation. The primer sequences flanking all the perfect UGMS including 207 class I microsatellites with their Tm values and product sizes are given in the Additional file 3. Besides, the primer sequences for 151 compound class I UGMS were designed and provided in the Additional file 4. To validate the UGMS markers, 176 primer-pairs designed from different microsatellite containing unigenes were used in PCR amplification (see Additional file 3). One hundred sixty seven (94.9%) of these produced clear and reproducible amplicons, whereas remaining nine (5.1%) did not give amplification in the S. officinarum from which sequence the primers were designed. To verify that the primers did amplify the expected microsatellite repeat-motifs, the amplified products obtained with 19 of the primers in all the Saccharum species and related genera as well as cereals were sequenced and the presence of the target microsatellite motifs as well the flanking sequences was observed in all the cases (see Additional file 5).
In silico polymorphism analysis was confined to Saccharum officinarum and five cereal species namely, rice, wheat, maize, Sorghum and barley for which unigene sequences were available in the database. The polymorphism based on variation in microsatellite repeat length was observed for a maximum of 163 primers (46.6%) between sugarcane and barley followed by 161 (42.5%) between sugarcane and wheat and least (92, 17.8%) between sugarcane and Sorghum (see Additional file 6). The actual level of polymorphism detected by automated fragment analysis using 47 fluorescent dye labeled primers was much higher than that based on in silico analysis although the trend was maintained. S. officinarum had maximum polymorphism with barley (92.7%) followed by wheat (90.6%), rice (85.8%), maize (67.7%) and Sorghum (61.4%). Forty-three (91.5%) of the 47 primers detected polymorphism (mean PIC of 0.85) among the 41 genotypes belonging to Saccharum  Comparative distribution of long hypervariable class I and potentially variable class II microsatellite repeats in the unigenes of sugarcane. Trinucleotide was the most abundant repeat-motif in both class I (56.5%) and class II (79%) category, which was followed by dinucleotide motifs. Step-wise UGSuM33c 2 431-491 Step-wise Step-wise Due to high polyploidy and heterozygosity of the Saccharum genome, most (28,65.1%) of the primers amplified two to three different loci (each locus being designated as an UGMS marker) yielding a total of 60 loci with the number of alleles per locus ranging from 2 to 22 and the allele size varying from 82 to 408 bp across sugarcane species, genera and varieties ( Figure 2) as well as five cereal species. The remaining 15 (34.9%) primers gave single locus amplification ( Table 2) with 1 to 23 alleles each. Overall, the 43 polymorphic primers amplified an average of 7.42 alleles per UGMS marker locus with a total of 722 alleles across 74 loci. Forty-two (73 UGMS marker loci with 720 alleles, 97.7% polymorphic, mean PIC of 0.81) of the 43 informative primers revealed polymorphism between sugarcane species and related genera, whereas 37 (66 UGMS marker loci with 678 alleles, 86% polymorphic, PIC of 0.74) primers detected polymorphism among the Indian commercial sugarcane varieties ( Figure 3). Twenty-six (34.7%) of the 75 UGMS marker loci showed heterozygosity in different genotypes used. Maximum heterozygous loci was detected in five sugarcane species (14, 18.7%) followed by commercial varieties (9, 12%) and minimum (4, 5.3%) in the three related genera and five cereal species.

Molecular basis of UGMS polymorphism and its functional significance
Comparison of fragment size (bp) of variant alleles amplified at 75 polymorphic UGMS loci with changes in number of repeats among Saccharum complex, varieties and five cereal species revealed purely step-wise allelic distribution for 51 (68%) loci, while remaining 24 (32%) loci showed mixed type of allele size distribution ( Figure  2). To elucidate the distribution pattern of UGMS alleles showing fragment length polymorphism, 10 amplified size variant alleles each of two microsatellite primers namely, UGSuM2 and UGSuM27 showing both stepwise and mixed type of allele distribution were sequenced. High quality sequence alignment revealed that the size variant alleles showing step-wise distribution contained the expected microsatellite repeat sequences with conserved primer binding sites, but corresponded exactly to the step-wise multiples of the number of repeat units. The size variation of sequenced alleles was explained by differences in the number of repeat units. Mixed type of allele distribution resulted from both variation in the number of repeat-units and insertions/deletions in the sequences flanking the UGMS repeats. For instance, UGSuM2 ( Figure 4A) and UGSuM27 ( Figure 4B) showing both step-wise and mixed type allele size distribution had the expected (AT) n and (GGC) n UGMS motifs, respectively with varying number of repeat-units in different sugarcane species, genera, varieties and cereals along with a small stretch of nucleotide insertion/deletion in the flanking regions of target microsatellites.
In order to assess the functional significance of the UGMS, gene ontology classification of 767 unigenes carrying perfect and compound microsatellites was carried out. Six hundred seventy-two (87.6%) of these (see Additional files 3 and 4) could be functionally annotated and were shown to encode enzymes for sugar metabolism (39%), structural proteins (26%), transcription and translation factors (18%), signal transduction pathway proteins (8%), cell growth and development factors (5%) and disease resistance proteins (4%) (see Additional file 7). Three hundred sixty-four (54.2%) of 672 unigenes contained microsatellite repeats in 89 different functional domains of proteins encoded by these unigenes (see Additional file 8) which included cytochrome b/c oxidase, chlorophyll A/B binding, FAD and ubiquitin binding, sucrose synthase, alpha-amylase, acetyl COA, aldehyde dehydrogenase, glyceraldehyde-3-phosphate, lipase, S-adenosylmethionine, leucine rich repeat (LRR), ferritin, protein kinase, chitinase, basic leucine zipper (bZIP), zinc finger, TATA box, Myb and WRKY DNA binding domains ( Figure 5). Fifteen primers designed from 15 different unigenes targeting such functional domains (see Additional file 9) that gave single locus amplification and step-wise allele distribution (Table 2) were selected for validation. All of these primers gave fragment length polymorphism among Saccharum complex, varieties and five cereals based on variation in the number of UGMS repeat-units within the functional domains. One of these primers (UGSuM26) showing polymorphism targeted the amylase catalytic domain of β-amylase ( Figure 6). To understand the possible biological significance of the variable UGMS repeat in the amylase domain, the structure predicted from the aminoacid sequences of the functional domain was analyzed in silico. It revealed variation in the active binding site involved in formation of the Ca 2+ ligand complexes due to the presence of variable number of repeats encoding Leucine-Serine aminoacid residues ( Figure 6). This would most likely alter the function of amylase gene in respect of carbohydrate metabolism in sugarcane. Similarly, the expansion/contraction of microsatellite motifs (AG) n encoding Arginine-Glutamine aminoacid residues was observed in the protein kinase functional domain of another sugarcane unigene which was amplified by the primer UGSuM17. In silico analysis suggested a novel secondary protein structure and catalytic domain binding site in the variant form that was different from the native form which contained (AG) 18 microsatellite repeats (see Additional file 10).

Assessment of functional genetic diversity
The pair-wise similarity among 36 genotypes belonging to Saccharum species, related genera and 28 tropical  The genetic relationship among the Saccharum species, related genera, and tropical and sub-tropical varieties is depicted in Figure 7. The UGMS markers clearly discriminated all the 36 genotypes from each other and resulted in a definitive grouping among different genera and species of Saccharum with high bootstrap values (81 to 100) that corresponded well with their known pedigree relationships. All the clones of five Saccharum species were included in the major cluster I in which S. officinarum and S. robustum clones were sub-clustered (Ia) with 98% occurrence in boot strap analysis; S. barberi and S. sinense clones grouped together in a separate sub-cluster (Ib) with 100% occurrence and S. spontaneum grouped distinctly from these sub-clusters with 92% occurrence. The three related genera of Saccharum being highly divergent from the Saccharum species and varieties, grouped in a separate cluster (III) with 98% occurrence.
All the tropical and sub-tropical varieties were included in a distinct and separate cluster (II) with seven distinct sub-clusters ( Figure 7) with high (88) bootstrap value. The tropical variety Co 62175 grouped in a major sub-cluster IIa with its tropical male parent Co 419 (similarity 0.71) with a bootstrap value of 99%. In this sub-cluster (IIa), the other tropical varieties namely, Co 8021, Co 8371, Co 7704 and Co 6304 were also grouped together with average similarity of 0.77 (supported by 93% occurrences in bootstrap) possibly because of their common ancestry involving Co 419. The two sub-tropical varieties, Co 8347 and Co 87263, and three other sub-tropical varieties, CoPant 84211, Co 89003 and Co 7717, in spite of having a common ancestor Co 775, were included in two distantly placed major sub-clusters; IIb and IIg, respectively. It could be due to inclusion of two diverse sub-tropical varieties, Co 87268 and CoLk 8102 having common parent Black Cheribon within the sub-cluster IIb. Similarly, the clustering of two tropical varieties namely, CoC 671 and Co 87025 in the sub-cluster IIc with average similarity of 0.69 (supported by 100% occurrences) and five tropical varieties, Co 7219, Co 740, Co 86010, Co 86032 and Co 85002 together in another separate large sub-cluster (IIf) with average similarity of 0.64 (supported by 92%

Discussion
Unigene resources representing the transcriptome of an organism provide opportunities to understand the sequence organization in the genic regions of complex polylploids and allow design of sequence based robust markers for various genotyping applications. Considering the availability of a large unigene database of sugarcane in public domain, this resource was studied for the presence and functional relevance of different microsatellite repeats. The frequency of microsatellites in sugarcane unigenes (1/10.9 kb) was lower than that obtained in rice, Sorghum, barley and maize (1/3.6 kb, 1/5.9 kb, 1/8.9 kb and 1/9 kb, respectively) but similar to wheat (1/10.6 kb). In polyploids, extensive loss of duplicated genes and chromosomal rearrangement after polyploidization possibly have resulted in shortening and loss of microsatellite repeat-motifs from the genic coding sequences leading to dosage compensation, developmental stability and functional plasticity [16,17]. It would be of interest to explore the mechanism that restricts repeat expansion in genic regions of wheat and sugarcane leading to lower microsatellite frequency in these species.
The observed frequency of the mononucleotides (12%) in sugarcane was much less than that reported earlier [18] in the unigenes of maize (75.8%), wheat (71%), barley (42.4%) and rice (41.6%), but similar to Sorghum (13%). This suggested a lack of correspondence with genome size and ploidy. The presence of longer (69 bases or more) A/T mononucleotide repeat-motifs in sugarcane is comparable to earlier observations on the nature and the frequency of such repeats in the genomic and EST sequences of cereals [19]. The abundance of trinucleotide microsatellite motifs in the sugarcane unigenes is consistent with the earlier observations based on EST sequences of sugarcane [3,5] and unigene sequences of cereals [18]. The limited expansion of nontriplet microsatellites in the unigenes of sugarcane could be due to selection against frameshift mutations in the coding regions resulting from length changes in non-triplet repeats. We observed that more than 50% of the identified trinucleotide repeat motifs in sugarcane were GC-rich, possibly due to their high GC content [3,5] and consequent usage bias in the coding sequences [20]. The abundance of GC-rich trinucleotide repeat motifs coding for small and hydrophilic aminoacid alanine and GA-rich dinucleotide motifs in sugarcane coding sequences parallels their abundance in cereal exons [18]. The GA-rich dinucleotide UGMS motifs were observed particularly in the 5' and 3' (32%) UTRs with balanced (46 to 52%) GC content and therefore would support better amplification as polymorphic markers [21].
The unigenes being longer in higher quality sequences offer advantages over the EST sequence resources for the development of microsatellite markers. In the present study, 961 (810 perfect and 151 compound) primerpairs were designed from the 767 different unigene sequences carrying microsatellite repeats in the expressed component of the sugarcane genome. The primers designed from the unigene sequences flanking the microsatellite motifs were highly efficient with amplification success rate of 94.9% that suggested the utility of the unigene database in designing sequence based robust genic markers. The paucity of usable and robust sequence based markers in sugarcane has been a major limitation in genetic analysis in this important sugar crop. The UGMS markers developed by us have been placed in the public domain and thus would be immediately useful in various genotyping applications in sugarcane.
For most (87.6%) of the unigene sequences from which the primers were designed, putative functions have been predicted. For instance, about 39% and 4% of the primers were from sequences related to sugar metabolic enzymes and disease resistance, respectively. Interestingly, 54% of the microsatellite repeats were present in various functional domains of proteins encoded by the unigenes. Correlation between fragment length polymorphism due to variable number of UGMS repeat-units in the functional domains and alteration of the predicted protein structure and active ligand binding sites suggested functional relevance of the genic microsatellites. The evolutionary and adaptive advantages of such variable microsatellite repeats affecting the structure and function of the encoded proteins to generate favorable alleles for relaxation of environmental stress impact under the action of high natural selection pressure through modulation of mutation/ recombination in these loci have been reported in many eukaryotes [22]. For instance, in Saccharomyces cerevisiae, the expansion and contraction of microsatellite repeats in the coding regions of protein kinase genes have provided greater adaptability to various abiotic and biotic stresses [22]. Sugarcane is a tropical crop. However, varieties adapted to subtropical condition have been developed in India. Variation in microsatellites in the coding regions of these two groups was evident. For example, the tropical sugarcane varieties Co7219 and Co740 contained (AG) 18 microsatellite repeats whereas varieties Co89003 and Co7717 adapted to sub-tropical condition showed contraction of AG repeats to (AG) 10 leading to alteration of protein structure and possibly function. Understanding the adaptive significance of such variation is of relevance that needs further experimentation. Designing of genic microsatellite markers targeting functional domains in solanaceous crops like tomato and pepper [23] was reported. Association of such markers with many traits including diseases like neuronal disorders [22] and cancers [24] in humans based on expansion/contraction of repeated tracts of microsatellites encoding aminoacid residues in the functional domains of proteins that changes their secondary structure and function has been demonstrated. The information generated in this study thus would allow selection of candidate gene-based markers for rapidly establishing marker-trait linkages in sugarcane.
The efficiency of sugarcane UGMS markers to detect polymorphism within Saccharum complex and among five cereals was evaluated in silico that was further validated experimentally. In silico polymorphism based on variation in length of the UGMS repeat among S. officinarum and each of five cereal species could be due to divergent microsatellite evolution in these lineages [25]. The actual level of polymorphism detected by automated fragment analysis using fluorescent dye labeled primers was much higher than that based on in silico analysis. The expected relationship in evolution was reflected by actual polymorphism observed between sugarcane and cereals. With the use of automated fragment analysis system, all the allelic variants could be captured efficiently in a set of 36 sugarcane genotypes and five cereal species, thereby providing a platform for rapid, automated and large-scale genotyping of sugarcane. Further, the allele size information generated would enable multiplexing of the UGMS markers, thus making them useful in various high-throughput genotyping applications. The observed polymorphism (97.7%) among the members of the Saccharum species and related genera is higher than that estimated earlier using the fluorescent dye labeled sugarcane genomic (90%, [2]) and EST derived microsatellite markers (81%, [3]). The higher potential of UGMS markers particularly the longer class I di-and tetra-nucleotide repeat-motifs to detect polymorphism as compared to the class I trinucleotide and class II motifs reflected correspondence between the type and length of repeats, and the level of polymorphism as observed earlier in sugarcane [4] and rice [21,26]. Higher polymorphic potential of UGMS markers derived from the UTRs than that from the conserved coding sequences, which are constrained by purifying selection [25,27] suggested the utility of unigenes having such repeat-motifs as a source of polymorphic microsatellite markers. The level of inter-varietal polymorphism (86%, mean PIC of 0.74) detected by the fluorescent dye labeled primers was higher than the level reported previously with the labeled sugarcane EST derived (38%, PIC of 0.23, [3]) microsatellite markers. The discrepancies could be due to use of different sets of markers and genotypes. However, genotyping using the automated fragment analysis system in this study revealed comparable level of polymorphism with the genic and genomic microsatellite markers in sugarcane. It suggested that the genic microsatellite markers developed in this study would be highly informative and useful in sugarcane genetics, genomics and breeding.
Unigene sequences usually have advantages of unique identity and position in the transcribed regions of the genome. If this is the case then primers designed from the unigene sequences flanking the microsatellite repeats should amplify unique single locus. In contrast, in the present study, multiple loci and thus amplification of multiple sequences of the same gene was observed for 28 (65.1%) of the 43 primers designed from different microsatellite carrying unigenes. This is possibly due to poor representation of unigene sequences in the database that was scanned for microsatellite repeats. Alternatively, all the copies of a microsatellite carrying gene that are PCR amplified might not be transcribed due to dosage compensation leading to silencing of all but one copy in the large polyploid sugarcane genome [16,28]. In spite of high polyploidy and heterozygosity of the sugarcane genome, 15 (34.9%) of the 43 polymorphic UGMS primers amplified a single discrete locus with step-wise allele distribution and showed fragment length polymorphism across the members of Saccharum complex and varieties in an automated fragment analysis system. This could be due to polyploidization followed by selective gene loss [29,16,28] resulting in retention of a single copy during evolution of Saccharum. Gene ontology analysis of these 14 unigenes showed correspondence with the genes coding for basic metabolic process of energy generation, degradation of cellular building blocks, DNA recombination and repair proteins which played a vital role in plant biology and thus remained as single copy without alteration.
Distribution pattern of size variant alleles amplified at UGMS loci showed higher proportionate (68%) distribution of alleles showing step-wise mutation than that of mixed allele distribution (32%). High-quality sequence alignment of the size variant alleles showing both stepwise and mixed distributions confirmed the presence of variable number of repeat-units in different amplified alleles and additional insertions/deletions in the flanking regions of microsatellite repeat-motifs, which contributed to the UGMS fragment length polymorphism observed in the Saccharum complex, varieties and cereal species. Such complex pattern of allele distribution and fragment length polymorphism at microsatellite loci due to variation in the copy number of microsatellite repeats and insertions/deletions in the flanking sequences of microsatellite motifs have been observed earlier in species like maize with a large genome size [30].
It is important to evaluate molecular diversity existing among the members of the Saccharum species and related genera, since major varietal improvement in sugarcane was through inter-specific hybridization. Use of S. spontaneum and Erianthus is vital since they carry many traits of agricultural importance including tolerance to biotic and abiotic stresses [14]. Identification of true inter-specific and inter-generic hybrids is crucial for successful transfer of the target traits. Besides, monitoring of introgression is required to verify the transfer of the target genomic regions from the wild relatives to the cultivated genetic backgrounds. 97.7% of markers revealed inter-specific and inter-generic polymorphism and thus would enable precise identification of interspecific and inter-generic hybrids, and also provide opportunities to assess transfer of genic regions to desirable genetic backgrounds, thereby offering advantages over the random markers such as RAPD and AFLP.
Evaluation of molecular diversity in a set of 28 commercial Indian tropical and sub-tropical sugarcane varieties using unigene based genic microsatellite markers revealed a wider range (0.33 to 0.84 with an average of 0.40) of genetic similarity than the level detected previously with RAPD (0.59 to 0.81 with average of 0.71, [12]), maize microsatellite (0.40 to 0.73 with average of 0.64, [13]), sugarcane genomic microsatellite (0.39 to 0.82 with average of 0.53, [4]) and AFLP (0.52 to 0.83 with average of 0.62, [14]) markers. Hence, high efficiency of UGMS markers in assaying functional diversity in the transcribed component of the sugarcane genome makes them valuable for understanding diversity pattern, and variety identification. The commercial Indian sugarcane varieties used in this study were bred for either tropical or subtropical agro-climatic conditions with differential contribution of S. spontaneum, the most variable Saccharum species [14]. The random markers assaying genetic variation largely in different non-genic genomic regions that contribute to large genome size [31] would be of little relevance to phenotypic selection exercised during variety development. In contrast, the genic microsatellite markers assaying variation in the transcribed non-repetitive regions of the genome might be directly related to the phenotypic variation [25]. Our results thus suggested that the genetic base of the Indian sugarcane varieties is not very narrow particularly in the genic regions of the genome, which is most likely due to selection for wider adaptability of these varieties. The adaptive environment as well as parentage of these varieties corresponded well with the clustering pattern obtained using a small set of these markers, that further suggested the usefulness of the designed markers in realistic assessment of genetic diversity.

Conclusion
The present study identified microsatellites in sugarcane unigenes and assessed their functional relevance. A total of 961 primer-pairs were designed targeting 767 different unigenes carrying microsatellite repeats in the expressed component of the sugarcane genome, which would extend the accessibility of such microsatellite markers to researchers for many genetic studies in sugarcane. Precise allele sizing in automated fragment analysis system has encouraging implications to various high-throughput genotyping applications in sugarcane. Assessment of functional genetic diversity revealed that the genetic base of the Indian sugarcane varieties is not narrow.

Methods
Mining of microsatellites in the unigenes of sugarcane and assessment of their functional relevance Fifteen thousand five hundred ninty-four sugarcane (Saccharum sp.) unigenes (Build 13.0, 19 th Feb' 2008) comprising of 9.17 million base sequences were acquired in FASTA format from the recent NCBI FTP UniGene repository database of S. officinarum [32] in batches and searched for microsatellites using a perl scripting language based program MISA (MIcroSAtellite) [33] considering complementary sequence of repeat-motifs (like AG, GA, TC and CT) in the same class [18]. The identified microsatellites were characterized as perfect (monomers to hexamers repeated up to 100 times without any interruption), compound (non-interrupting) with at least two different repeat-motifs without any interruption, compound (interrupting) with a maximum of 100 nucleotides interposing two microsatellite repeat-motifs, class I (≥20 nucleotides) and class II (12 to 20 nucleotides) types. The microsatellite (excluding monomers) containing unigenes were used in a batch module for designing forward and reverse primers employing the microsatellite primer discovery tool of BatchPrimer3 [34]. All default options except for the optimum and maximum primer sizes of 22 and 24 nucleotides, respectively, were used. The putative functions for these unigenes were determined using NCBI BLASTX search tool [35] as against the nr-protein database. The BLASTX output was annotated into four functional categories viz, exact, putative, unknown and hypothetical, and then extracted to Excel sheets. Different structural components namely, coding and untranslated regions (5'UTRs and 3'UTRs) in the unigene sequences was determined using the software tools FGENESH [36] and UTRScan [37], respectively. The aminoacid sequences encoded by the predicted coding nucleotide regions of the microsatellite carrying unigenes were analyzed using the software Pfam [38] to determine the presence of functional domain/protein family within the UGMS. The aminoacid sequences of the functional domain carrying microsatellites was analyzed further using the I-TASSER automated web server [39,40] for prediction of three dimensional protein structure and active binding sites with ligands. Five different protein models and active binding sites were predicted at significant cut-off confidence (C ≥ -1.5) and binding site (BS ≥ 0.5) scores. The high quality protein model of correct topology and protein-ligand complex active binding sites was identified based on high C-and BS-scores.

Evaluation of polymorphic potential
To assess the amplification success rate of microsatellite markers designed from sugarcane unigenes, 176 primers were used to amplify one genotype of S. officinarum. Forty-seven, including thirty-eight class I and nine class II markers of these, labeled with 6-carboxyflourescein (6-FAM) dye phosphoramidites, were used to genotype one accession each belonging to five Saccharum species and three related genera, and twenty-eight commercial Indian tropical and sub-tropical sugarcane varieties (Table 3) for evaluating their polymorphic potential and assessment of genetic diversity. Five cereal species namely, rice, wheat, maize, Sorghum and barley were included in the experiment for comparison. Standard PCR constituents and cycling conditions except for annealing temperature, which varied from 55°C to 64°C depending on the primers, were used for PCR amplification. The amplified products were purified, mixed with 3.75 μl of MegaBACE formamide loading buffer and 0.25 μl of internal-lane MegaBACE™ET 550-R ROX size standard, denatured, cooled and resolved in automated MegaBACE 1000 DNA sequencer (Amersham Biosciences, Piscataway NJ, USA). The electropherogram containing trace files were analyzed and automated allele calling was carried out using the Binning Peak Post Processor tool of MegaBACE™Fragment Profiler Software Version 1.2 (Amersham Biosciences, Piscataway NJ, USA). The actual allele size (bp) was determined and fragment length polymorphism among the sugarcane genotypes and cereal species was identified. The average peak height and peak quality of alleles generated for all the UGMS loci were graphed and alleles exhibiting average peak height and quality of ≥ 1500 and ≥ 6.0 fluorescence units [41,42] respectively were considered for sizing and base-pair estimation. Allele binning was performed for individual UGMS loci for precise and accurate allele size determination and discrimination of homozygous and heterozygous allele types [42,43] for each genotype.
Variation in the fragment size (bp) of amplified alleles at each polymorphic UGMS marker locus was compared with changes in the number of microsatellite repeatunits at that target locus, and the occurrence of "stepwise" and "mixed" type of allele size distribution was inferred. When the allele size differences strictly corresponded to the variation in the number of repeat-units, it was considered as stepwise distribution. A mixed allele distribution was assumed when the allele size differences could partly be explained by the stepwise model. Multiple amplicons obtained by a primer-pair with peaks ≥ 1500 fluorescence units showing ≥ 100 bp allele size differences were binned into different loci. To confirm that the primers amplified the target microsatellite repeat motifs in different species and genera, the amplified products were purified using Micropon PCR purification kits (Millipore, Bedford, MA, USA) and sequenced two times in both forward and reverse directions using a capillary-based Automated DNA Sequencer (MegaBACE 1000, Amersham Biosciences, Piscataway NJ, USA). The trace files were base called, checked for quality and assembled into contigs [44]. The high quality sequences thus obtained were used for interspecies comparison using CLUSTALW multiple sequence alignment tool employing BIOEDIT software [45]. Ten size variant amplicons each of three microsatellite markers (UGSuM2, UGSuM26 and UGSuM27) showing both step-wise and mixed type of allele size distribution in sugarcane species, genera, varieties and cereals were eluted, purified, cloned in pGEM-T Easy Vector (Promega, USA) and sequenced as described above.
Research (IPK) for the availability of microsatellite search tool MISA and Dr. Athiappan Selvi, Sugarcane Breeding Institute, Coimbatore for providing the sugarcane genotypes.
Author details