EST analysis of the scaly green flagellate Mesostigma viride (Streptophyta): Implications for the evolution of green plants (Viridiplantae)

Background The Viridiplantae (land plants and green algae) consist of two monophyletic lineages, the Chlorophyta and the Streptophyta. The Streptophyta include all embryophytes and a small but diverse group of freshwater algae traditionally known as the Charophyceae (e.g. Charales, Coleochaete and the Zygnematales). The only flagellate currently included in the Streptophyta is Mesostigma viride Lauterborn. To gain insight into the genome evolution in streptophytes, we have sequenced 10,395 ESTs from Mesostigma representing 3,300 independent contigs and compared the ESTs of Mesostigma with available plant genomes (Arabidopsis, Oryza, Chlamydomonas), with ESTs from the bryophyte Physcomitrella, the genome of the rhodophyte Cyanidioschyzon, the ESTs from the rhodophyte Porphyra, and the genome of the diatom Thalassiosira. Results The number of expressed genes shared by Mesostigma with the embryophytes (90.3 % of the expressed genes showing similarity to known proteins) is higher than with Chlamydomonas (76.1 %). In general, cytosolic metabolic pathways, and proteins involved in vesicular transport, transcription, regulation, DNA-structure and replication, cell cycle control, and RNA-metabolism are more conserved between Mesostigma and the embryophytes than between Mesostigma and Chlamydomonas. However, plastidic and mitochondrial metabolic pathways, cytoskeletal proteins and proteins involved in protein folding are more conserved between Mesostigma and Chlamydomonas than between Mesostigma and the embryophytes. Conclusion Our EST-analysis of Mesostigma supports the notion that this organism should be a suitable unicellular model for the last flagellate common ancestor of the streptophytes. Mesostigma shares more genes with the embryophytes than with the chlorophyte Chlamydomonas reinhardtii, although both organisms are flagellate unicells. Thus, it seems likely that several major physiological changes (e.g. in the regulation of photosynthesis and photorespiration) took place early during the evolution of streptophytes, i.e. before the transition to land.


Background
The Viridiplantae (literally meaning green plants) include all green algae and embryophyte plants. They represent a monophyletic group of organisms, which display a surprising diversity with respect to their morphology, cell architecture, life histories and reproduction, and their biochemistry. The colonization of the terrestrial habitat by streptophyte algae 450-470 million years ago [reviewed in [1]] was undoubtedly one of the most important steps in the evolution of life on earth [2][3][4], which paved the way for the evolution of the various groups of land plants (embryophytes = bryophytes, pteridophytes and spermatophytes) resulting in our current terrestrial ecosystems [5].
A thorough understanding of the evolution of land plants requires knowledge about the phylogeny of green algae and embryophytes as well as insight into the evolution of plant genomes with special reference to developmental processes. Whereas our knowledge about the phylogeny of the Viridiplantae has greatly increased over the last years, the latter has hardly been addressed to date.
Two aspects in the evolution of land plants seem to be important in this respect. First, many key evolutionary inventions of plants already took place within the streptophyte algae. According to Graham et al. [21] one can distinguish several major transitions in the evolution of land plants starting with a Mesostigma-like flagellate ancestor: development of a cellulosic cell wall, multicellularity, cytokinesis by a phragmoplast, plasmodesmata, apical meristematic cell and apical cell proliferation leading to branching, asymmetric cell division, cell differentiation, retention of zygotes, heteromorphic life history, and a root meristem. Of these distinguishing features, only the last two did not evolve until the embryophytes emerged. Second, the colonization of the terrestrial habitat with its exposure to air, increased solar radiation and life in a desiccating environment led to adaptations of cell architecture, metabolism and body plan to survive in the terrestrial ecosystems [5]. The evolutionary history of these adaptations is currently not known. Important questions are: How did the green algal progenitor adapt to the terrestrial habitat? Which genomic changes were associated with this transition? And which of these genes are derived from streptophyte green algae? To gain insight into these questions we have started to analyze ESTs from various streptophyte algal lineages.
Here, we present an analysis of 10,395 ESTs representing 3306 non-redundant expressed genes obtained from Mesostigma viride. We show that the number of genes shared is higher between Mesostigma and the embryophytes than between Mesostigma and Chlamydomonas. Comparison of expressed genes from Mesostigma with the genomes of Arabidopsis, Chlamydomonas, the red alga Cyanidioschyzon, and rice as well as ESTs from Physcomitrella and Porphyra allowed us to identify conserved and derived cellular functions within the different evolutionary lines and to obtain a first insight into the metabolic capabilities of the flagellate ancestor of green plants.

Preparation and characterization of libraries
Total RNA was isolated from an axenic culture of Mesostigma viride during the light phase. The culture con- ESTs were assembled using the PHRAP software yielding 3300 contigs with an average size of 769 bases (57-4452 bases) after manual curation. Further analysis based on sequence similarity searches revealed that 294 of these contigs were of plastidic, mitochondrial, or possibly bacterial origin (sequences showing the highest similarity to organellar or bacterial genomes, Table 2). These contigs were excluded from the data set. 1315 of the 3006 contigs analyzed (44%) showed significant similarity at the protein level to sequences from the public databases (Table 2). Hence, approximately 56% of the contigs represent either novel sequences with unknown function or untranslated regions of a gene. However, when the 1691 contigs with no significant similarity to known proteins were searched against the Interpro protein motif database, 574 (33.9%) of these contigs contained a recognizable protein motif (Table 2). The most common protein motifs found in all 3006 expressed gene sequences were bipartite nuclear localization signals (IPR001472, 197x), proline-rich regions (IPR000694, 150x) and cytochrome c heme-binding sites (IPR000345, 99x). A functional catalogue was assembled using the 3006 Mesostigma contigs and the KOG-database and is presented in Table 3. As expected for an interphase cell, genes in the categories (1) translation, ribosomal structure and biogenesis (168), (2) posttranslational modification, protein turnover, chaperones (101), and (3) energy production and conversion (87) are represented by the largest number of contigs (Table 3). In the following, the assembled contigs are referred to as (expressed) genes.
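The contig counts reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative and not part of the original analysis.

```python
# Contig bookkeeping using the counts reported in the text.
total_contigs = 3300   # contigs after PHRAP assembly and manual curation
excluded = 294         # plastidic, mitochondrial or possibly bacterial origin
analyzed = total_contigs - excluded

with_hit = 1315        # contigs with significant protein-level similarity
without_hit = analyzed - with_hit
with_motif = 574       # contigs without a hit but with an InterPro motif

pct_with_hit = 100 * with_hit / analyzed
pct_motif = 100 * with_motif / without_hit

print(analyzed, without_hit)
print(round(pct_with_hit), round(pct_motif, 1))
```

The derived values (3006 analyzed contigs, 1691 without a database hit) agree with the figures quoted in the paragraph.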

Classification of Mesostigma ESTs according to homologous genes in other organisms
EST data represent only a fraction of all genes of an organism. Thus, comparisons of EST data alone cannot be used to describe unique or shared genes of an organism. For embryophytes, chlorophytes and red algae complete genome sequences of at least one organism exist. This makes it possible to find potential orthologous genes if present. Moreover, genes that an organism possesses in addition to those of a complete reference genome can be detected in EST data.
In tBLASTX analyses of the 1315 expressed genes with similarity to known proteins, 90.3 % matched proteins from streptophytes, 76.1 % from chlorophytes and 61 % from rhodophytes. In addition, 46 genes showed similarity to known proteins that have not been reported from plants or red algae to date. The overlap of Mesostigma genes with different organisms can be visualized in a Venn diagram (Figure 1). For 211 genes, we detected similar proteins only within the streptophyte but not in the chlorophyte or rhodophyte lineages. Conversely, for 62 genes we detected similar proteins only within the chlorophyte but not in the streptophyte or rhodophyte lineages. Surprisingly, we also found 6 genes which showed significant similarity to rhodophyte proteins but for which we could not detect any similar protein sequences within the Viridiplantae. Removal of BLAST hits with significant but low similarity (see Table 2) reduced the overall number to 972 expressed genes, but gave similar results (Figure 1). A complete list of genes showing similarity only to proteins with known functions present in specific subgroups of organisms can be found in supplemental Table 1 [see Additional file 1]. We will discuss important differences below.
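The lineage-specific categories described above correspond to the disjoint regions of a three-set Venn diagram. The following is a minimal sketch of that classification using set operations. The contig identifiers and the function name `venn_regions` are hypothetical and serve only as illustration; they are not taken from the original analysis.

```python
def venn_regions(strep, chloro, rhodo):
    """Return the seven disjoint regions of a three-set Venn diagram."""
    return {
        "streptophyte_only": strep - chloro - rhodo,
        "chlorophyte_only": chloro - strep - rhodo,
        "rhodophyte_only": rhodo - strep - chloro,
        "streptophyte_chlorophyte": (strep & chloro) - rhodo,
        "streptophyte_rhodophyte": (strep & rhodo) - chloro,
        "chlorophyte_rhodophyte": (chloro & rhodo) - strep,
        "all_three": strep & chloro & rhodo,
    }

# Hypothetical contig identifiers: each set holds the Mesostigma contigs
# with a significant hit in the respective lineage.
streptophyte = {"c1", "c2", "c3", "c4"}
chlorophyte = {"c2", "c3", "c5"}
rhodophyte = {"c3", "c6"}

regions = venn_regions(streptophyte, chlorophyte, rhodophyte)
for name, members in regions.items():
    print(name, sorted(members))
```

Because the regions are disjoint, their sizes sum to the number of contigs with a hit in at least one lineage.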

Overall protein similarities between various photoautotrophic organisms
To compare the overall similarity between Mesostigma and various photoautotrophic organisms with completed genomes or large EST data sets, we calculated the average identity (AI) between Mesostigma and each organism as the mean value of all pairwise identities of the BLAST matches for that organism (Table 4).
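As a minimal sketch of the AI calculation, assuming that each BLAST match contributes one pairwise identity expressed as a fraction between 0 and 1: the identity values below are invented for illustration and are not data from this study, and `average_identity` is a hypothetical helper name.

```python
from statistics import mean

def average_identity(identities):
    """Mean of the pairwise identities (fractions in [0, 1]) of the BLAST matches."""
    if not identities:
        raise ValueError("no BLAST matches for this organism")
    return mean(identities)

# Hypothetical pairwise identities of Mesostigma genes against two organisms.
example_identities = {
    "Physcomitrella": [0.72, 0.65, 0.70],
    "Chlamydomonas": [0.66, 0.61, 0.68],
}

ai = {org: average_identity(vals) for org, vals in example_identities.items()}
for org, value in ai.items():
    print(org, round(value, 3))
```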
The AI values between Mesostigma and Chlamydomonas or the embryophytes are very similar. The highest AI value obtained was for Physcomitrella/Mesostigma followed by Arabidopsis/Mesostigma, Chlamydomonas/Mesostigma and Oryza/Mesostigma. The full data set includes many proteins, which we detected only in some species using Mesostigma expressed genes as a query. Therefore, we constructed a constrained data set (314 expressed genes, including at least 46 nuclear encoded plastidic, 9 nuclear encoded mitochondrial, and 73 cytosolic ribosomal proteins), containing only Mesostigma genes which gave matches with all completed genomes from photoautotrophic eukaryotic organisms (including the diatom Thalassiosira). This constrained data set represents a conserved core set of nuclear encoded expressed proteins from photoautotrophic eukaryote organisms. We calculated AI values for the constrained data set using complete genomes and the available ESTs of Physcomitrella, Porphyra, and Chlamydomonas. The results are included in Table 4. We obtained the highest AI values in the constrained data set for the three embryophytes, followed by Chlamydomonas. The similar AI values for the three different embryophytes suggest that the overall evolutionary rate was very similar for the embryophytes investigated, when compared with Mesostigma (see below).

Figure 1 Classification of expressed genes from Mesostigma according to the presence of similar proteins in other organisms in a Venn diagram. All non-redundant expressed genes were used as a query in (t)blastx similarity searches with the Swissprot, Genbank, Chlamydomonas, Cyanidioschyzon, Porphyra, Physcomitrella, Arabidopsis and Oryza data sets. The outermost circle represents all Mesostigma expressed genes. The inner circles, which are labeled chlorophyte, streptophyte and rhodophyte, represent genes which have similarity to chlorophyte, streptophyte or rhodophyte sequences, respectively. The areas depicted are not proportional to the gene numbers and the number of Mesostigma expressed genes in each category is written in each segment. Numbers in brackets indicate the number of expressed genes in a category after removal of low similarity hits (see Table 2 for a definition of low similarity hits).
To test whether the observed differences are significant, a paired Student's t-test was performed, and the results are shown in Table 5. Applying a significance level of 0.0072 [0.05/7 Bonferroni adjustment [22]], the differences in AI between Mesostigma/Chlamydomonas and Mesostigma/embryophytes are highly significant (Table 5), whereas the differences in AI among the embryophytes are not significant (Table 5). Furthermore, when we varied the number of expressed genes used for the calculation of the AI, we observed that the significance of the differences became very stable once more than 100 ESTs were included (Fig. 2A). In addition, to evaluate the consistency of the data set, we calculated the AI eight times for 150 randomly selected expressed genes from the constrained data set. A clear difference between the AI values of the various organisms was always observed (Fig. 2B, 1-8). The expression level of the expressed genes (as revealed by the number of ESTs in a contig) had no effect on the differences between the investigated organisms (Fig. 2B, compare 9 and 10), although highly expressed genes are better conserved (Fig. 2B, 9 and 10).
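The paired comparison and the Bonferroni adjustment can be sketched as follows. This is an illustrative outline using only the Python standard library, not the procedure actually used to produce Table 5. The identity values are invented, the function names are hypothetical, and the p-value itself (which requires the t distribution from a statistics package) is not computed here.

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Return the paired t statistic and degrees of freedom for two matched samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two matched samples with at least two pairs")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the per-gene differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance level after Bonferroni adjustment."""
    return alpha / n_comparisons

# Hypothetical per-gene identities of the same Mesostigma genes
# against two different organisms (matched by gene).
organism_a = [0.70, 0.68, 0.75, 0.66]
organism_b = [0.65, 0.66, 0.70, 0.63]

t_stat, df = paired_t(organism_a, organism_b)
threshold = bonferroni_threshold(0.05, 7)
print(round(t_stat, 3), df, round(threshold, 4))
```

The test is paired because each Mesostigma gene is compared against both organisms, so the per-gene difference in identity is the quantity of interest. With seven comparisons, 0.05/7 is approximately 0.0071.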
Two other results are remarkable. First, for the calculation of the AI it is possible to use large EST data sets instead of genomes. We obtained the same result for the Mesostigma/Chlamydomonas genome and for the Mesostigma/Chlamydomonas ESTs (AI = 0.653 for both data sets; p = 0.975, Table 5, using 244 expressed genes from Mesostigma). Similarly, when Mesostigma/Physcomitrella ESTs were compared with the Mesostigma/Arabidopsis genome and with the Mesostigma/Oryza genome, only small differences were observed (AI = 0.675/0.681 and 0.675/0.673, respectively, using 302 expressed genes from Mesostigma, Table 5). Statistical analysis (paired Student's t-test) showed that the observed differences are not significant. Furthermore, we note that the genome of the diatom Thalassiosira pseudonana shows a similar AI with respect to Mesostigma as the red algal genome and ESTs (Table 4). The difference values of these distantly related genomes presumably represent an upper threshold for reasonable AI value calculations.

Table 5: Statistical significance of the obtained AI values. A paired Student's t-test was performed for the constrained data set to test whether the observed differences between the average identity of pairwise comparisons of Mesostigma expressed genes with the indicated organismal data set are significant. Differences are considered significant when p is < 0.0071 (0.05/8 Bonferroni adjustment [22]).

The total data set contains all Mesostigma expressed genes with significant similarity to proteins from other organisms with known or unknown function (see Table 2). The constrained data set contains only Mesostigma expressed genes with significant similarity to proteins in all completely sequenced eukaryotic autotroph organisms.
1) Number of ESTs showing similarity to Mesostigma expressed genes from the constrained data set in a tBLASTX analysis.
2) Evolutionary distances were calculated using the constrained data set and the approximation given by Kimura [28]: $D = -\ln(1 - p - 0.2p^2)$, where p is the fraction of amino acids that differ between the two species.

Analysis of metabolic pathways
ESTs have been widely used for the identification of metabolic pathways [23]. A complete list of all metabolic pathways identified is presented in supplemental Table 2 [see Additional file 2]. Indeed, many ESTs showed similarity to proteins required for photosynthesis (66 expressed genes), nucleotide synthesis (6), nucleotide sugar conversion, the biosynthesis of precursors of scale polysaccharides (6), heme and chlorophyll biosynthesis (6), fatty acid and lipid biosynthesis (9), terpenoid biosynthesis (6), glycolysis (11) and the TCA-cycle including pyruvate dehydrogenase and respiration (12). The biosynthetic pathways for several amino acids were also well represented in our ESTs (21 expressed genes for Ala, Arg, Gly, Ile, Leu, Lys, Pro, Ser, Thr, Trp and Val). However, for several other amino acids (Asn, Asp, Cys, Gln, Glu, His, Met, Phe, Tyr) we did not find a single EST which could be matched to the known biosynthetic pathways.
All enzymes except one (triose isomerase) of the Calvin cycle are represented by at least one EST. Interestingly, we found several genes coding for subunits of the plastidic GAPDH. In angiosperms the plastidic GAPDH consists of an A2B2 heterotetramer [24]. Compared to GAPDH A, which is present in the plastids of all eukaryotic algae, GAPDH B has a C-terminal extension that contains the two conserved cysteine residues, which are required for regulation by the thioredoxin system. To our knowledge, GAPDH B has only been reported from streptophytes. Two genes of Mesostigma showed significant similarity to GAPDH B from angiosperms. We present an alignment of the C-terminus of Mesostigma GAPDH B with the C-terminus of spinach GAPDH B in Figure 3. The two sequences are very similar and the two cysteines required for regulation by the thioredoxin system are conserved in Mesostigma, indicating that the activity of plastidic GAPDH came under the control of the thioredoxin system early during the evolution of streptophytes. We found no evidence for a GAPDH B in Chlamydomonas or other chlorophytes. Therefore, the evolution of a GAPDH B might represent a molecular characteristic (synapomorphy) of the streptophytes.
A total of 25 expressed genes encode components of the light-harvesting complex. There are some light-harvesting complex proteins which Mesostigma shares only with the chlorophytes and red algae (e.g. so-called fucoxanthin/chlorophyll a-binding proteins). For others, we detected similar proteins only within embryophytes. However, the lhc proteins form a large superfamily and their phylogenetic analysis is beyond the scope of this study.

[Figure: Consistency of the constrained data set used to calculate AI value]
Several genes encode proteins of the photorespiratory C2 cycle (glycolate phosphatase, peroxisomal glycolate oxidase, a component of the glycine decarboxylase enzyme complex, and a peroxisomal serine-glyoxylate transaminase). As in embryophytes, the NADH required for reduction of hydroxypyruvate is produced by a peroxisomal NADH malate dehydrogenase.
A glycolate oxidase activity was never detected in chlorophytes by biochemical enzyme assays, but one Chlamydomonas protein is currently annotated as a glycolate oxidase (gene model C_340068, JGI Chlamydomonas reinhardtii v2.0). We therefore performed a phylogenetic analysis for glycolate oxidases and lactate dehydrogenases, which are both members of the same protein superfamily, from embryophytes, Mesostigma, Chlamydomonas, Cyanidioschyzon, Dictyostelium, a few metazoans and some bacteria (Fig. 4). The glycolate oxidases from embryophytes, Mesostigma and Cyanidioschyzon are monophyletic. In contrast, the glycolate oxidase-like sequence from Chlamydomonas clusters with bacterial sequences, which are annotated as lactate dehydrogenases and glycolate oxidases. Therefore, we conclude that, in agreement with the biochemical findings, Chlamydomonas does not contain a plant-type peroxisomal glycolate oxidase.
We did not find evidence for a hexokinase and sucrose biosynthesis in interphase cells of Mesostigma. Several ESTs represent plastidic pyruvate kinase; however, only a single EST coded for the cytosolic isoform. Expressed genes for PEP carboxylase and a cytosolic malate dehydrogenase are present, suggesting that malate may be the major substrate for respiration in the mitochondrion of Mesostigma, as in many embryophytes. The plastidic pyruvate kinase probably functions in the generation of acetyl-CoA required to sustain fatty acid synthesis in plastids.
Scales consist mainly of the 2-keto sugar acids 3-deoxy-manno-octulosonic acid (2-keto-3-deoxy-octonate, kdo), 5OMekdo and 3-deoxy-lyxo-heptulosaric acid (dha), together with gal, galA, gul and some minor monosaccharides [25]. Expressed genes coding for kdo synthesis and for the activation of kdo as CMP-kdo are present. The obtained sequence similar to a CMP-sialA transporter might actually be the CMP-kdo transporter necessary for uptake of CMP-kdo into the Golgi apparatus, as kdo and sialA are structural analogs. Interestingly, kdo-synthase and CMP-kdo-transferase are among the most conserved proteins between Mesostigma and the embryophytes. As in embryophytes [26], galA is synthesized via the UDP-glc dehydrogenase pathway and the myo-inositol oxygenase pathway. We could not detect the latter enzyme in Chlamydomonas or red algae.
Our EST data support the presence of vitamin B12 biosynthesis and the production of the phosphagen phosphoarginine by arginine kinase in Mesostigma.
Exceptions to this rule are proteins of the cytoskeleton (Table 7) and proteins involved in protein folding (chaperones, Remarkably, the three embryophytes behave differently in our analysis. We obtained higher AI values with Physcomitrella regarding the categories protein folding (chaperones), vesicular transport, transcription, and regulation (Table 7). In contrast, proteins related to DNA structure, replication, cell cycle and RNA-metabolism were more conserved between Mesostigma and the angiosperms Arabidopsis and Oryza than between Mesostigma and Physcomitrella (Table 7).

Discussion
In this study, we have analyzed about 3000 expressed genes from the scaly green flagellate Mesostigma viride. We compared the expressed genes with the complete genomes from the angiosperms Arabidopsis thaliana and Oryza sativa, the chlorophyte Chlamydomonas reinhardtii, the red alga Cyanidioschyzon merolae and the diatom Thalassiosira pseudonana, as well as the ESTs from the moss Physcomitrella patens, and the red alga Porphyra yezoensis. Altogether, the Mesostigma proteome is more similar to the embryophytes than to Chlamydomonas, although Mesostigma and Chlamydomonas are both flagellate unicells. Mesostigma shares more genes with the embryophytes than with Chlamydomonas, including several enzymes confined to the streptophytes (e.g. GAPDH B, [Cu-Zn] superoxide dismutase), and the average identity of shared proteins is higher between Mesostigma and the embryophytes than between Mesostigma and Chlamydomonas. Therefore, we consider Mesostigma to be a member of the streptophytes, although Mesostigma clearly shares some ancestral characters with chlorophytes. Plastidic (with the exception of the Calvin cycle) and mitochondrial functions, for example, seem to be more conserved between Mesostigma and chlorophytes than between Mesostigma and embryophytes, i.e. these functions are more derived in embryophytes, probably due to adaptation of embryophytes to the terrestrial habitat. In contrast, other cellular functions except for the cytoskeleton are more conserved between Mesostigma and embryophytes than between Mesostigma and Chlamydomonas. Interestingly, in previous phylogenetic analyses plastidic and mitochondrial genes failed to show a clear relationship between Mesostigma and the streptophytes [14,15], whereas actin and nuclear-encoded SSU rDNA phylogenies support the notion that Mesostigma is a member of the streptophytes [10][11][12]. The different evolutionary rates for different cellular functions observed in this study might explain this discrepancy.

Figure 4 Phylogenetic tree of glycolate oxidase and glycolate oxidase-like genes. The tree shown was derived by Bayesian inference analysis from 402 amino acid positions using a mixed model for amino acid substitutions and a gamma correction for rate variation among sites. The Bayesian inference utilized MRBAYES, Ver. 3.0 * with posterior probabilities derived from 100000 generations and discarding a burnin of 1000. The tree obtained with a parsimony analysis using PHYLIP gave essentially the same topology.
We calculated the average identity (AI) values from automatically generated BLAST output alignments. Automatically derived alignments are prone to errors. However, we believe that our approach is justified for the following reasons: (1) the BLAST alignments cover only the conserved parts of proteins, and our calculated AI values indicate that in most alignments more than half of the amino acids are identical, which enhances the quality of the automatically produced alignments; (2) although small mistakes may occur, they are insignificant given the high number of amino acids used to calculate the AI. On average the BLAST alignments contained about 150 amino acids and therefore about 45,000 amino acid positions were used in the constrained data set. In large data sets small unbiased errors become irrelevant [27]. Our results indicate that at least 100 (preferably 150-200) expressed genes have to be used to obtain statistically significant results. It could be argued that our analysis uses only similarity values and no real evolutionary distances. AI values can be easily converted into evolutionary distances using an approximation given by Kimura [28], with the effect that the differences between the various organisms become larger but no changes occur in the order of relatedness (included in Table 4). We conclude that the AI of proteins shared between different organisms represents a reasonable measure of evolutionary relatedness, if sufficiently large data sets are used.
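The conversion from AI to an evolutionary distance with Kimura's approximation, D = -ln(1 - p - 0.2 p^2), can be sketched as below, assuming that p is taken as 1 - AI (the fraction of differing amino acids). The function name is hypothetical. The two AI values are those quoted in the text for Chlamydomonas (0.653) and Arabidopsis (0.681); they were obtained with different gene subsets, so the comparison is illustrative only.

```python
import math

def kimura_distance(ai):
    """Convert an average identity (fraction) into Kimura's distance estimate."""
    p = 1.0 - ai                      # fraction of differing amino acids
    arg = 1.0 - p - 0.2 * p * p
    if arg <= 0:
        raise ValueError("identity too low for Kimura's approximation")
    return -math.log(arg)

d_chlamydomonas = kimura_distance(0.653)
d_arabidopsis = kimura_distance(0.681)
print(round(d_chlamydomonas, 3), round(d_arabidopsis, 3))
```

Because the transformation decreases monotonically with AI, the order of relatedness is preserved while the differences between organisms become larger, as stated above.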
In the following, we briefly discuss some major differences in coding potential observed between the different photosynthetic eukaryotic organisms.

of 18 proteins included in supplemental Table 1 [see Additional file 1] which are shared only by Mesostigma and Chlamydomonas are associated with flagellar functions such as axonemal dyneins or components of the IFT (intra-flagellar transport) machinery. Most likely, the angiosperms lost these proteins during evolution together with the ability to produce flagellate cells. The absence of these proteins in the ESTs from the moss Physcomitrella is presumably due to the fact that ESTs from developing spermatozoids are not available.
Proteins shared by Mesostigma and the embryophytes but not present in chlorophytes perform diverse functions. There are some well-known biochemical differences between chlorophytes and streptophytes such as the presence of (Cu-Zn) superoxide dismutase [29,30] and glycolate oxidase in streptophytes [31,32] but not in chlorophytes. In addition, streptophytes use the DXP and mevalonate (MVA) pathways for isoprene biosynthesis whereas chlorophytes possess only the DXP pathway [33]. For all these functions, we find molecular support in our expressed gene data set except for the mevalonate pathway of isoprene biosynthesis. Two genes matched two different enzymes of the DXP pathway; however, no matches for the MVA pathway were obtained, although the presence of this pathway has been demonstrated biochemically [33]. This could be due to the selective expression of one or the other pathway under different environmental conditions.
Remarkably, our list of proteins uniquely shared by Mesostigma and the embryophytes includes several proteins involved in steroid biosynthesis (e.g. a 3-oxo-5-beta-steroid dehydrogenase and a C-4 sterol oxidase), a homeobox protein of the knox family and proteins of the F-box family. The latter protein family underwent a dramatic expansion in the embryophytes (Arabidopsis has more than 700 members of this family).
Our expressed protein data set contains sequences similar to a protein involved in vitamin B12 metabolism (present in rhodophytes and chlorophytes), an arginine kinase and an ARL6 protein; the latter two are absent in chlorophytes, embryophytes and red algae. It has been shown that arginine kinase is part of the ATP regeneration system in cilia of Paramecium [34]. Chlamydomonas lacks arginine kinase, and recently Pazour et al. [35] showed that enzymes of the late glycolytic pathway are present in the flagella of Chlamydomonas, suggesting that the ATP required for flagellar function is produced by the glycolytic pathway in Chlamydomonas. The ARL6 protein has been implicated in protein translocation at the rER [36], although its exact function is still not known.
There are some typical embryophyte pathways that we failed to detect in Mesostigma, e.g. sucrose metabolism, hexokinase, and enzymes of cellulose biosynthesis. There are no reports about the presence of sucrose metabolism and hexokinase in green algae in the literature, whereas embryophyte-like Ces genes (catalytic subunit of cellulose synthase) have been reported in the streptophyte alga Mesotaenium [37]. Although we cannot exclude that Mesostigma lost these genes, we do expect to find these genes in the genome of Mesostigma.

Evolution of photosynthesis and photorespiration
It is well known that embryophytes and chlorophytes differ in important aspects of photosynthesis and its regulation, and in photorespiration (e.g., presence of GAPDHB, number of enzymes regulated by thioredoxin, glycolate oxidase vs. glycolate dehydrogenase, and presence or absence of (Cu-Zn) superoxide dismutase). Table 8 summarizes the available information on the regulation of plastidic proteins by the thioredoxin system. The number of thioredoxin-regulated proteins has apparently increased during evolution and Mesostigma in this respect most closely resembles the embryophytes. Similarly, the peroxisomes of Mesostigma have been biochemically characterized as "leaf-type peroxisomes" [38] in full agreement with our EST-data. In contrast, chlorophytes lack glycolate oxidase and photorespiration involves only chloroplast and mitochondrial enzymes [38]. Interestingly, red algae possess a peroxisomal glycolate oxidase whereas the other enzymes of the photorespiratory cycle are located in the mitochondrion [32]. Thus, it seems likely that at the onset of streptophyte evolution major changes occurred in the regulation of the Calvin cycle and the subcellular organization of photorespiration. What might have been the driving force for these changes? We note that rhodophytes and chlorophytes both presumably evolved in a marine environment [red algae in a coastal benthic habitat, whereas chlorophytes proliferated as marine phytoplankton [39]]. Streptophyte algae most likely originated in a freshwater/brackish environment. In contrast to their marine counterparts, they had to deal with much higher light intensities and fluctuating environmental conditions such as salinity and temperature. With higher temperature, the rate of photorespiration increases. The observed changes in regulation of the Calvin cycle and photorespiration might be adaptations to this stress. 
It is possible that these adaptations to a shallow freshwater/brackish environment prepared streptophytes to colonize the terrestrial habitat later during evolution. In this respect we note that in extant chlorophytes activation of carbon concentrating mechanisms (CCM) is the dominant reaction to compensate for increased photorespiratory losses [38]. In contrast, streptophytes are able to channel large amounts of glycolate through the photorespiratory cycle [38]. According to Badger and Price [40] CCMs did not evolve until 400 million years ago, long after streptophytes had evolved and the colonization of the terrestrial habitat by streptophyte algae took place. Therefore, during the Palaeozoic era, with reduced CO2 and increased O2 levels [40], streptophyte algae might have had an advantage over chlorophyte algae, allowing them to colonize the terrestrial habitat during that time.

Conclusion
In summary, our EST analysis shows that Mesostigma shares more genes with the embryophytes than with the chlorophyte Chlamydomonas reinhardtii, although both organisms are flagellate unicells. Thus, it seems likely that many typical biochemical characteristics of streptophytes evolved early during the evolution of streptophytes, i.e. before the transition to land. Alternatively, such characteristics may have been lost in the chlorophyte lineage or remain to be discovered in other chlorophytes. A decision between these alternatives requires further information on the genomes of other chlorophytes, preferably early branching ones such as Pyramimonas.
Our EST-analysis of Mesostigma supports the notion that this organism should be a suitable unicellular model for the last flagellate common ancestor of the streptophytes.

Plant material, RNA preparation and construction of libraries
Total RNA was isolated from cultures of Mesostigma viride Lauterborn (strain NIES 476, Tsukuba, Japan), and mRNA was isolated using the mTRAP™ Total Kit (Active Motif). 5 µg of mRNA were converted into cDNA using the SuperScript™ Plasmid System (Invitrogen), and the cDNA obtained was fractionated by column chromatography. A large and a small size fraction were cloned into the pSPORT1 vector (Invitrogen).
Normalized full-length cDNA was prepared by Evrogen JSC (Moscow, Russia). cDNA was prepared from total RNA using the SMART approach [41], normalized using the DSN normalization method [42], and then amplified by PCR. cDNAs were either cloned directly into a pPCR-Script Amp SK(+) vector (PCR-Script Amp Cloning Kit, Stratagene), or a large size fraction was isolated by column chromatography and then cloned into a pGEM-T Easy vector (Promega). All libraries were transformed into TOP10 E. coli cells (Invitrogen) by electroporation.

Sequencing, contig assembly and data analysis
Clone preparation and sequencing
Isolated plasmids were sequenced by the cycle sequencing method using an ABI 3700 96-capillary sequencer. A minimal contig set was assembled using the phrap assembler, and all contigs were manually curated.

Annotation
Each contig was compared as 3-frame translations to the protein databases Swissprot and genpept using blastx. Furthermore, all contigs were compared, using the tBLASTX search algorithm, to the genome sequences of Chlamydomonas reinhardtii, Cyanidioschyzon merolae, Thalassiosira pseudonana, Arabidopsis thaliana and Oryza sativa, and to the EST databases of Physcomitrella patens and Porphyra yezoensis. The results were compiled into an Excel-compatible file. Analyses of COG and KOG categories [43,44] and InterPro protein domains [45] were also performed for the contigs.
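The compilation step described above can be illustrated with a minimal sketch. It assumes that each search was saved in the standard 12-column BLAST tabular format and that a single best hit per contig is kept below an e-value cutoff. The cutoff value, the function names and the example hits are hypothetical and are not taken from the original analysis.

```python
import csv
import io

# Column order of standard BLAST tabular output.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

E_VALUE_CUTOFF = 1e-10  # assumed threshold, not stated in the text


def best_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return {contig: (subject, evalue, bitscore)}, keeping the top-scoring hit."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        evalue, bitscore = float(hit["evalue"]), float(hit["bitscore"])
        if evalue > cutoff:
            continue  # hit is not significant under the assumed cutoff
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[2]:
            best[hit["qseqid"]] = (hit["sseqid"], evalue, bitscore)
    return best


def compile_table(results_by_db):
    """Merge per-database best hits into one CSV string, one row per contig."""
    databases = list(results_by_db)
    contigs = sorted({c for hits in results_by_db.values() for c in hits})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["contig"] + databases)
    for contig in contigs:
        row = [results_by_db[db].get(contig, ("", 0.0, 0.0))[0] for db in databases]
        writer.writerow([contig] + row)
    return out.getvalue()


# Hypothetical hits; identifiers are placeholders, not real accessions.
arabidopsis = (
    "contig1\tATH_A\t60.0\t100\t40\t0\t1\t300\t1\t100\t1e-30\t120\n"
    "contig2\tATH_B\t35.0\t50\t30\t2\t1\t150\t1\t50\t0.5\t30\n"
)
chlamydomonas = (
    "contig1\tCRE_A\t55.0\t100\t45\t0\t1\t300\t1\t100\t1e-20\t95\n"
    "contig3\tCRE_C\t70.0\t80\t24\t0\t1\t240\t1\t80\t1e-40\t150\n"
)

table = compile_table({
    "Arabidopsis": best_hits(arabidopsis),
    "Chlamydomonas": best_hits(chlamydomonas),
})
print(table)
```

The resulting comma-separated text can be opened directly in a spreadsheet program. An empty cell marks a contig without a significant hit in that database.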

Analysis of metabolic pathways
Using the metabolic pathways present at the AraCyc website [46], we identified all expressed genes with significant similarity to Arabidopsis genes present in AraCyc.
Expressed genes that showed no significant similarity to Arabidopsis genes, but were similar to enzymes from other organisms, were assigned to a pathway using the MetaCyc database [47].
Table 8 footnotes: n.d., not detected in Mesostigma. 1) In Galdieria (Cyanidioschyzon), 2 (1) of the 3 conserved cysteines occurring in the Viridiplantae are present [48]. 2) Chlorophyte NADP-malate dehydrogenase possesses a C- and N-terminal extension like the embryophyte enzyme; however, only the C-terminal cysteines of the embryophyte enzyme are conserved [49,50]. 3) A few cyanobacteria contain an unusual rubisco activase. Only the central AAA+ domain shows similarity to plant rubisco activases, whereas the N- and C-terminal domains are very different [51]. 4) Many angiosperms contain two forms of rubisco activase. Only the long form is regulated by the thioredoxin system [52].
The nearly complete Mesostigma glycolate oxidase sequence was obtained by complete sequencing of EST clone Meso2b12b08. The sequences were aligned using Clustal X, and the alignment was checked manually. Phylogenetic analyses were performed using Phylip (neighbour-joining and parsimony methods) and MrBayes v3.0 (Bayesian inference).
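The two-step pathway assignment described in this section (AraCyc first, MetaCyc as fallback) can be sketched as a simple lookup. This is only an illustration under stated assumptions: the lookup tables, gene identifiers, enzyme names and the function `assign_pathway` are hypothetical placeholders, and the actual assignment may have involved manual inspection.

```python
# Hypothetical lookup tables; real ones would be derived from AraCyc and MetaCyc.
ARACYC = {"ATH_GENE_1": "pathway A"}   # Arabidopsis gene -> pathway
METACYC = {"enzyme X": "pathway B"}    # enzyme name -> pathway


def assign_pathway(contig_hits, aracyc=ARACYC, metacyc=METACYC):
    """Assign a contig to a pathway and report which database was used.

    contig_hits may contain 'arabidopsis_gene' (best significant Arabidopsis
    hit) and 'other_enzyme' (best significant hit from another organism).
    """
    gene = contig_hits.get("arabidopsis_gene")
    if gene in aracyc:
        # Step 1: significant similarity to an Arabidopsis gene present in AraCyc.
        return aracyc[gene], "AraCyc"
    enzyme = contig_hits.get("other_enzyme")
    if enzyme in metacyc:
        # Step 2: no usable Arabidopsis hit, fall back to MetaCyc.
        return metacyc[enzyme], "MetaCyc"
    return None, None  # contig remains unassigned


print(assign_pathway({"arabidopsis_gene": "ATH_GENE_1"}))
print(assign_pathway({"other_enzyme": "enzyme X"}))
print(assign_pathway({}))
```

Recording the source database alongside the pathway makes it easy to see which assignments rest on an Arabidopsis match and which rest on an enzyme from another organism.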

Data deposition
Sequence data from this article have been deposited with the EMBL/Genbank data libraries under accession numbers DN254242 to DN264595.

Authors' contributions
AS constructed the libraries and participated in the analysis of the EST-data. GG sequenced the ESTs, participated in the analysis of the EST-data and helped to draft the manuscript. MF participated in the analysis of the EST-data. MM participated in the design of the study, the analysis of the EST-data and helped to draft the manuscript. BB conceived the study, and participated in its design, coordination and analysis of the EST-data, and helped to draft the manuscript. All authors read and approved the final manuscript.