Large-scale collection and annotation of full-length enriched cDNAs from a model halophyte, Thellungiella halophila

Background Thellungiella halophila (also known as Thellungiella salsuginea) is a model halophyte with a small plant size, short life cycle, and small genome. It easily undergoes genetic transformation by the floral dipping method used with its close relative, Arabidopsis thaliana. Thellungiella genes exhibit high sequence identity (approximately 90% at the cDNA level) with Arabidopsis genes. Furthermore, Thellungiella not only shows tolerance to extreme salinity stress, but also to chilling, freezing, and ozone stress, supporting the use of Thellungiella as a good genomic resource in studies of abiotic stress tolerance. Results We constructed a full-length enriched Thellungiella (Shan Dong ecotype) cDNA library from various tissues and whole plants subjected to environmental stresses, including high salinity, chilling, freezing, and abscisic acid treatment. We randomly selected about 20 000 clones and sequenced them from both ends to obtain a total of 35 171 sequences. CAP3 software was used to assemble the sequences and cluster them into 9569 nonredundant cDNA groups. We named these cDNAs "RTFL" (RIKEN Thellungiella Full-Length) cDNAs. Information on functional domains and Gene Ontology (GO) terms for the RTFL cDNAs were obtained using InterPro. The 8289 genes assigned to InterPro IDs were classified according to the GO terms using Plant GO Slim. Categorical comparison between the whole Arabidopsis genome and Thellungiella genes showing low identity to Arabidopsis genes revealed that the population of Thellungiella transport genes is approximately 1.5 times the size of the corresponding Arabidopsis genes. This suggests that these genes regulate a unique ion transportation system in Thellungiella. Conclusion As the number of Thellungiella halophila (Thellungiella salsuginea) expressed sequence tags (ESTs) was 9388 in July 2008, the number of ESTs has increased to approximately four times the original value as a result of this effort. Our sequences will thus contribute to correct future annotation of the Thellungiella genome sequence. The full-length enriched cDNA clones will enable the construction of overexpressing mutant plants by introduction of the cDNAs driven by a constitutive promoter, the complementation of Thellungiella mutants, and the determination of promoter regions in the Thellungiella genome.


Background
Thellungiella halophila (also known as Thellungiella salsuginea) is well known as a model halophyte for studying abiotic stress tolerance, as the plant exhibits extreme salt and freezing tolerance [1][2][3][4][5][6][7][8][9]. Thellungiella is closely related to Arabidopsis, and its genes share approximately 90% identity to those of Arabidopsis [1,10,11]. Moreover, Thellungiella is characterized by good features from the perspective of genetic studies, such as small plant size, a short life cycle, a high seed number, and the ability to selfpollinate. Furthermore, as in Arabidopsis, transformation of Thellungiella plants can be accomplished by means of the floral dipping method. Since the sequence identities between Thellungiella and Arabidopsis are very high at the cDNA level, Arabidopsis cDNA microarrays or oligomicroarrays can be used for transcriptome analysis of Thellungiella plants. We previously compared expression levels of various genes between Thellungiella and Arabidopsis plants under normal or high-salinity conditions using an Arabidopsis cDNA microarray composed of 7,000 Arabidopsis genes. Interestingly, a large number of genes known to be inducible by abiotic and biotic stresses were highly expressed in Thellungiella under normal growth conditions [5]. The use of a 70-mer oligoarray with 25 000 Arabidopsis genes revealed that Arabidopsis exhibited a global defense strategy required for bulk protein synthesis, whereas induced genes in Thellungiella were involved in protein folding, modification, and redistribution [2]. However, because of failed hybridization or a low hybridization rate between Arabidopsis DNAs and Thellungiella mRNAs, the data obtained from heterologous microarrays cannot provide an accurate evaluation of the expression levels. Recently, Thellungiella plants (Yukon ecotype) treated with drought, salinity, and freezing stresses were used to construct expressed sequence tag (EST) libraries with a total of 3628 unique genes [9]. A cDNA microarray was established with these cDNAs, and the transcriptional profiles of Thellungiella plants under various stress conditions were obtained [8].
In previous work, we reported the development of fulllength enriched Arabidopsis cDNA libraries from plants grown under different conditions [13,26] using the biotinylated CAP trapper method with trehalose-thermoactivated reverse transcriptase [27][28][29][30]. A total of 155 144 RIKEN Arabidopsis Full-Length (RAFL) clones were isolated and clustered into 14 668 non-redundant cDNA groups [14]. Using the full-length cDNAs, we also created a microarray to analyze the expression profiles of Arabidopsis genes under various stress conditions or in various mutants and transgenic plants [12,26]. Using ectopic expression of full-length cDNAs, a novel gain-of-function system, termed the "FOX hunting system" (Full-length cDNA Over-eXpressing gene hunting system) was developed [31]. The Arabidopsis genome sequence and resources, including full-length cDNAs, also provide powerful tools for comparative genomics in furthering the understanding of the biology and evolution of other plant species [2,5,10,32]. In the present study, we constructed a full-length enriched cDNA library from whole Thellungiella plants and various tissues, in addition to cDNAs from seedlings subjected to high salinity, chilling, or freezing stress or to abscisic acid (ABA) treatment. We determined their DNA sequences from both the 5'-and the 3'-ends to permit the functional annotation of the Thellungiella full-length cDNAs, and we discuss their predicted functions related to abiotic stress tolerance.

Full-length enriched cDNA library construction and sequencing of 20 000 cDNAs
We used the biotinylated CAP trapper method [29] to construct a full-length cDNA library of Thellungiella halophila (Shandong ecotype) from whole plants as well as from various tissues, including leaves, roots, flowers, siliques, and mature seeds, of plants treated with high salinity, chilling, freezing stress, and ABA ( Table 1). The λFLCIII vector [33], which accommodates cDNAs in a broad range of sizes and is useful for the high-efficiency cloning of long cDNA fragments, was used for the con- struction of the cDNA library. To reduce the frequency of representation of highly expressed mRNAs in the library, normalization procedures [29] were employed in the construction process. The 20000 recombinant clones were randomly selected and sequenced from both ends. We determined 18636 and 16535 sequences from the forward and reverse directions, respectively, and from among the 20000 clones we obtained the forward or reverse sequences of a total of 19429 clones (Table 2). A total of 35171 sequences have been deposited in the DDBJ public sequence database (accession numbers, BY800476 to BY835646). We have named these "RTFL" (RIKEN Thellungiella Full-Length) cDNAs. Figure 1 shows the size distribution of the Thellungiella cDNA inserts from 1161 randomly selected clones.

Sequence assembly and the proportion of full-length cDNA clones in the library
The 35171 sequences were assembled by using the CAP3 program [35] to evaluate the level of sequence redundancy. Assembling these sequences generated 7402 contigs and 6556 singletons, and the sequenced 20000 cDNAs were clustered into 9569 nonredundant scaffolds that represented distinct genes ( Table 2). Figure 2 shows the degree of redundancy in the sequences from the cDNA library. The majority of the cDNAs (6024 cDNAs, 63% of the total), consisted of a single cDNA in a cluster, and only 5% contained more than six cDNAs in a cluster, indicating that the normalization procedures were successful.
The cDNA sequence data were submitted for the BLASTN search to compare with the green plants (Viridiplantae) mRNA databases in GenBank. Of the 19429 clones that we obtained as clean sequences, 18 295 clones (94%) showed > 80% identity, whereas the remaining clones (6%) had no significant identity to any plant sequences in GenBank.
We examined the proportion of full-length cDNA clones in our library. We selected clones that (1) had sequences from both ends, and (2) showed an Expected Value (E) < 1.0e -20 on the basis of a fastx search using forward sequences as queries against Arabidopsis proteins (TIGR v5, ATH1.pep, ftp://ftp.tigr.org/pub/data/a_thaliana/ ath1/SEQUENCES/), with the correct direction of the reading frame. We considered clones to be full-length if they met the following criteria: (1) they contained the first methionine, and (2)

Functional annotation of RTFL cDNAs
The 9569 nonredundant genes were submitted to InterPro [36] to obtain functional domain information. InterPro is an integrated resource for protein families, domains and functional sites that integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D, and PANTHER. Protein matches in InterPro are pre-calculated by using InterProScan software, which combines the different protein signature recognition methods offered by the InterPro member databases into one resource and provides the corresponding InterPro accession numbers and Gene Ontology (GO) annotations [37]. A total of 8289 sequences were assigned to InterPro IDs and GO terms [see Additional file 1]. According to the obtained GO terms, the 8289 genes were remapped and classified by using Plant GO Slim (http://www.geneontology.org/ GO.slims.shtml; [38]).

Figures 3 and 4 show the categorization of Arabidopsis genes and the 8289
Thellungiella genes assigned to the GO terms. In most categories, we observed no obvious differences between the numbers of sequences from Arabidopsis and Thellungiella, including genes involved in biological processes, cellular components, and molecular  4). Furthermore, the number of Arabidopsis genes associated with the nucleus under cellular components was also much higher than that in Thellungiella (Fig. 3A). Although the total number of transcription factors is not clear in Thellungiella genome, there seems to be no significant difference in the proportion of transcription factor genes between Arabidopsis and Thellungiella genome. The population of cDNAs may reflects the levels of gene expression. Thus, the expression level of Thellungiella transcription factors may be lower than that of Arabidopsis. Arabidopsis responds strongly to abiotic and biotic stresses at the transcriptional level. In contrast, Thellungiella does not initiate immediate changes in transcription in response to abiotic stresses, and instead constitutively expresses a large number of genes that correspond to stress-inducible genes in Arabidopsis [5]. Thus, the difference between Arabidopsis and Thellungiella in their responses to various stimuli at the transcriptional level may reflect differences in the number of transcription-related genes between the organisms and may depend partly on the number of transcription factors.

Features of Thellungiella-specific genes
Most Thellungiella genes have a high sequence identity (approximately 90% at the cDNA level) to Arabidopsis genes. Numerous studies of salt tolerance in Arabidopsis suggest that this plant contains most, if not all, the salttolerance related genes that might be found in halophytes [39]. The current hypothesis is that halophytes employ salt-tolerance mechanisms similar to those found in glycophytes, including Arabidopsis. However, subtle differences in this regulation result in large variations in salt tolerance between glycophytes and halophytes [11]. In addition, halophytes are hypothesized to exhibit specific salt-tolerance mechanisms resulting from the induction of halophyte-specific genes. We divided the 8298 genes into two groups on the basis of their sequence identities, using BLASTX searches against the Arabidopsis database. The group with high sequence identity to Arabidopsis genes (E value ≤ 1.0e -50 ) included 7535 genes, and the group with low identity to Arabidopsis genes (E value > 1.0e -50 ) included 763 genes. Previous studies revealed that a plasma membrane Na + /H + antiporter (SOS1), a vacuolar Na + /H + antiporter (NHX1), and a plasma membrane Na + transporter (HKT1) are essential for the salt tolerance of Arabidopsis [40][41][42], and these mutants exhibit a salthypersensitive phenotype. In contrast, plants that overexpress SOS1 and NHX1 show higher salt tolerance than wild-type plants [43,44]. The co-ortholog Thellungiella genes belong to the first gene group, exhibiting high iden-Size distribution of the RTFL clones Figure 1 Size distribution of the RTFL clones. The sequence lengths of the Thellungiella cDNA inserts were determined from a total of 1161 clones by digestion with SfiI (in the cDNA cloning site) or PCR amplification using T3 and T7 primers.

Number of RTFL clones
Predicted insert size (kbp) Sequence redundancy in the normalized cDNA library Figure 2 Sequence redundancy in the normalized cDNA library. A total of 35171 sequences were assembled and divided into 9569 clusters. The graph represents the number of clones per assembled scaffold. Clusters containing a single cDNA accounted for 63% of the identified sequences, whereas clusters that contained more than six cDNAs accounted for only 5% of the total.  Biological process categories for Arabidopsis genes, nonredundant Thellungiella genes, and Thellungiella genes that showed low identity to Arabidopsis genes Figure 4 Biological process categories for Arabidopsis genes, nonredundant Thellungiella genes, and Thellungiella genes that showed low identity to Arabidopsis genes. The 28227 Arabidopsis genes, the 8298 nonredundant Thellungiella genes, and the 763 Thellungiella genes that showed low identity to Arabidopsis genes assigned to InterPro IDs were classified according to the GO terms using Plant GO Slim http://www.geneontology.org/GO.slims.shtml for biological processes. † indicates the categories in which the number of Thellungiella genes was more than 1.5 times the number of Arabidopsis genes. * indicates categories in which the number of Thellungiella genes was more than 1.5 times the number of Arabidopsis genes. The number of Thellungiella genes under the categories of transport, DNA metabolic process, generation of precursor metabolites and energy, response to abiotic stimulus, multicellular organismal development, response to external stimulus, and cell differentiation was more than 1.5 times that in Arabidopsis.

Arabidopsis
Thellungiella low-identity Thellungiella gene tity to Arabidopsis genes. This suggests that some salt-tolerance mechanisms are common to both glycophytes and halophytes.
We compared the categorization of the whole Arabidopsis genome with the categories of the 763 Thellungiella genes that exhibited low identity to Arabidopsis genes (Fig. 4).
Of the genes involved in biological processes, the number of genes in the categories for transport, DNA metabolic process, generation of precursor metabolites and energy, response to abiotic stimulus, multicellular organismal development, response to external stimulus, and cell differentiation in Thellungiella were more than 1.5 times the number in Arabidopsis (Fig. 4). Moreover, in regards to molecular function, the proportion of genes involved in transporter activity in Thellungiella was also higher than in Arabidopsis. Less NaCl accumulates in Thellungiella plants than in Arabidopsis under similar salinity conditions, suggesting that Thellungiella has a superior system for suppressing Na + influx or for excreting Na + [5]. Electrophysiological analysis indicates that Thellungiella also exhibits high potassium/sodium selectivity, implying that Thellungiella has specific ion channel features that lead to superior homeostasis with respect to sodium and potassium [7]. Arabidopsis that overexpresses a plasma membrane Na + /H + antiporter gene, SOS1, shows salinity tolerance and represses its sodium uptake compared with that of wild-type plants [44]. Likewise, the expression level of SOS1 in Thellungiella is higher than in Arabidopsis [5,45]. Although SOS1 overexpression suggests a contribution of this gene to the salt tolerance of Thellungiella, the large proportion of transport genes may imply that Thellungiella has a distinct ion transportation system regulated by these specific genes. Table 3 [see Additional file 2] lists the Thellungiella genes with low identity (E value > 1.0e -50 ) to the Arabidopsis genes classified under transporter genes. Several transporters, including chloride channels and P-type H + -ATPase, play important roles in the salt tolerance of plants. Homeostasis of Na + and Clis an important mechanism to reduce NaCl stress in higher plants. Chloride channels (CLCs) are a group of voltage-gated Clchannels originally identified in animals [46]; they have diverse cellular functions such as stabilizing cell membrane potential and regulating cell volume and transcellular chloride transport [47]. Recently, a chloride channel gene, GmCLC1, was cloned from soybeans [48]. Transgenic tobacco BY-2 cells expressing GmCLC1 were able to drain Clmore efficiently from vacuoles than was the case in untransformed BY-2 cells, and the transgenics showed a higher NaCl tolerance [48]. The plant cell membrane is energized by an electrochemical gradient produced by P-type H + -ATPase (proton pump). These pumps are encoded by at least 12 genes in Arabidopsis. One of the Arabidopsis P-type H + -ATPase genes, AHA4, was expressed most strongly in the root endodermis [49]. The aha4 mutant plants exhibited a clear growth reduction under a mild stress of 75 mM NaCl compared with wild-type plants, and the ratio of Na + to K + in the aha4 mutants increased to between four and five times the values in wild-type plants. These results suggest that the aha4 mutants were compromised in their ability to exclude Na + under salinity stress [49]. P-type H + -ATPases were also found in a halotolerant cyanobacterium, Aphanothece halophytica, and a marine alga, Tetraselmis viridis [50,51]. Aphanothece halophytica grows under a wide range of salinity conditions (from 0.25 to 3.0 M NaCl), and Na + /H + antiporters in A. halophytica play a crucial role in Na + efflux to provide enhanced salt tolerance. Since the efflux of Na + mediated by Na + /H + antiporters utilizes protons as the motive force provided by a primary proton pump, H + -ATPase, the P-type H + -ATPase is thought to contribute to the salt tolerance of this species [52]. On the other hand, vacuolar ATPase (V-ATPase) is the major proton pump that establishes and maintains an electrochemical proton gradient across the tonoplast. Expression of several V-ATPase subunits or an increase in V-ATPase activity induced by salt stress has been observed in a number of glycophytic species [53], suggesting that increased V-ATPase levels or activity are required to drive Na + sequestration under salt stress. Recently, the V-ATPase-deficient det3 Arabidopsis mutant was shown to be extremely salt sensitive. Moreover, SOS2, a protein kinase that phosphorylates SOS1, interacted directly with the V-ATPase regulatory subunits B1 and B2 [54]. These studies indicate that V-ATPase activity plays a key role in salt tolerance. Although most Thellungiella genes show approximately 90% identity with Arabidopsis genes, the Thellungiella genes encoding transporters appear to be remarkably different from their Arabidopsis co-orthologs. Whether the sequence diversities among these genes are the source of the large differences in salt tolerance between Thellungiella and Arabidopsis is a topic of great interest.

Conclusion and cDNA resources
We generated a full-length enriched cDNA library of Thellungiella halophila from various tissues and whole plants treated with salinity, chilling, freezing stresses, or ABA. We isolated about 20000 full-length enriched cDNA clones (RTFL cDNAs) and sequenced them from both ends, and we outlined the features of their predicted functions (coding Thellungiella proteins) by comparing them with those of Arabidopsis. Moreover, the 35171 RTFL cDNA sequences have been deposited in the DDBJ public data center. The number of T. halophila (T. salsuginea) ESTs entries was 9388 as of July 2008, which means that our effort has increased the number of ESTs by four times the number before our study. Our sequences will thus contribute to correct annotation of the Thellungiella genome sequence in the near future. The RTFL cDNA clones will also enable the construction of overexpressing mutant plants by introduction of the cDNAs driven by a constitutive promoter, as well as the complementation of Thellungiella mutants and the determination of promoter regions in the Thellungiella genome. The RTFL clones will be available for distribution through the RIKEN Bioresource Center http://www.brc.riken.go.jp/lab/epd/Eng/. Rosette leaves and cauline leaves, roots, flowers, and siliques were collected from 7to 10-week-old plants, and mature seeds were collected from 12-to 20-week-old plants.

RNA extraction and construction of a full-length cDNA library
Total RNA was prepared by using TRIZOL Reagent (Life Technologies, Rockville, MD, USA) from the treated samples. A full-length cDNA library was constructed as previously reported [14,27,28] by means of the biotinylated CAP trapper method using trehalose-thermoactivated reverse transcriptase [28]. We used the λFLCIII [33] vector, which accommodates cDNAs in a broad range of sizes and is thus useful for the high-efficiency cloning of long cDNA fragments, for construction of the cDNA libraries [33].
The λFLCIII vectors can also be bulk-excised by a Cre-loxbased system free of size bias to generate the plasmid libraries. Normalization [29] was also introduced in the construction of the full-length cDNA library to reduce the frequency of highly expressed mRNAs in the library. The method is based on hybridization of first-strand, fulllength cDNA as the tester and cellular biotinylated RNA extracted from stress-treated plants and various tissues of Thellungiella as the normalizing driver.

Sequencing of Thellungiella cDNA clones
The DNA of each clone was directly amplified from 384 bacterial cultures in a glycerol stock plate by the RCA  The M13 (-21) primer (5'-TGTAAAACGACGGCCAGT-3') and the 1233 primer (5'-AGCGGATAACAATT-TCACACAGGA-3') were used for forward and reverse sequencing, respectively.

Trimming of sequence data and assembly
We used sim4 software for the detection of vector sequences [56]. Raw sequence data were base-called using the Phred software [57,58]. Regions of low quality found at both edges of each raw sequence were discarded, and we extracted only the high-quality region (Phred quality score > 14, and more than 20 bases repeated). After this initial evaluation, sequence data shorter than 100 bases in length or with many low-quality regions (Phred quality score ≤ 14, and more than 50% of its total length) were omitted. The ESTs were assembled by using CAP3 software [35] with its default parameters. All sequences were submitted to the DNA Databank of Japan (DDBJ) with accession numbers BY800476 to BY835646.

cDNA insert size of the RTFL clones
The sequence lengths of the Thellungiella cDNA inserts were determined from a total of 1,161 clones by digestion with SfiI (in the cDNA cloning site) or PCR amplification using T3 (5'-TGTAAAACGACGGCCAGT-3') and T7 primers (5'-AATACGACTCACTATAGGG-3').

Full-length cDNA library quality
To examine the proportion of full-length cDNA clones in this library, we selected the following clones as calculation objects: (1) clones with sequences from both ends, and (2) clones showing an Expected Value of (E) < 1.0e -20 in a fastx search using forward sequences as queries against Arabidopsis proteins (TIGR v5, ATH1.pep), with the correct direction of the reading frame. We used the following criteria to classify clones as full-length: (1) clones must include the first methionine, and (2) the reverse reading sequence must include the polyA sequence.

Scaffold construction
In order to obtain a non-redundant set of transcripts, the clones were clustered according to clone names. To accomplish this, we parsed the .ace file from the CAP3 program output to build scaffolds, which are groups of sequences that represent a unique transcript for which the relative position and orientation of the fragments can be inferred. Using clone names, the contigs or singletons corresponding to the two ends of a given clone were joined by adding 20 N's in the middle of both sequences. Since 20 is more than the default window size in BLAST searches, these N's did not interfere with the BLAST analyses.

Functional annotation of the sequences
Once these scaffolds were created, the sequences were submitted to InterPro [36] to obtain functional domain information. Protein matches in InterPro were pre-calculated with InterProScan software, available from http:// www.ebi.ac.uk/Tools/InterProScan/ [37,59]. InterProScan provided the corresponding InterPro accession numbers and GO annotation in the results [37]. The genes assigned to InterPro ID were classified according to the GO terms developed by InterPro using Plant GO Slim (http://www.geneontology.org/GO.slims.shtml; [38]).