An EST database from saffron stigmas

Background Saffron (Crocus sativus L., Iridaceae) flowers have been used as a spice and medicinal plant ever since the Greek-Minoan civilization. The edible part, the stigmas, is commonly considered the most expensive spice in the world and is the site of a peculiar secondary metabolism, responsible for the characteristic color and flavor of saffron. Results We produced 6,603 high-quality Expressed Sequence Tags (ESTs) from a saffron stigma cDNA library. This collection is accessible and searchable through the Saffron Genes database http://www.saffrongenes.org. The ESTs have been grouped into 1,893 clusters, each corresponding to a different expressed gene, and annotated. The complete set of raw EST sequences and their electropherograms is maintained in the database, allowing users to investigate sequence quality and EST structural features (vector contamination, repeat regions). The saffron stigma transcriptome contains a series of interesting sequences (putative sex determination genes, lipid and carotenoid metabolism enzymes, transcription factors). Conclusion The Saffron Genes database represents the first reference collection for the genomics of Iridaceae, for the molecular biology of stigma biogenesis, as well as for the metabolic pathways underlying saffron secondary metabolism.


Background
Saffron (Crocus sativus L.) is a triploid, sterile plant, probably derived from the wild species Crocus cartwrightianus. It has been propagated and used as a spice and medicinal plant in the Mediterranean area for thousands of years [1]. The domestication of saffron probably occurred in the Greek-Minoan civilization between 3,000 and 1,600 B.C. A fresco depicting saffron gatherers, dating back to 1,600 B.C., has been unearthed on the island of Santorini, Greece.
Saffron is commonly considered the most expensive spice on earth. Nowadays, the main producing countries are Iran, Greece, Spain, Italy, and India (Kashmir). Apart from the commercial and historical aspects, several other characteristics make saffron an interesting biological system: the spice is derived from the stigmas of the flower (Figure 1A), which are harvested manually and subjected to desiccation. The main colors of saffron, crocetin and crocetin glycosides, and the main flavors, picrocrocin and safranal, are derived from the oxidative cleavage of the carotenoid zeaxanthin [2,3] (Figure 1B). Saffron belongs to the Iridaceae (Liliales, Monocots), a family with poorly characterized genomes of relatively large size.
The characterization of the transcriptome of saffron stigmas is likely to shed light on several important biological phenomena: the molecular basis of flavor and color biogenesis in spices, the biology of the gynoecium, and the genomic organization of Iridaceae. For these reasons, we have undertaken the sequencing and bioinformatics characterization of Expressed Sequence Tags (ESTs) from saffron stigmas.

Results and discussion
Sequencing and assembly
An oriented cDNA library from mature saffron stigmas in lambda Uni-ZAP [2] was kindly provided by Prof. Bilal Camara, University of Strasbourg. The library was subjected to automated excision, and the cDNA inserts were subjected to PCR amplification and sequenced from the 5' end.
A total of 9,769 electropherograms were analyzed with the Phred program [4]. Low-quality sequences were removed from the 5' and 3' ends, and the sequences were further processed to remove vector contamination and to mask low-complexity and/or repeat sub-sequences. This process reduced the original dataset to 6,603 high-quality sequences longer than 60 nucleotides. Only the 6,202 ESTs with a length of at least 100 nucleotides were submitted to the NCBI dbEST division. They are accessible under the accession numbers EX142501 to EX148702.
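The two length thresholds described above can be illustrated with a minimal sketch. It assumes the reads have already been quality-trimmed and vector-screened; the function name `partition_by_length` and the toy reads are hypothetical, and this is not the pipeline actually used for the dataset.

```python
def partition_by_length(seqs, min_keep=60, min_submit=100):
    """Split trimmed reads into retained and dbEST-eligible sets.

    seqs maps a read name to its trimmed nucleotide sequence.
    A read is retained if it is longer than min_keep, and is eligible
    for submission if it is at least min_submit nucleotides long.
    """
    kept = {name: s for name, s in seqs.items() if len(s) > min_keep}
    submit = {name: s for name, s in kept.items() if len(s) >= min_submit}
    return kept, submit


# Toy reads (hypothetical names and sequences, not real data).
reads = {
    "read_a": "ACGT" * 10,  # 40 nt: discarded
    "read_b": "ACGT" * 20,  # 80 nt: retained, not submitted
    "read_c": "ACGT" * 30,  # 120 nt: retained and submitted
}
kept, submit = partition_by_length(reads)
print(sorted(kept), sorted(submit))

# Consistency check: size of the accession range EX142501 to EX148702.
n_accessions = 148702 - 142501 + 1
print(n_accessions)
```

The size of the accession range agrees with the 6,202 ESTs reported as submitted.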
The EST dataset was subjected to a clustering/assembling procedure [5], in order to group ESTs putatively derived from the same gene and to generate a tentative consensus sequence (TC) per putative transcript. The total number of clusters generated is 1,893. Each cluster should correspond to a unique gene, i.e. the set of clusters represents a gene index. 1,376 clusters are made up of a single EST and are therefore classified as singletons. The remaining 517 clusters are made up of 5,324 ESTs, assembled into 534 TCs (Table 1). In 11 clusters, ESTs are assembled so that multiple TCs are defined (ranging from 2 to 6). Multiple TCs in a cluster have common regions of high similarity that may be due to alternative transcripts, to paralogy or to domain sharing. The GC content distribution in the dataset is reported in Figure 2. The average GC content is around 44%.
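As a minimal sketch of the bookkeeping behind these figures, the snippet below recomputes the cluster and transcript totals from the reported counts and shows one common way to compute GC content. The helper `gc_content` is hypothetical and its handling of ambiguous bases (ignoring them) is an assumption, not necessarily the convention used for Figure 2.

```python
def gc_content(seq):
    """Return the GC percentage of a sequence, ignoring non-ACGT symbols such as N."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


# Counts reported in the text.
singletons = 1376
multi_est_clusters = 517
tcs = 534

total_clusters = singletons + multi_est_clusters  # clusters in the gene index
total_transcripts = singletons + tcs              # TCs plus singletons

print(total_clusters, total_transcripts)
print(gc_content("GGCCAATT"))
```

The transcript total obtained this way (TCs plus singletons) matches the 1,910 transcripts used in the functional annotation.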

The database and the web interface
The dataset was used to construct the Saffron Genes database [6]. The database architecture consists of a main MySQL relational database, where all the data generated are deposited, and two satellite databases, myGO and myKEGG. A user-friendly web interface was built using HTML and PHP scripts. A pre-defined query system supports data retrieval, and an HTML-tree graphical display is implemented to browse enzyme classes and metabolic pathways. Transcripts that match user-defined criteria can be mapped on-the-fly onto the KEGG metabolic maps, which are accessible as GIF images [7]. The electropherograms of the single ESTs can be downloaded to re-check sequence quality.
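To make the relational organization more concrete, the sketch below models the EST, TC and cluster relationship with an in-memory SQLite database. This is an illustration only: the actual database runs on MySQL, and the table names, column names and identifiers used here are hypothetical and do not reproduce the real Saffron Genes schema.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tc  (tc_id TEXT PRIMARY KEY, cluster_id TEXT NOT NULL);
CREATE TABLE est (est_id TEXT PRIMARY KEY, tc_id TEXT REFERENCES tc(tc_id));
""")

# Toy rows: cluster CL_1 holds two TCs, cluster CL_2 holds one.
conn.executemany("INSERT INTO tc VALUES (?, ?)",
                 [("TC_1", "CL_1"), ("TC_2", "CL_1"), ("TC_3", "CL_2")])
conn.executemany("INSERT INTO est VALUES (?, ?)",
                 [("est_1", "TC_1"), ("est_2", "TC_1"),
                  ("est_3", "TC_2"), ("est_4", "TC_3")])


def ests_in_cluster(conn, cluster_id):
    """Return the identifiers of all ESTs assembled into any TC of a cluster."""
    rows = conn.execute(
        "SELECT est.est_id FROM est JOIN tc ON est.tc_id = tc.tc_id "
        "WHERE tc.cluster_id = ? ORDER BY est.est_id",
        (cluster_id,),
    )
    return [row[0] for row in rows]


print(ests_in_cluster(conn, "CL_1"))
```

A pre-defined query of this kind is the sort of retrieval a web front end could expose without requiring users to write SQL.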

Automated functional annotation
In order to assign a preliminary function to each transcript, the TCs and singletons were compared to the UniProtKB/Swiss-Prot database using BLASTX. Of 1,910 transcripts, 1,158 (60.6%) have no hits, while the remaining 752 (39.4%) have at least one significant match in the protein database. Within this latter set, 131 transcripts (6.9% of the total) match proteins described as hypothetical, unknown or expressed, and therefore provide no information on the functional role of the transcript product.
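The breakdown above can be reproduced with a short sketch. The percentages are recomputed from the reported counts; the keyword-based `classify` function is a hypothetical illustration of how uninformative hit descriptions might be flagged, not the rule actually applied to the dataset.

```python
UNINFORMATIVE = ("hypothetical", "unknown", "expressed protein")


def classify(description):
    """Classify a best-hit description; None means no significant hit."""
    if description is None:
        return "no_hit"
    text = description.lower()
    if any(keyword in text for keyword in UNINFORMATIVE):
        return "uninformative"
    return "informative"


def percent(part, total):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# Counts reported in the text.
total, no_hit, with_hit, uninformative = 1910, 1158, 752, 131

print(percent(no_hit, total), percent(with_hit, total), percent(uninformative, total))
print(classify(None), classify("Hypothetical protein"), classify("Short-chain dehydrogenase"))
```

Note that all three percentages are expressed relative to the full set of 1,910 transcripts.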
Gene Ontology terms were assigned automatically to those 157 transcripts matching a protein in the UniProtKB/Swiss-Prot database whose accession numbers are present in the satellite database myGO (see Methods). In many cases, multiple gene ontology terms could be assigned to the same transcript, resulting in 210 assignments to the molecular function, 944 to the biological process and finally 2,192 to the cellular component class.
Figure 1. The saffron spice. A. Crocus flowers. Arrowheads point to the stigmas, which, harvested and desiccated, constitute the saffron spice. B. Biosynthetic pathway of the main saffron color (crocin) and flavors (picrocrocin and safranal) (from [2], modified).
To give a broad overview of the ontology content, the entire set of the ontologies was mapped onto the plant GO Slim terms. In the molecular function ontology class, the most represented terms describe catalytic (33.3%) and hydrolase activity (20.0%) (Figure 3A). The remaining categories are less represented. Considering the biological process class, the vast majority of the GO assignments correspond to the more general transport category (~78.8%) (Figure 3B). Finally, for the cellular component class the assignments were mainly given to the plastid (36%), mitochondrion (33%), and cytoplasmic membrane-bound vesicle (29%) components (Figure 3C). A total of 64 transcripts are associated with 46 distinct enzymes, as classified and described in the ENZYME repository [8]. Of these 46 enzymes, 35 map to 55 KEGG biochemical pathways [9]. Some enzymes occur in more than one pathway; 8 enzymes, by contrast, act in a single pathway and were classified as pathway-specific (data not shown).
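The notion of a pathway-specific enzyme can be sketched as follows, assuming an enzyme-to-pathway mapping is available as a dictionary. The function name, the enzyme labels and the pathway labels are hypothetical placeholders rather than entries from ENZYME or KEGG.

```python
def pathway_specific(enzyme_to_pathways):
    """Return, in sorted order, the enzymes that map to exactly one pathway."""
    return sorted(
        enzyme
        for enzyme, pathways in enzyme_to_pathways.items()
        if len(set(pathways)) == 1
    )


# Toy mapping (placeholder labels, not real EC numbers or KEGG maps).
toy_mapping = {
    "enzyme_1": ["pathway_a"],
    "enzyme_2": ["pathway_a", "pathway_b"],
    "enzyme_3": ["pathway_c"],
    "enzyme_4": [],  # no pathway mapping at all
}

print(pathway_specific(toy_mapping))
```

Enzymes without any pathway mapping are excluded, which mirrors the distinction in the text between the 46 enzymes identified and the 35 that could be placed on pathways.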

Genes expressed in Crocus stigmas
EST abundance in a contig can be indicative of the relative abundance of the corresponding mRNA in the stigma tissue. We identified the TCs that are composed of ≥ 20 ESTs (Table 2). The most highly expressed TC, Cl000057:2 (547 ESTs), bears homology to short chain dehydrogenases (PF00106.12). This protein family comprises members involved in hormone biosynthesis, like the product of the ABA2 gene of Arabidopsis, which catalyzes the conversion of xanthoxin into ABA aldehyde [10], or in sexual organ identity, like the product of the TASSELSEED2 (TS2) gene of maize (Figure 4). TS2 is expressed in pistil primordia cells of maize, where it activates a cell death process eliminating these cells from male reproductive organs [11]. Biochemical studies suggest that the TS2 protein is a hydroxysteroid dehydrogenase [12]. It will be interesting to determine the function and substrate specificity of the saffron Cl000057:2 product.
A large number of Cytochrome P450 sequences are expressed in saffron stigmas, some of them at very high levels (Tables 2 and 3). Also, lipid metabolism seems to be very active, judging from the TCs encoding proteins involved in this process (Table 3).
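Selecting abundant TCs by EST count can be sketched as below, assuming each EST has been assigned to a single TC. The function name and the toy identifiers are hypothetical, and EST counts from a non-normalized library are only a rough proxy for transcript abundance.

```python
from collections import Counter


def abundant_tcs(est_to_tc, min_ests=20):
    """Return (TC, EST count) pairs with at least min_ests members, most abundant first."""
    counts = Counter(est_to_tc.values())
    hits = [(tc, n) for tc, n in counts.items() if n >= min_ests]
    # Sort by decreasing count, breaking ties by TC name for a stable order.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Toy assignment of ESTs to TCs (hypothetical identifiers, not real data).
est_to_tc = {}
for i in range(25):
    est_to_tc[f"est_a{i}"] = "TC_A"
for i in range(20):
    est_to_tc[f"est_b{i}"] = "TC_B"
for i in range(3):
    est_to_tc[f"est_c{i}"] = "TC_C"

print(abundant_tcs(est_to_tc))
```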
Cl001432:1 encodes a protein similar to plastid terminal oxidase, involved in phytoene desaturation [14], while EST cr36_B21 encodes a protein similar to fibrillin, which is a carotenoid-binding protein in pepper chromoplasts [15]. Cl000468 encodes a carboxyl methyltransferase very similar to the one catalyzing the synthesis of bixin [16] (Figure 4). This TC seems to encode a "short" form of the annatto and crocus methyltransferases from GenBank, possibly derived from alternative splicing (Figure 4). Although a methyltransferase reaction has not been described in saffron stigmas, the biosynthesis of bixin and that of crocin share some features, since both pigments are derived from the oxidative cleavage of a carotenoid [17]. Finally, Cl000045:1 encodes a protein highly similar to the cauliflower Or gene product, a plastid-associated protein with a cysteine-rich DnaJ domain. A dominant Or mutation induces β-carotene accumulation in cauliflower inflorescences, suggesting that Or is somehow involved in the control of chromoplast differentiation [18,19].
Several TCs encode putative transcription factors (Table 3). The most abundantly expressed, Cl000348:1, encodes a Myb-like protein with high similarity to LhMyb (from Lilium, GenBank accession BAB40790), Myb8 (from Gerbera [20], also showing similarity to Cl000348:2) and Myb305 (from Antirrhinum [21]). All three factors are highly expressed in flowers. Also highly expressed is Cl001329:1, encoding a putative MADS box transcription factor. This protein shows high similarity to AODEF, a B-functional transcription factor from Asparagus expressed in stamens and inner tepals [22], and to LMADS1, a lily protein whose ectopic expression in dominant negative form causes an ap3-like phenotype in Arabidopsis [23].

Conclusion
The Saffron Genes database [6] has been designed to manage and to explore the EST collection from saffron stigmas, providing a reference for expression pattern analysis in this tissue as well as a primary view of the genomic properties of this species, representative of Iridaceae. The complete set of raw EST sequences and their electropherograms is maintained in the database, allowing users to investigate library quality and the structural features of single ESTs (vector contamination, repeat regions). Annotation is provided for single ESTs as well as for their assemblies (tentative consensus), to evaluate the consistency of the automated functional assignments. The putative transcripts associated with enzymes are organized into classes and can also be viewed in terms of enzyme assignments to metabolic pathways. This represents a straightforward way to investigate the properties of the stigma transcriptome. As discussed above, this transcriptome contains a series of interesting sequences, whose function can now be tested using in vivo or in vitro approaches.
Figure. ClustalW alignments of deduced protein sequences expressed in Crocus stigmas.