ocsESTdb: a database of oil crop seed EST sequences for comparative analysis and investigation of a global metabolic network and oil accumulation metabolism
© Ke et al.; licensee BioMed Central. 2015
Received: 12 June 2014
Accepted: 22 December 2014
Published: 21 January 2015
Oil crop seeds are important sources of fatty acids (FAs) for human and animal nutrition. Despite their importance, there is a lack of an essential bioinformatics resource on gene transcription of oil crops from a comparative perspective. In this study, we developed ocsESTdb, the first database of expressed sequence tag (EST) information on seeds of four large-scale oil crops with an emphasis on global metabolic networks and oil accumulation metabolism that target the involved unigenes.
A total of 248,522 ESTs and 106,835 unigenes were collected from the cDNA libraries of rapeseed (Brassica napus), soybean (Glycine max), sesame (Sesamum indicum) and peanut (Arachis hypogaea). These unigenes were annotated by a sequence similarity search against databases including TAIR, NR protein database, Gene Ontology, COG, Swiss-Prot, TrEMBL and Kyoto Encyclopedia of Genes and Genomes (KEGG). Five genome-scale metabolic networks that contain different numbers of metabolites and gene–enzyme reaction–association entries were analysed and constructed using Cytoscape and yEd programs. Details of unigene entries, deduced amino acid sequences and putative annotation are available from our database to browse, search and download. Intuitive and graphical representations of EST/unigene sequences, functional annotations, metabolic pathways and metabolic networks are also available. ocsESTdb will be updated regularly and can be freely accessed at http://ocri-genomics.org/ocsESTdb/.
ocsESTdb may serve as a valuable and unique resource for comparative analysis of acyl lipid synthesis and metabolism in oilseed plants. It also may provide vital insights into improving oil content in seeds of oil crop species by transcriptional reconstruction of the metabolic network.
Oil crop seeds are important sources of fatty acids (FAs) and proteins for human and animal nutrition as well as for non-dietary uses [1]. A major goal of oil crop seed research is to engineer seeds with enhanced oil quantity and quality, and this goal has prompted efforts to better understand the processes involved in seed metabolism, especially the accumulation of storage products [2]. There are four major oil crops with different oil content in seeds: rapeseed (Brassica napus), soybean (Glycine max), sesame (Sesamum indicum) and peanut (Arachis hypogaea). Accumulation levels of seed storage compounds, such as triacylglycerol (TAG), proteins and carbohydrates, show significant species-specific variations. Sesame and peanut have heterotrophic oilseeds (non-green oilseeds) that contain up to 60% FAs of dry seed, whereas soybean and rapeseed have autotrophic oilseeds (green seeds) that contain up to 20% and 40% FAs of dry seed, respectively [3]. Although the non-green seeds of sesame and peanut accumulate oil without the benefit of photophosphorylation, they have the highest oil content among oilseeds. This suggests that non-green and green seeds differ in many respects, including carbon flow, carbon recapture and ATP and NADPH production [4,5].
cDNA and/or genome sequence data of these important oil crops are becoming publicly available. To date, the soybean reference genome has been released, and progress has been made in genome sequencing projects for peanut (the International Peanut Genome Initiative [IPGI]), sesame [6,7] and Brassica napus [8]. Large-scale expressed sequence tag (EST) collections are also making valuable contributions to the investigation of genetic traits of crops. A total of 2,328,985 EST sequence entries are available in the public database (dbEST database of NCBI, as of October 2013) for the important oil crops B. napus (643,944), G. max (1,461,723), S. indicum (44,820) and A. hypogaea (178,498). However, these huge data sets are under-utilised because informatics databases are scarce. Only a handful of such resources are currently available that provide high-level analysis of crop functional genomics in searchable form [9-11].
Genome-scale metabolic network models have been successfully used to describe metabolic processes in various microbial organisms [12]. These system-based frameworks enable systematic biological studies and have the potential to contribute to metabolic engineering. Reconstruction of a complete, genome-scale metabolic network is usually based on annotated genomic sequences [13], but the activities of many proteins and enzymes are highly tissue-specific, and therefore, metabolic networks should be tissue-specific as well [14]. The biochemical pathways and metabolism in specific plant tissues are more complicated than those in bacteria. For example, during the development of oilseeds, the synthesis of large quantities of stored TAG relies on sucrose and hexose transport from the mother plant. Recent studies have revealed that a broad range of metabolites are taken up and utilised by plastids for FA synthesis; this process depends on the plant species, organ and stage of development [15]. As a result, it is essential to construct whole-genome metabolic networks for the oilseeds of multiple oil crops at specific developmental stages, based on genome-wide data. Based on such a resource, protein coding sequences (CDSs) can be identified, annotated by Enzyme Commission (EC) numbers, and linked to specific biochemical reactions. The reactions can then be connected and further interpreted as a network and analysed using the Cytoscape program [16].
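The step from EC-annotated unigenes to a connected network can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build ocsESTdb: the unigene IDs, the two-entry reaction table and the function names (build_metabolite_network, to_sif) are hypothetical, and a real reaction table would come from a resource such as KEGG. The output uses the simple interaction format (SIF), a plain-text edge list that Cytoscape can import.

```python
from collections import defaultdict

# Hypothetical annotation: unigene ID -> EC number (illustrative values only).
UNIGENE_TO_EC = {
    "unigene_0001": "1.2.4.1",  # pyruvate dehydrogenase
    "unigene_0002": "6.4.1.2",  # acetyl-CoA carboxylase
}

# Hypothetical, heavily simplified reaction table: EC number -> (substrates, products).
EC_TO_REACTION = {
    "1.2.4.1": (["pyruvate"], ["acetyl-CoA"]),
    "6.4.1.2": (["acetyl-CoA"], ["malonyl-CoA"]),
}


def build_metabolite_network(unigene_to_ec, ec_to_reaction):
    """Return directed edges (substrate, product) -> set of supporting unigenes."""
    edges = defaultdict(set)
    for unigene, ec in unigene_to_ec.items():
        if ec not in ec_to_reaction:
            continue  # an EC number without a known reaction contributes no edge
        substrates, products = ec_to_reaction[ec]
        for substrate in substrates:
            for product in products:
                edges[(substrate, product)].add(unigene)
    return dict(edges)


def to_sif(edges):
    """Format the edges as SIF lines: source<TAB>relation<TAB>target."""
    return [f"{s}\tconverts_to\t{p}" for (s, p) in sorted(edges)]


network = build_metabolite_network(UNIGENE_TO_EC, EC_TO_REACTION)
sif_lines = to_sif(network)
```

In this sketch the nodes are metabolites and each edge records which unigenes support the reaction, so a species-specific network falls out of the species-specific unigene set.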
Some comprehensive repositories of plant resources have been established. PlantGDB is a popular site for plant genomic and EST data that provides tools and data for plant EST assemblies and genome annotation [17]. The Plant Metabolic Network (PMN) consists of plant metabolic pathway databases [18] (http://www.plantcyc.org/). In recent years, numerous studies on oilseed development and lipid metabolism have integrated extensive data sets. Most studies have focused on the model plant Arabidopsis thaliana and have included projects such as Microarray Analysis of Developing Arabidopsis Seeds [19-21], ARALIP: Arabidopsis Acyl-Lipid Metabolism [22], and Quizzing the Chemical Factories of Oilseeds (NSF-Plant Genome Grant) (http://bioinfolab.unl.edu/oilseeds/databases.html). Each of these databases and websites not only provides information on Arabidopsis seed lipid metabolism and the network of gene expression during Arabidopsis seed filling but also includes EST data and seed transcriptional profiling data of some other oilseed species.
The ‘-omics’ data of oil crops in publicly available databases are usually far from comprehensive and integrated. Comparative analyses between oil crop tissues to identify species- or tissue-specific genes involved in lipid and oil metabolism are absent. There is also a lack of databases that bring oil crop species together with annotations based on comparative genomics. To understand the molecular metabolism involved in oil crop propagation, the accumulation of storage products and oil biosynthesis, we collected EST sequences on a large scale from seeds at different developmental stages for rapeseed, soybean, sesame and peanut. To make the EST sequences and full-length CDSs from seeds of these four oil crops accessible and to facilitate research on comparative metabolic networks, we constructed a new database called ‘ocsESTdb’ (oil crop seed EST database), which holds seed EST sequences and metabolic networks of the four oil crop species and has three objectives. The first objective is to provide large-scale EST sequences and complete amino acid sequences from full-length CDSs, together with information on clusters, annotations and pathways. The second objective is to provide comparative annotations and metabolic pathways of the four oil crops. The third objective is to develop a genome-scale metabolic network model based on the large-scale sequencing of oilseeds at different developmental stages. The ocsESTdb database integrates knowledge of seed EST sequences and full-length CDSs from the seeds of four oil crops with reconstructions of the metabolic network, offering insights into comparative oil crop genomics. ocsESTdb can be accessed via the Web interface at http://www.ocri-genomics.org/ocsESTdb/.
Construction and content
Raw data source for EST sequences
Raw data processing and clustering analysis
Using SMART techniques (Clontech), three normalised cDNA libraries enriched in full-length sequences were constructed for the generation of ESTs. mRNA was isolated from immature seeds of three high-oil content cultivars (soybean, peanut and sesame) at three distinct oil accumulation stages after pollination [23-25]. The cDNA library of B. napus was constructed from immature seeds (2 weeks after flowering) of two rapeseed lines, B. napus cv. ZY036 (high-oil content, HO) and B. napus cv. 51070 (low-oil content, LO), and sequenced by 454 sequencing [26] (Table 1). Quality control of raw DNA sequences was performed using the Phred program to remove sub-standard reads and vector and adapter sequences, followed by EST-trimmer (http://pgrc.ipk-gatersleben.De/misa/download/est_trimmer.pl) to eliminate 3' polyA tails and EST reads shorter than 100 bp. After low-quality sequences had been screened out and vector sequences trimmed, the Phrap program was used to cluster the overlapping ESTs into contigs. Groups that contained only one sequence were classified as singletons.
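The trimming and classification steps above can be illustrated with a short sketch. It is not a reimplementation of Phred, EST-trimmer or Phrap: the function names are hypothetical, the polyA rule (a run of at least 10 A residues at the 3' end) is an assumption, and the 100 bp figure is treated here as a minimum read length. Clustering itself is left to an assembler; the sketch only separates its output into contigs and singletons.

```python
import re

MIN_LENGTH = 100  # assumed minimum read length in bp after trimming

# Assumed polyA rule: a run of 10 or more A residues at the 3' end of the read.
POLY_A_TAIL = re.compile(r"A{10,}$", re.IGNORECASE)


def trim_poly_a(seq):
    """Remove a 3' polyA tail if one is present."""
    return POLY_A_TAIL.sub("", seq)


def clean_reads(reads):
    """Trim polyA tails and drop reads shorter than MIN_LENGTH.

    reads: dict mapping read ID -> nucleotide sequence.
    """
    cleaned = {}
    for read_id, seq in reads.items():
        trimmed = trim_poly_a(seq)
        if len(trimmed) >= MIN_LENGTH:
            cleaned[read_id] = trimmed
    return cleaned


def split_clusters(clusters):
    """Split assembler output into contigs (>1 read) and singletons (1 read).

    clusters: dict mapping cluster ID -> list of member read IDs.
    """
    contigs = {c: m for c, m in clusters.items() if len(m) > 1}
    singletons = {c: m for c, m in clusters.items() if len(m) == 1}
    return contigs, singletons


# Toy input: one read long enough to keep, one too short after trimming.
example_reads = {
    "read_1": "ACGT" * 30 + "A" * 15,
    "read_2": "ACGT" * 10 + "A" * 20,
}
kept = clean_reads(example_reads)
contigs, singletons = split_clusters({"c1": ["read_1", "read_3"], "c2": ["read_4"]})
```

Contigs and singletons together make up the unigene set that is passed on to annotation.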
Comprehensive annotation of oil crop unigenes
Table: Summary of expressed sequence tags (ESTs) from the five oil crop seed cDNA libraries
Columns: No. of sequences generated; No. of high-quality sequences; Average size of high-quality sequences (bp); No. of unigenes; No. of ORFs; Full-length genes (%). Rows include B. napus (HO) and B. napus (LO).
Table: Statistics of annotation results for unigenes from the five oil crop seed cDNA libraries
Columns: No. of unigenes; Hit to A. thaliana (%); Hit to Nr (%); Hit to Swiss-Prot (%); Hit to TrEMBL (%); Annotated to GO (%). Rows include B. napus (LO) and B. napus (HO).
Reconstruction of a global metabolic network of four oil crops
The ocsESTdb database provides a user-friendly interface that is divided into five main functional tabs: ‘Home’, ‘Browse’, ‘Search’, ‘Document’ and ‘Help’. Each tab lets users retrieve information on oil crop seed ESTs or unigenes from the database, or view them in the context of either the acyl-lipid metabolism pathways or the networks constructed from the metabolites associated with oil crop seed unigenes.
Main user-friendly interfaces provided by ocsESTdb
In the network, different colours represent different types of metabolism, and each node represents a metabolite that may be involved in that metabolism.
The ocsESTdb database also documents the pipeline of data processing and database construction, statistics on the data collected in the database, and literature and open resources in this field. Users can employ the ‘Help’ functional unit to access and download EST and unigene sequences, annotations and pathways.
General search in database by names or identifiers
The ocsESTdb database provides a full-featured search function through its search module. Entering the ID of a specific unigene returns detailed annotation information on that unigene, and entering a GO term, InterPro entry or COG ID returns the corresponding unigenes from the different species. For comparative research and analysis, users can enter a pathway entry to find the unigenes that participate in that pathway in each species.
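The kind of lookup described above can be sketched with an inverted index that maps an annotation term to unigenes grouped by species. The record layout, the unigene IDs, the COG identifiers and the function names are hypothetical and serve only to illustrate the idea; they do not describe the actual ocsESTdb schema.

```python
from collections import defaultdict

# Hypothetical annotation records (IDs and annotation values are illustrative only).
RECORDS = [
    {"id": "Bn_0001", "species": "B. napus", "go": ["GO:0006633"], "cog": "COG0001"},
    {"id": "Gm_0001", "species": "G. max", "go": ["GO:0006633"], "cog": "COG0001"},
    {"id": "Si_0001", "species": "S. indicum", "go": ["GO:0008152"], "cog": "COG0002"},
]


def build_index(records, field):
    """Build an inverted index: term -> species -> list of unigene IDs."""
    index = defaultdict(lambda: defaultdict(list))
    for rec in records:
        values = rec[field]
        if isinstance(values, str):
            values = [values]  # single-valued fields are treated as one-item lists
        for value in values:
            index[value][rec["species"]].append(rec["id"])
    return index


def search(index, term):
    """Return {species: [unigene IDs]} for a term, or an empty dict if absent."""
    return {species: list(ids) for species, ids in index.get(term, {}).items()}


go_index = build_index(RECORDS, "go")
cog_index = build_index(RECORDS, "cog")
fa_hits = search(go_index, "GO:0006633")
```

Grouping results by species is what makes the same query useful for comparison: one term returns the matching unigenes from each crop side by side.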
Searching sequence similarity using BLAST
For sequence similarity searches, ocsESTdb supplies a customised BLAST search built on the standard NCBI BLAST module, so that users can retrieve similar or identical sequences from the database. Users can submit a nucleic acid or amino acid sequence, either by pasting it directly or by uploading a file, and match it against the oil crop seed EST or unigene databases from B. napus, G. max, S. indicum and A. hypogaea. By comparing a query against the data deposited in ocsESTdb, users can quickly obtain annotations for their sequences.
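For readers who run a comparable search locally, the sketch below picks the best hit per query from BLAST tabular output. It assumes the standard 12-column tabular format produced by NCBI BLAST+ with -outfmt 6; the output format of the ocsESTdb web search is not described here, and the example rows, subject IDs and the 1e-5 E-value cutoff are invented for illustration.

```python
import csv
import io

# Column order of the default NCBI BLAST+ tabular output (-outfmt 6).
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query ID: (subject ID, bit score)} for the top-scoring hit per query.

    Hits with an E-value above max_evalue (an assumed cutoff) are ignored.
    """
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore)
    return best


# Invented example rows; subject IDs are hypothetical unigene names.
EXAMPLE = (
    "query1\tGm_0001\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t350\n"
    "query1\tBn_0001\t85.0\t180\t27\t0\t1\t180\t5\t184\t1e-40\t150\n"
    "query2\tSi_0001\t60.0\t50\t20\t0\t1\t50\t1\t50\t0.5\t30\n"
)
top = best_hits(EXAMPLE)
```

The best subject ID can then be used as the key for retrieving the unigene's annotation.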
ocsESTdb collects oil crop seed ESTs and unigenes from B. napus, G. max, S. indicum and A. hypogaea and supplies a public resource for researchers to comparatively analyse and investigate oil accumulation metabolism. Analyses of these four oilseed EST sets have helped to identify similar and different gene expression profiles during seed development. The BLAST and annotation results can serve as an example for comparing functional genes among the four oilseeds. The comparative results of COG annotation and functional genes of the four oilseeds can be found on the ‘Statistics’ pages. There is an obvious difference in the ratios of functional categories between the green and non-green seeds, especially in the metabolism-related gene category. The two non-green seeds (sesame and peanut) have the same ratios in all categories. The ratio of the ‘metabolism’ category in the non-green seeds of these two species was approximately twice that in soybean seeds. The four oil crops have similar ratios of ‘lipid transport and metabolism’.
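The category ratios compared above are simple proportions, as the following sketch shows. The unigene counts are invented to make the arithmetic easy to follow and do not reproduce the statistics reported in ocsESTdb; the function names are hypothetical.

```python
def category_fractions(counts):
    """Convert per-category unigene counts into fractions of the annotated total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def fold_difference(fractions_a, fractions_b, category):
    """Ratio of one category's fraction in data set A to its fraction in data set B."""
    return fractions_a[category] / fractions_b[category]


# Invented counts for illustration only (not values from the database).
non_green_counts = {"metabolism": 400, "other": 600}
green_counts = {"metabolism": 200, "other": 800}

non_green = category_fractions(non_green_counts)
green = category_fractions(green_counts)
metabolism_fold = fold_difference(non_green, green, "metabolism")
```

Comparing fractions rather than raw counts matters because the libraries differ greatly in size, so raw counts would mostly reflect sequencing depth.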
Table: Distribution of reactions of the inferred genome-wide metabolic network in different functional categories
Columns include B. napus (LO) and B. napus (HO). Functional categories (rows) include: Amino acid metabolism; Metabolism of other amino acids; Glycan biosynthesis and metabolism; Biosynthesis of polyketides and nonribosomal peptides; Metabolism of cofactors and vitamins; Biosynthesis of secondary metabolites; Xenobiotics biodegradation and metabolism.
Table: Pathway and unigene statistics of the five oilseed metabolic networks
Columns: Unigenes in total pathways; Unigenes in fatty acid metabolism pathway; Reactions involved in acetyl-CoA; Reactions involved in pyruvate. Rows include B. napus (HO) and B. napus (LO).
The ocsESTdb database is the first integrated comparative analysis database of EST sequences from the seeds of four oil crops. It supplies a user-friendly interface through which data can be freely accessed and downloaded. ocsESTdb is a uniquely comprehensive oil crop seed EST database that includes extensive information on unigenes representing the characteristics of oil crops in terms of oil content, and it supplies the information needed to investigate the properties of oil crop genes at the molecular and functional levels. The database is also a tool for information retrieval, visualisation and management. The large set of full-length cDNA clones from oil crops reported in this study will serve as a useful resource for gene discovery and will aid in the precise annotation of oil crop genomes. In addition, the database serves as a platform to visualise and analyse ‘omics’ data. The overall topology of the metabolic networks provides insight into the properties of each network, whereas flux analysis permits phenotype predictions at the metabolic level to guide metabolic engineering. The ocsESTdb database will supply a model from which new, non-trivial hypotheses for exploring plant metabolism can be derived. Integrating large EST collections, metabolic pathways and metabolic network data from developing oil crop seeds gives insight into the comparative metabolic networks responsible for the synthesis and metabolism of seed oil, and into how they differ between green and non-green oilseeds.
Availability and requirements
The database is freely available at: http://www.ocri-genomics.org/ocsESTdb/. All data sets are free to use and can be downloaded via the Web interface. There are no restrictions on the use of the database or of any of the stored data sets.
This work was financially supported by grants from the China National Basic Research Program (2011CB109300), the Genetically modified organisms breeding major projects (2009ZX08004-002B), the Open Project of Key Laboratory for Oil Crops Biology, the Ministry of Agriculture, PR China (201202) and the Core Research Budget of the Non-profit Governmental Research Institution.
1. Harwood JL. What's so special about plant lipids? In: Harwood JL, editor. Plant lipid biosynthesis: Fundamentals and agricultural applications. Cambridge: Cambridge University Press; 1998. p. 1–26.
2. Thelen JJ, Ohlrogge JB. Metabolic engineering of fatty acid biosynthesis in plants. Metab Eng. 2002;4(1):12–21.
3. Weiss EA. Oilseed Crops. 2nd ed. Oxford; Malden, MA: Blackwell Science; 2000.
4. Houston NL, Hajduch M, Thelen JJ. Quantitative proteomics of seed filling in castor: comparison with soybean and rapeseed reveals differences between photosynthetic and nonphotosynthetic seed metabolism. Plant Physiol. 2009;151(2):857–68.
5. Schwender J, Goffman F, Ohlrogge JB, Shachar-Hill Y. Rubisco without the Calvin cycle improves the carbon efficiency of developing green seeds. Nature. 2004;432(7018):779–82.
6. Zhang H, Miao H, Wang L, Qu L, Liu H, Wang Q, et al. Genome sequencing of the important oilseed crop Sesamum indicum L. Genome Biol. 2013;14(1):401.
7. Wang L, Yu S, Tong C, Zhao Y, Liu Y, Song C, et al. Genome sequencing of the high oil crop sesame provides insight into oil biosynthesis. Genome Biol. 2014;15(2):R39.
8. Chalhoub B, Denoeud F, Liu S, Parkin IA, Tang H, Wang X, et al. Plant genetics. Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome. Science. 2014;345(6199):950–3.
9. Cheng KC, Stromvik MV. SoyXpress: a database for exploring the soybean transcriptome. BMC Genomics. 2008;9:368.
10. Grant D, Nelson RT, Cannon SB, Shoemaker RC. SoyBase, the USDA-ARS soybean genetics and genomics database. Nucleic Acids Res. 2010;38(Database issue):D843–846.
11. Wu GZ, Shi QM, Niu Y, Xing MQ, Xue HW. Shanghai RAPESEED Database: a resource for functional genomics studies of seed development and fatty acid metabolism of Brassica. Nucleic Acids Res. 2008;36(Database issue):D1044–1047.
12. Francke C, Siezen RJ, Teusink B. Reconstructing the metabolic network of a bacterium from its genome. Trends Microbiol. 2005;13(11):550–8.
13. Ma H, Zeng AP. Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics. 2003;19(2):270–7.
14. Hao T, Ma HW, Zhao XM, Goryanin I. The reconstruction and analysis of tissue specific human metabolic networks. Mol Biosyst. 2012;8(2):663–70.
15. Rawsthorne S. Carbon flux and fatty acid synthesis in plants. Prog Lipid Res. 2002;41(2):182–96.
16. Cline MS, Smoot M, Cerami E, Kuchinsky A, Landys N, Workman C, et al. Integration of biological networks and gene expression data using Cytoscape. Nat Protoc. 2007;2(10):2366–82.
17. Dong Q, Lawrence CJ, Schlueter SD, Wilkerson MD, Kurtz S, Lushbough C, et al. Comparative plant genomics resources at PlantGDB. Plant Physiol. 2005;139(2):610–8.
18. Bassel GW, Gaudinier A, Brady SM, Hennig L, Rhee SY, De Smet I. Systems analysis of plant functional, transcriptional, physical interaction, and metabolic networks. Plant Cell. 2012;24(10):3859–75.
19. Ruuska SA, Girke T, Benning C, Ohlrogge JB. Contrapuntal networks of gene expression during Arabidopsis seed filling. Plant Cell. 2002;14(6):1191–206.
20. White JA, Todd J, Newman T, Focks N, Girke T, de Ilarduya OM, et al. A new set of Arabidopsis expressed sequence tags from developing seeds. The metabolic pathway from carbohydrates to seed oil. Plant Physiol. 2000;124(4):1582–94.
21. Girke T, Todd J, Ruuska S, White J, Benning C, Ohlrogge J. Microarray analysis of developing Arabidopsis seeds. Plant Physiol. 2000;124(4):1570–81.
22. Li-Beisson Y, Shorrosh B, Beisson F, Andersson MX, Arondel V, Bates PD, et al. Acyl-lipid metabolism. Arabidopsis Book. 2013;11:e0161.
23. Ke T, Dong C, Mao H, Zhao Y, Chen H, Liu H, et al. Analysis of expression sequence tags from a full-length-enriched cDNA library of developing sesame seeds (Sesamum indicum). BMC Plant Biol. 2011;11:180.
24. Huang J, Yan L, Lei Y, Jiang H, Ren X, Liao B. Expressed sequence tags in cultivated peanut (Arachis hypogaea): discovery of genes in seed development and response to Ralstonia solanacearum challenge. J Plant Res. 2012;125(6):755–69.
25. Sha AH, Li C, Yan XH, Shan ZH, Zhou XA, Jiang ML, et al. Large-scale sequencing of normalized full-length cDNA library of soybean seed at different developmental stages and analysis of the gene expression profiles based on ESTs. Mol Biol Rep. 2012;39(3):2867–74.
26. Hu Z, Huang S, Sun M, Wang H, Hua W. Development and application of single nucleotide polymorphism markers in the polyploid Brassica napus by 454 sequencing of expressed sequence tags. Plant Breeding. 2012;131(2):293–9.
27. Gordon D, Green P. Consed: a graphical editor for next-generation sequencing. Bioinformatics. 2013;29(22):2936–7.
28. Tatusov RL, Galperin MY, Natale DA, Koonin EV. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000;28(1):33–6.
29. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008;36(Database issue):D480–484.
30. Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005;21(18):3674–6.
31. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009;37(Database issue):D211–215.
32. Zdobnov EM, Apweiler R. InterProScan–an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001;17(9):847–8.
33. Ye J, Fang L, Zheng H, Zhang Y, Chen J, Zhang Z, et al. WEGO: a web tool for plotting GO annotations. Nucleic Acids Res. 2006;34:W293–297.
34. Sun J, Lu X, Rinas U, Zeng AP. Metabolic peculiarities of Aspergillus niger disclosed by comparative metabolic genomics. Genome Biol. 2007;8(9):R182.
35. Ma H, Zeng AP. The connectivity structure, giant strong component and centrality of metabolic networks. Bioinformatics. 2003;19(11):1423–30.
36. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.