Pepper EST database: comprehensive in silico tool for analyzing the chili pepper (Capsicum annuum) transcriptome
- Hyun-Jin Kim†1,
- Kwang-Hyun Baek†2,
- Seung-Won Lee1,
- JungEun Kim1,
- Bong-Woo Lee1,
- Hye-Sun Cho3,
- Woo Taek Kim4,
- Doil Choi5 and
- Cheol-Goo Hur1
© Kim et al; licensee BioMed Central Ltd. 2008
Received: 23 July 2008
Accepted: 09 October 2008
Published: 09 October 2008
There is no dedicated database available for Expressed Sequence Tags (EST) of the chili pepper (Capsicum annuum), although the interest in a chili pepper EST database is increasing internationally due to the nutritional, economic, and pharmaceutical value of the plant. Recent advances in high-throughput sequencing of the ESTs of chili pepper cv. Bukang have produced hundreds of thousands of complementary DNA (cDNA) sequences. Therefore, a chili pepper EST database was designed and constructed to enable comprehensive analysis of chili pepper gene expression in response to biotic and abiotic stresses.
We built the Pepper EST database to mine the complexity of chili pepper ESTs. The database was built on 122,582 sequenced ESTs and 116,412 refined ESTs from 21 pepper EST libraries. The ESTs were clustered and assembled into virtual consensus cDNAs and the cDNAs were assigned to metabolic pathway, Gene Ontology (GO), and MIPS Functional Catalogue (FunCat). The Pepper EST database is designed to provide a workbench for (i) identifying unigenes in pepper plants, (ii) analyzing expression patterns in different developmental tissues and under conditions of stress, and (iii) comparing the ESTs with those of other members of the Solanaceae family. The Pepper EST database is freely available at http://genepool.kribb.re.kr/pepper/.
The Pepper EST database is expected to provide a high-quality resource, which will contribute to gaining a systemic understanding of plant diseases and facilitate genetics-based population studies. The database is also expected to contribute to analysis of gene synteny as part of the chili pepper sequencing project by mapping ESTs to the genome.
Pepper is a member of the family Solanaceae, which is one of the largest families in the plant kingdom and includes more than 3,000 species. The Solanaceae family includes important crops, such as pepper, tomato, tobacco, potato, and eggplant, and has long been cultivated for human nutrition and health. Capsicum species are consumed worldwide and are valued for their unique color, pungency, and aroma. Capsicum peppers include C. annuum, C. chinense, C. baccatum, C. frutescens, and C. pubescens, which are cultivated in different parts of the world. Of these, varieties of C. annuum, which has a modest-sized diploid genome (2n = 24), are the most heavily consumed because of their nutritional value and spicy taste. The chemical primarily responsible for the pungency of C. annuum is capsaicin, which elicits numerous biological effects and is the target of extensive investigation.
Expressed Sequence Tags (ESTs) are short subsequences derived from randomly isolated cDNAs. With the advent of massive computational and biostatistical analysis, large-scale EST data sets can be efficiently analyzed to monitor gene expression [5–7]. ESTs from vegetable plants expand our knowledge of the genetic control of complex traits, and the findings are applied in the agricultural industry to screen for ecologically important phenotypes and to reduce plant disease. EST databases also provide comparative data for analyses of organisms that lack comparable genomic resources.
The development of automated high-throughput chili pepper EST sequencing projects in Korea has generated hundreds of thousands of EST sequences. Previous studies indicate that EST databases provide valid and reliable data for understanding gene expression and for gene mining. Databases have been constructed for ESTs accumulated for tomato species to permit scoring of gene expression patterns in silico; these include the Tomato Stress EST (TSED), Micro-Tom (MiBASE), and TomatEST databases. Two pepper EST databases have been constructed: the DFCI pepper gene index and Pepper unigene at the Sol Genomics Network. Those databases were built on approximately 31,000 EST sequences, of which around 21,000 were provided by our group. However, there has been a growing international need for a more comprehensive chili pepper EST database that enables extensive digital analysis of gene expression in pepper species, because of the increasing interest in pepper's nutritional and pharmaceutical properties, as well as its spicy taste.
In this report, we present the Pepper EST database, a web-based database of chili pepper ESTs. Pepper EST contains significantly more ESTs than existing databases (122,582 ESTs vs approximately 31,000 ESTs) and provides several advanced features, such as linking ESTs to their digital expression data. We constructed Pepper EST as a pipeline for comprehensive analysis of expressed gene data. The database contains (i) raw sequence data; (ii) high-quality consensus sequences obtained from the assembly phase; (iii) tissue-specific ESTs; (iv) full-length cDNAs; and (v) functional annotation and assignment to metabolic pathways based on BLAST similarity searches. The unique feature of the Pepper EST database is its data set: the ESTs were derived from cDNAs from different tissues of plants of a single chili pepper variety, grown under constant growth conditions with exposure to a variety of stress agents.
The current release includes 116,412 refined ESTs from 122,582 sequenced ESTs from 21 chili pepper libraries. All libraries were constructed to represent 11 different tissues, developmental stages, or conditions of stress. Messenger RNA (mRNA) for constructing the cDNA libraries was extracted from plants of a single variety (C. annuum cv. Bukang) grown under the same conditions, including temperatures of 25°C/18°C (day/night) and a 16 h photoperiod.
EST sequences may contain a variety of contaminants, which should be removed before the sequences are used. We use Phred to extract high-quality regions from raw sequence data. The Cross_Match program is used to mask contaminant and vector (pBluescriptSK-) sequences. A trimming script written in Python removes the masked vector, linker sequences (EcoRI and XhoI), and polyA/T regions. In addition, all short ESTs (< 100 bp) are eliminated because these are considered non-informative for EST analysis.
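The trimming step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' script: it assumes that Cross_Match has masked vector and contaminant bases with `X`, uses the EcoRI and XhoI recognition sites (`GAATTC`, `CTCGAG`) as stand-ins for the actual linker sequences, and treats a run of at least 10 A or T bases at a read end as a polyA/T region. The names `trim_est`, `LINKER_SITES`, and `MIN_POLY_RUN` are hypothetical; only the 100 bp length cut-off comes from the text.

```python
import re

MIN_LENGTH = 100                        # from the text: ESTs < 100 bp are discarded
LINKER_SITES = ("GAATTC", "CTCGAG")     # assumption: EcoRI / XhoI recognition sites
MIN_POLY_RUN = 10                       # assumption: minimum polyA/T run to trim

_POLY_5 = re.compile(r"^(?:A{%d,}|T{%d,})" % (MIN_POLY_RUN, MIN_POLY_RUN))
_POLY_3 = re.compile(r"(?:A{%d,}|T{%d,})$" % (MIN_POLY_RUN, MIN_POLY_RUN))


def longest_unmasked(seq):
    """Return the longest stretch that is not masked with X."""
    return max(re.split(r"X+", seq), key=len)


def strip_linkers(seq):
    """Remove linker sites from both ends until none remain."""
    changed = True
    while changed:
        changed = False
        for site in LINKER_SITES:
            if seq.startswith(site):
                seq = seq[len(site):]
                changed = True
            if seq.endswith(site):
                seq = seq[:-len(site)]
                changed = True
    return seq


def trim_est(raw_seq):
    """Trim one masked EST; return None if it ends up shorter than MIN_LENGTH."""
    seq = longest_unmasked(raw_seq.upper())
    seq = strip_linkers(seq)
    seq = _POLY_5.sub("", seq)          # leading polyA/T run
    seq = _POLY_3.sub("", seq)          # trailing polyA/T run
    return seq if len(seq) >= MIN_LENGTH else None
```

A read that `trim_est` returns as `None` would be excluded from the refined set; all other reads would pass to clustering.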
Clustering and assembly
After quality filtering and vector trimming, the high-quality EST sequences are assembled into contigs to reduce inherent redundancy and to build unigene sets. Only EST sequences sharing > 96% identity over a region longer than 100 nucleotides (nts) are selected and grouped into clusters. EST sequences in a cluster are usually taken to represent the same gene; therefore, each cluster is treated as a gene index. During clustering and assembly, however, the ESTs within a single cluster often yield more than one consensus sequence. Possible explanations for multiple contigs within a cluster include (i) alternative splicing, (ii) the existence of common protein domains, or (iii) paralogy. The actual reasons remain to be determined, since genome sequence data for chili pepper are not yet available. The sequence assembly process is performed using stackPACK™, which clusters and assembles EST sequences into contigs and singletons.
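The grouping criterion can be illustrated with a short Python sketch. It is not the stackPACK™ algorithm: it assumes that pairwise alignments have already been computed by some aligner and simply applies single-linkage grouping (union-find) to pairs that pass the two thresholds given in the text. The function name `cluster_ests` and the tuple layout of `pairwise_hits` are hypothetical.

```python
from collections import defaultdict

MIN_IDENTITY = 96.0   # percent identity, from the text (> 96%)
MIN_OVERLAP = 100     # aligned length in nucleotides, from the text (> 100 nts)


def cluster_ests(est_ids, pairwise_hits):
    """Group ESTs into clusters by single linkage.

    pairwise_hits: iterable of (est_a, est_b, percent_identity, overlap_nt),
    assumed to come from a separate pairwise alignment step.
    """
    parent = {est: est for est in est_ids}

    def find(est):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[est] != est:
            parent[est] = parent[parent[est]]
            est = parent[est]
        return est

    for est_a, est_b, identity, overlap in pairwise_hits:
        if identity > MIN_IDENTITY and overlap > MIN_OVERLAP:
            parent[find(est_a)] = find(est_b)

    clusters = defaultdict(list)
    for est in est_ids:
        clusters[find(est)].append(est)
    return sorted(sorted(members) for members in clusters.values())
```

In this sketch, an EST with no qualifying partner ends up alone in its cluster, which corresponds to a singleton.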
Table 1. Parameters of the Pepper EST database (parameters of Pepper EST clusters and contigs). Rows: Pepper raw ESTs; Pepper ESTs refined; gene indices (clusters); full-length cDNA clues (with E-value ≤ 1e-10).
Functional annotation is performed using BLAST. The filtered ESTs and assembled EST contigs are compared with the UniProt databases containing all plant protein data using the BLASTX program, with an E-value cut-off of less than 1e-3. For successful matches, only the top five hits and their alignment results are stored, annotated, and reported in the Pepper EST database. When the subject accession number matches an entry in the Gene Ontology (GO) database, the corresponding classification is included to provide additional information on putative function. Subject accession numbers that carry an Enzyme Commission (EC) number are also mapped onto known KEGG metabolic pathways. A total of 5,685 putative full-length cDNA clues are identified in our chili pepper EST-derived contig and singleton data. The start codon and protein coding region are indicated in the "CDS candidate" feature on the website. We use the TargetIdentifier algorithm, which does not require "training" with previously known sequences and uses only the BLASTX output.
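The hit-filtering rule (E-value below 1e-3, top five hits per query) can be sketched in Python as follows. The sketch assumes that the BLASTX results are available in the standard 12-column tabular format, in which the E-value and bit score are the 11th and 12th columns; the article does not state which output format was used, and the function name `top_hits` is hypothetical.

```python
import csv
from collections import defaultdict

MAX_EVALUE = 1e-3   # from the text: E-value cut-off
TOP_HITS = 5        # from the text: only the top five hits are kept


def top_hits(lines):
    """Return {query_id: [(subject_id, evalue, bitscore), ...]} with at most
    TOP_HITS entries per query, best (lowest E-value) first.

    lines: iterable of tab-separated rows in 12-column BLAST tabular format
    (assumed input format).
    """
    hits = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue                      # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue < MAX_EVALUE:
            hits[query].append((subject, evalue, bitscore))
    return {
        query: sorted(found, key=lambda hit: (hit[1], -hit[2]))[:TOP_HITS]
        for query, found in hits.items()
    }
```

Queries with no hit below the cut-off do not appear in the result and would remain unannotated in this scheme.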
In silico analysis for identification of tissue-specific genes
Table 2. Tissue-specific and selective genes in the Pepper EST database. Columns: tissue; no. of specific genes; no. of selective genes. Rows include pathogen-infected leaf.
Pepper EST is a chili pepper EST database for EST data management and analysis. The Pepper EST database server is composed of a web interface and a MySQL database management system. The web interface is implemented in static HTML pages and PHP scripts for querying the database to allow retrieval of unigenes based on BLASTX hits and other functional annotation results. The MySQL system is used to store the collected sequence information and the analyzed data.
The "Tissue-specific" page contains two tables designated "Tissue Category" and "Audic's test score". Tissue specificity can be analyzed using any combination of a cut-off (below 0.05, 0.01, or 0.001) and an element sequence count (more than one, three, or five). The default parameters are a cut-off below 0.01 and an element sequence count of more than three. The tissue-specific genes identified by Audic's test can be further examined by combining the cut-off (below 0.05, 0.01, or 0.001), the tissue name (all, pathogen-infected leaf, flower, anther, fruit, root, placenta, seed, bark, peduncle, callus, or seedling), and the type (specific or selective) (Table 2).
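For readers unfamiliar with Audic's test, the following Python sketch shows the Audic and Claverie probability of observing y ESTs of a gene in a library of N2 ESTs, given x ESTs of the same gene in a library of N1 ESTs: p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)]. How the Pepper EST database combines libraries and applies the score is not described in the text, so the `is_tissue_specific` helper is only an assumption: it compares one tissue against all other libraries pooled and applies the default cut-off (0.01) and sequence count (more than three) quoted above. All function names are hypothetical.

```python
import math


def audic_claverie_pmf(x, y, n1, n2):
    """Probability of y counts in library 2 (size n2) given x counts in library 1 (size n1)."""
    ratio = n2 / n1
    log_p = (
        y * math.log(ratio)
        + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
        - (x + y + 1) * math.log(1 + ratio)
    )
    return math.exp(log_p)


def audic_claverie_pvalue(x, y, n1, n2):
    """Smaller of the two one-sided tail probabilities P(Y <= y) and P(Y >= y)."""
    lower = sum(audic_claverie_pmf(x, k, n1, n2) for k in range(y + 1))
    upper = max(0.0, 1.0 - lower + audic_claverie_pmf(x, y, n1, n2))
    return min(lower, upper)


def is_tissue_specific(count_in_tissue, count_elsewhere, n_tissue, n_elsewhere,
                       cutoff=0.01, min_count=3):
    """Assumed decision rule: enough ESTs, enriched in the tissue, significant score."""
    if count_in_tissue <= min_count:
        return False
    if count_in_tissue / n_tissue <= count_elsewhere / n_elsewhere:
        return False
    p_value = audic_claverie_pvalue(count_elsewhere, count_in_tissue, n_elsewhere, n_tissue)
    return p_value < cutoff
```

For two libraries of equal size and a gene absent from both, `audic_claverie_pmf(0, 0, n, n)` evaluates to 0.5, which is a convenient sanity check on the formula.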
The "Functional Category" menu contains submenus for Pathway, GO, and MIPS. Pathway provides information on 1,376 enzymes associated with metabolic pathways, comprising Oxidoreductases (370 hits), Transferases (404 hits), Hydrolases (313 hits), Lyases (112 hits), Isomerases (75 hits), and Ligases (102 hits). Enzymes are divided into classes and subclasses according to the guidelines of the Nomenclature Committee of the IUBMB. GO (Gene Ontology) contains information on biological process (4,832 hits), cellular component (4,457 hits), and molecular function (932 hits). In MIPS FunCat, 2,381 genes were classified using the MIPS Functional Catalogue, an analysis based on Arabidopsis genes. The MIPS Functional Catalogue submenu provides a search tool for browsing the functional categories of genes, together with the gene function subgroup, the consensus ID, and the sequences.
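The enzyme classes listed above correspond to the first digit of the EC number. The following Python sketch shows that mapping and a simple tally; it assumes EC numbers are available as dotted strings such as `2.7.11.1`, and the names `ec_class`, `tally_classes`, and `REPORTED_HITS` are hypothetical. The counts in `REPORTED_HITS` are those quoted in the text.

```python
from collections import Counter

# Top-level EC classes as used in the text (first digit of the EC number).
EC_CLASSES = {
    "1": "Oxidoreductases",
    "2": "Transferases",
    "3": "Hydrolases",
    "4": "Lyases",
    "5": "Isomerases",
    "6": "Ligases",
}

# Hit counts per class as reported in the text.
REPORTED_HITS = {
    "Oxidoreductases": 370,
    "Transferases": 404,
    "Hydrolases": 313,
    "Lyases": 112,
    "Isomerases": 75,
    "Ligases": 102,
}


def ec_class(ec_number):
    """Return the enzyme class name for a dotted EC number string."""
    return EC_CLASSES[ec_number.split(".")[0]]


def tally_classes(ec_numbers):
    """Count how many EC numbers fall into each enzyme class."""
    return Counter(ec_class(ec) for ec in ec_numbers)


# The per-class counts add up to the 1,376 enzymes stated in the text.
assert sum(REPORTED_HITS.values()) == 1376
```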
The "Search" menu holds three tab-style submenus for querying: "Annotated data", "Raw EST sequences", and "BLAST". The "Annotated data" submenu brings the user to a page where it is possible to search by keyword and other parameters. "Raw EST sequences" allows the user to search and download individual pre-processed EST sequences or all sequences of a cDNA library. Users can use the BLAST search to compare their own sequences with the in-house sequences in the Pepper EST database. The search results contain links to singleton and consensus sequences.
The Pepper EST database is the first of its kind in that its ESTs were derived from cDNAs from different tissues of plants of a single chili pepper variety, grown in one laboratory under constant growth conditions with exposure to a variety of stress agents. Because the database incorporates sequence data covering tissue-specific expression patterns, developmental stages, and both normal and stress conditions, we expect Pepper EST to provide a comprehensive in silico tool for analyzing numerous biological parameters in chili pepper plants.
The Pepper EST database is a chili pepper-specific workbench for investigation of EST data. It shares a subset of its ESTs (21,000) with Pepper unigene at the Sol Genomics Network; however, because it contains significantly more ESTs and offers more advanced features, the Pepper EST database can provide more extensive information about chili pepper ESTs. The Pepper EST database will provide a high-quality resource for chili pepper EST analysis and for comparative genomics within the family Solanaceae. In the future, it will significantly aid analyses of gene synteny in the chili pepper genome and comparative studies within the family Solanaceae.
The Pepper EST database is freely available to academic researchers at http://genepool.kribb.re.kr/pepper/ after registration on the website and approval for use. The entire content of the database is available for download from the website. Our group will maintain the Pepper EST database continuously and update it annually. Questions, comments, and requests regarding this database should be sent to Dr. Cheol-Goo Hur at firstname.lastname@example.org
We thank Dr. Jeong Mee Park and Dr. Suk-Yoon Kwon of KRIBB for helpful discussions. This work was supported by grants from the Crop Functional Genomics Center (CFGC) and Plant Diversity Resource Center (PDRC), one of the 21st Century Frontier Research Programs of the Ministry of Education, Science, and Technology (MEST) to Cheol-Goo Hur and Doil Choi.
- Knapp S: Tobacco to tomatoes: a phylogenetic perspective on fruit diversity in the Solanaceae. J Exp Bot. 2002, 377: 2001-2022. 10.1093/jxb/erf068.
- Govindarajan VS, Sathyanarayana MN: Capsicum-production, technology, chemistry and quality. Part. V. Impact on physiology, pharmacology, nutrition and metabolism, structure, pungency, pain, and desensitization sequences. Crit Rev Food Sci Nutr. 1991, 29 (6): 435-474.
- Monsereenusorn Y, Kongsamut S, Pezalla PD: Capsaicin-a literature survey. Crit Rev Toxicol. 1982, 10: 321-339. 10.3109/10408448209003371.
- Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC: Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nature Genet. 1993, 4: 373-380. 10.1038/ng0893-373.
- Ewing RM, Kahla AB, Poirot O, Lopez F, Audic S, Claverie JM: Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Res. 1999, 9: 950-959. 10.1101/gr.9.10.950.
- Ogihara Y, Mochida K, Nemoto Y, Murai K, Yamazaki Y, Shin-I T, Kohara Y: Correlated clustering and virtual display of gene expression patterns in the wheat life cycle by large-scale statistical analyses of expressed sequence tags. Plant J. 2003, 33: 1001-1011. 10.1046/j.1365-313X.2003.01687.x.
- Fizames C, Muñis S, Cazettes C, Nacry P, Boucherez J, Gaymard F, Piquemal D, Delorme V, Commes T, Doumas P, Cooke R, Marti J, Sentenac H, Gojon A: The Arabidopsis root transcriptome by serial analysis of gene expression gene identification using the genome sequence. Plant Physiol. 2004, 134: 67-80. 10.1104/pp.103.030536.
- Fei Z, Tang X, Alba RM, White JA, Ronning CM, Martin GB, Tanksley SD, Giovannoni JJ: Comprehensive EST analysis of tomato and comparative genomics of fruit ripening. Plant J. 2004, 40: 47-59. 10.1111/j.1365-313X.2004.02188.x.
- Rensink WA, Lee Y, Liu J, Iobst S, Ouyang S, Buell CR: Comparative analyses of six solanaceous transcriptomes reveal a high degree of sequence conservation and species-specific transcripts. BMC Genomics. 2005, 6: 124. 10.1186/1471-2164-6-124.
- Brady SM, Long TA, Benfey PN: Unraveling the dynamic transcriptome. Plant Cell. 2006, 18: 2101-2111. 10.1105/tpc.105.037572.
- Yano K, Watanabe M, Yamamoto N, Tsugane T, Aoki K, Sakurai N, Shibata D: MiBASE: a database of a miniature tomato cultivar Micro-Tom. Plant Biotechnol J. 2006, 23: 195-198.
- D'Agostino N, Aversano M, Frusciante L, Chiusano ML: TomatEST database: in silico exploitation of EST data to explore expression patterns in tomato species. Nucleic Acids Res. 2007, 35: D901-D905. 10.1093/nar/gkl921.
- DFCI Pepper Gene Index. [http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/gimain.pl?gudb=pepper]
- Sol Genomics Network. [http://www.sgn.cornell.edu]
- Sanghyeob L, Sooyong K, Eunjoo C, Younghee J, Hyunsook P, Cheolgoo H, Doil C: EST and microarray analyses of pathogen-responsive genes in hot pepper (Capsicum annuum L.) non-host resistance against soybean pustule pathogen (Xanthomonas axonopodis pv. glycines). Functional & Integrative Genomics. 2004, 4: 196-205.
- Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8: 175-185.
- Miller RT, Christoffels AG, Gopalakrishnan C, Burke J, Ptitsyn AA, Broveak TR, Hide WA: A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. Genome Res. 1999, 9: 1143-55. 10.1101/gr.9.11.1143.
- Christoffels A, Van GA, Greyling G, Miller R, Hide T, Hide W: STACK: sequence tag alignment and consensus knowledgebase. Nucleic Acids Res. 2001, 29: 234-238. 10.1093/nar/29.1.234.
- Gene Ontology (GO). [http://www.geneontology.org]
- Min XJ, Butler G, Storms R, Tsang A: TargetIdentifier: a webserver for identifying full-length cDNAs from EST sequences. Nucleic Acids Res. 2005, 33: W669-W672. 10.1093/nar/gki436.
- Sanghyeob L, Doil C: Platform of hot pepper defense genomics: isolation of pathogen-responsive genes in hot pepper (Capsicum annuum L.) non-host resistance against soybean pustule pathogen (Xanthomonas axonopodis pv. glycines). Plant Pathol J. 2004, 20: 46-51.
- Audic S, Claverie JM: The significance of digital gene expression profiles. Genome Res. 1997, 7: 986-995.
- Mégy K, Audic S, Claverie JM: Heart-specific genes revealed by expressed sequence tag (EST) sampling. Genome Biol. 2002, 3: research0074.1-research0074.11. 10.1186/gb-2002-3-12-research0074.
- MIPS Functional Catalogue (FunCat). [http://mips.gsf.de/projects/funcat]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.