OryzaPG-DB: Rice Proteome Database based on Shotgun Proteogenomics
© Helmy et al; licensee BioMed Central Ltd. 2011
Received: 18 April 2010
Accepted: 12 April 2011
Published: 12 April 2011
Proteogenomics aims to utilize experimental proteome information for refinement of genome annotation. Since mass spectrometry-based shotgun proteomics approaches provide large-scale peptide sequencing data with high throughput, a data repository for shotgun proteogenomics would represent a valuable source of gene expression evidence at the translational level for genome re-annotation.
Here, we present OryzaPG-DB, a rice proteome database based on shotgun proteogenomics, which incorporates the genomic features of experimental shotgun proteomics data. This version of the database was created from the results of 27 nanoLC-MS/MS runs on a hybrid ion trap-orbitrap mass spectrometer, which offers high accuracy for analyzing tryptic digests from undifferentiated cultured rice cells. Peptides were identified by searching the product ion spectra against the protein, cDNA, transcript and genome databases from Michigan State University, and were mapped to the rice genome. Approximately 3200 genes were covered by these peptides and 40 of them contained novel genomic features. Users can search, download or navigate the database per chromosome, gene, protein, cDNA or transcript and download the updated annotations in standard GFF3 format, with visualization in PNG format. In addition, the database scheme of OryzaPG was designed to be generic and can be reused to host similar proteogenomic information for other species. OryzaPG is the first proteogenomics-based database of the rice proteome, providing peptide-based expression profiles, together with the corresponding genomic origin, including the annotation of novelty for each peptide.
The OryzaPG database was constructed and is freely available at http://oryzapg.iab.keio.ac.jp/.
Among high-throughput experimental methods, genome sequencing represents a turning point in the understanding of biological systems. Nevertheless, the biological significance of the sequenced genome cannot be understood unless the protein-coding genes and their products are accurately identified. Thus, genome annotation has become a major issue [1–3]. Genome annotation is the process of determining gene structure and function, and it usually takes place after genome sequencing and before data deposition in a database or databank [2, 4, 5].
In typical genome annotation work, experimental and computational methods are integrated to analyze the huge volume of sequence data [2, 4, 6, 7]. Thus, genome annotation is highly dependent on the expression evidence, usually transcriptional, provided by experiments and on the algorithms implemented in the computational tools. Consequently, the annotation process suffers from several limitations. For instance, most of the sequenced genomes lack rich transcriptional evidence, e.g., a full-length cDNA library. Even when such information is available, evidence of expression at the transcriptional level does not necessarily imply translation into a protein [8, 9]. Therefore, annotation is highly reliant on de novo annotations of protein-coding genes performed using gene prediction programs [2, 4, 8].
On the other hand, gene/protein prediction tools have proven their usefulness and utility in the annotation process. However, the prediction accuracy varies from one tool/algorithm to another and from one organism to another, depending on the genome complexity [2, 8, 10, 11]. For instance, in the human and Arabidopsis genomes, the prediction accuracy amounted to 50% and ~66%, respectively, indicating the need for better identification and validation methods [11, 12].
Mass spectrometry-based proteomics, as an experimental approach to measure proteins, can provide translation-level expression evidence for the predicted protein-coding genes; this is the so-called proteogenomics approach of using large-scale proteome data in genome annotation refinement [3, 8, 13, 14]. This approach seems the best option for identification and validation of protein-coding genes, or at least a significant portion of them, in an independent and unambiguous way. This can be achieved by detecting the naturally occurring proteins (proteomics) and systematically mapping them back to the genome sequence (genomics) [3, 8, 13, 14]. In addition to validating predicted gene models at the translation level [15, 16], proteogenomics has other useful applications, such as finding new gene models, determining protein start and termination sites, finding and verifying splice isoforms at the protein level, and verifying hypothetical and putative genes/proteins [17, 19]. The results of proteogenomic studies are usually made freely available via specialized databases such as AgBase or are included in databases developed particularly to host data from specific projects, such as the AtProteome database developed to host the Arabidopsis proteogenomics data. Overall, proteogenomics represents a promising approach for application to both completed and newly sequenced genomes.
Rice (Oryza sativa) is one of the most important food crops; almost half of the world's population is estimated to rely totally or partially on it. Moreover, rice is considered a model organism because of its relatively small genome (12 chromosomes and ~370 Mbp) [22, 23]. The whole genome sequence and annotation have been published and updated several times (5 builds for the genome and 6 builds for the annotation to date) [24–26]. However, there has been little attempt to include proteome information in the genome-wide annotation, except for the work of Itoh and colleagues, who used rice proteome data, available through the rice proteome database, to confirm 834 ORFs. The virtual absence of proteome-based genome annotation for rice is possibly due to the absence of accurate and detailed rice proteome information.
Here we present OryzaPG-DB, a rice proteome database based on shotgun proteogenomics. Unlike the currently available rice proteome database, which provides the 2D-PAGE-based proteome, OryzaPG-DB contains peptides obtained from shotgun-based proteomics with their product ion spectra, as well as updated annotations, side by side with the corresponding protein, cDNA, transcript and genomic sequences and information.
Construction and content
Generation of a reference dataset by shotgun proteomics
Summary of the currently available contents in OryzaPG-DB*
Corresponds to MSU/TIGR (6.1) models
Corresponds to MSU/TIGR (6.1) locus/transcript (TU)
All identified non-redundant peptides
Peptides that are not present in the protein database
Genes to be revised
Genes with peptides mapped to novel regions such as intron, exon-intron boundary and non-coding regions
Proteogenomics analysis to find novel genomic features
Next, we performed proteogenomic data analysis using bioinformatics approaches to map the identified peptides back to the genome and find novel genomic features as follows:
▪ Download the original annotation from the MSU genome browser with only the MSU Osa1 Rice Gene Models and MSU Osa1 Rice Loci features selected. The original files can also be obtained from the OryzaPG-DB download page.
▪ Align all peptides identified from the MSU protein, cDNA and transcript databases to their corresponding genomic origin (genomic-unspliced mRNA), using the Basic Local Alignment Search Tool (BLAST). The alignments were performed using a local version of NCBI BLAST (blast2seq) and Perl scripts.
▪ Extract the alignment results of the peptides identified from the MSU genome database directly from MASCOT output files.
▪ Create Perl scripts that read the alignment results and convert them to standard GFF3 format. Each peptide's alignment was converted to a GFF3 line indicating its type, identification source, start, end, parent and OryzaPG-DB peptide ID.
▪ Map the peptides identified from the MSU genome to the genes by comparing the peptide alignment coordinates (start and end) with the gene coordinates. If a peptide's start and end are between or overlapping with a gene's start and end, we map this peptide to that gene and create a GFF3 line similar to the one described above.
▪ Update the original annotation files by appending the peptides' GFF3 lines obtained from our analysis to the end of the corresponding gene. At this point, we have an updated annotation in GFF3 format containing the original annotation and the proteome information. Thus, the "Type" column in the updated GFF3 files includes the type "peptide" beside the original types (3'UTR, 5'UTR, CDS, etc.). However, the identified novel peptides require further analysis to find out whether or not they represent novel genomic features.
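The conversion and mapping steps above can be sketched in a few lines. This is only an illustration written in Python, not the Perl scripts used for OryzaPG-DB; the attribute keys, the source label, the class and function names, and all example identifiers and coordinates are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    gene_id: str
    seqid: str   # chromosome name
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive


def overlaps(pep_start, pep_end, gene):
    """True if the peptide lies within or overlaps the gene's coordinates."""
    return pep_start <= gene.end and pep_end >= gene.start


def map_peptide_to_genes(seqid, pep_start, pep_end, genes):
    """Return the IDs of all genes on the same sequence that the peptide overlaps."""
    return [g.gene_id for g in genes
            if g.seqid == seqid and overlaps(pep_start, pep_end, g)]


def peptide_gff3_line(seqid, source, start, end, strand, peptide_id, parent):
    """Format one peptide alignment as a nine-column, tab-separated GFF3 line."""
    attributes = f"ID={peptide_id};Parent={parent}"
    # Columns: seqid, source, type, start, end, score, strand, phase, attributes
    return "\t".join([seqid, source, "peptide", str(start), str(end),
                      ".", strand, ".", attributes])


# Hypothetical example data (not taken from OryzaPG-DB)
genes = [Gene("GENE_A", "Chr1", 1000, 5000), Gene("GENE_B", "Chr1", 8000, 9500)]
hits = map_peptide_to_genes("Chr1", 4990, 5020, genes)
line = peptide_gff3_line("Chr1", "genome", 4990, 5020, "+", "PEP_0001", hits[0])
print(line)
```

In this sketch a peptide that extends past the end of a gene is still mapped to it, which mirrors the "between or overlapping" rule; a peptide that overlaps no gene returns an empty list and would be a candidate intergenic hit.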
Peptide novelty assessment and visualization of gene features
This analysis revealed 51 new genomic features in 40 genes. The majority of the novel features consisted of intronic peptides (36), while the exon boundary-spanning peptides consisted of 13 donor-spanning and 2 acceptor-spanning peptides. The remaining novel peptides were mapped to known coding regions.
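One possible way to assign the categories named above is sketched below. The exact criteria used for OryzaPG-DB are not spelled out in this section, so the rules here are assumptions: a single gene model on the plus strand, exons given as sorted, non-overlapping 1-based intervals, and hypothetical coordinates. On the minus strand the donor and acceptor labels would swap.

```python
def classify_peptide(pep_start, pep_end, exons):
    """Classify a peptide against the exons of one plus-strand gene model.

    exons: sorted list of (start, end) tuples, 1-based and inclusive.
    """
    for ex_start, ex_end in exons:
        if ex_start <= pep_start and pep_end <= ex_end:
            return "known coding"          # fully inside an annotated exon
    for ex_start, ex_end in exons:
        if ex_start <= pep_start <= ex_end < pep_end:
            return "donor-spanning"        # runs from the exon's 3' end into the intron
        if pep_start < ex_start <= pep_end <= ex_end:
            return "acceptor-spanning"     # runs from the intron into the exon's 5' end
    gene_start, gene_end = exons[0][0], exons[-1][1]
    touches_no_exon = all(pep_end < s or pep_start > e for s, e in exons)
    if gene_start <= pep_start and pep_end <= gene_end and touches_no_exon:
        return "intronic"                  # inside the gene, between exons
    return "other"


# Hypothetical gene model with two exons
exons = [(100, 200), (300, 400)]
labels = [classify_peptide(s, e, exons)
          for s, e in [(120, 150), (190, 220), (280, 310), (220, 250)]]
print(labels)

# Consistency check on the counts reported in the text
assert 36 + 13 + 2 == 51
```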
Generic scheme design for the relational database
Database implementation and web interface development
As mentioned above, PGFeval exports two reports: genes report and peptides report. Both reports are designed in master-slave style. Thus, both were imported directly into the database. The protein, cDNA, and transcript information such as the IDs, aliases, descriptions, lengths and sequences were extracted from the FASTA files and the GFF3 files obtained from the MSU website and MSU genome browser, then converted to tables using perl scripts. Next, HTML files, similar to MASCOT peptide view files, were generated and imported into the OryzaPG-DB server. The data were later imported into a database implemented using the MySQL server. The annotation files (GFF3) and the visualization files (PNG) are stored in the web server directly. The whole system is thus a two-tier web-based system.
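The extraction of IDs, descriptions, lengths and sequences from FASTA files into database tables can be illustrated as follows. This is a minimal sketch in Python rather than the Perl scripts mentioned above; the table layout and record names are assumptions, and an in-memory SQLite database stands in for the MySQL server so that the example is self-contained.

```python
import sqlite3


def _record(header, chunks):
    """Split a FASTA header into ID and description and join the sequence lines."""
    ident, _, description = header.partition(" ")
    return ident, description.strip(), "".join(chunks)


def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def fasta_to_rows(text):
    """Build table rows of (id, description, length, sequence)."""
    return [(i, d, len(s), s) for i, d, s in parse_fasta(text)]


def load_rows(rows):
    """Load rows into an in-memory SQLite table (a stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, description TEXT, "
                 "length INTEGER, sequence TEXT)")
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Hypothetical FASTA content
fasta = ">PROT_1 hypothetical example protein\nMKT\nAAL\n>PROT_2\nMSS\n"
rows = fasta_to_rows(fasta)
conn = load_rows(rows)
summary = conn.execute("SELECT id, length FROM protein ORDER BY id").fetchall()
print(summary)
```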
The web interface was developed using HTML and JavaScript, and the server-side scripting was done using PHP. The database was implemented using the MySQL server. We host the system on Microsoft Internet Information Services (IIS 7.5) on a Dell server running Windows 7 at the Institute for Advanced Biosciences (IAB), Keio University.
OryzaPG-DB Application Programming Interfaces (APIs)
An application programming interface (API) is an interface implemented by an application to allow interaction with the operating system or other programs. An API determines the protocol and parameters required to run certain functions or parts of the program and to return the results of its execution.
In OryzaPG-DB, we provide users with several URL APIs for data retrieval. For each entity, we provide users with an API that returns the results per record, per chromosome or for the whole genome. The data are returned in tabular view or in FASTA format with minimum formatting to allow easy processing. The complete list of the available APIs and their parameters can be found in the OryzaPG-DB API Guide, available on the OryzaPG-DB website.
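A client for a URL API of this kind might look like the sketch below. The script name and the parameter names are hypothetical placeholders, since the real ones are listed only in the OryzaPG-DB API Guide; only the base address comes from the text. The example composes a request URL and parses a tab-delimited response, and the sample response text is invented for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://oryzapg.iab.keio.ac.jp/"  # database address given in the text


def build_api_url(script, **params):
    """Compose a request URL; `script` and the parameter names are hypothetical."""
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}{script}?{query}"


def parse_tabular(text):
    """Split a tab-delimited response into rows of fields, skipping blank lines."""
    return [line.split("\t") for line in text.splitlines() if line.strip()]


def fetch(url, timeout=30):
    """Download the response body as text (requires network access; not called here)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical request: all peptides on one chromosome
url = build_api_url("api_example.php", entity="peptide", chromosome="1")
print(url)

# Invented sample response, standing in for fetch(url)
sample = "PEP_0001\tChr1\t4990\t5020\nPEP_0002\tChr1\t8100\t8130\n"
table = parse_tabular(sample)
print(table[0])
```

Because the responses carry minimal formatting, a plain split on tabs is enough for the tabular view; FASTA responses could be passed to any standard FASTA parser.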
Utility and Discussion
The current rice genome annotation includes 56,797 genes, most of which are putative (23,348 genes), hypothetical (8,885 genes) or conserved hypothetical (2,003 genes) [http://rice.plantbiology.msu.edu]. Thus, the genes for which experimental expression evidence is lacking represent more than 60% of the total annotated genes. Moreover, the available expression evidence (for 6,311 genes, representing about 11% of the total) is based on transcription, which does not necessarily imply translation to protein. This indicates the need for a novel approach to improve and refine the rice genome annotation.
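The proportions quoted above follow directly from the gene counts, as the short calculation below shows. The variable names are chosen for this illustration only.

```python
total_genes = 56797
putative = 23348
hypothetical = 8885
conserved_hypothetical = 2003
with_transcript_evidence = 6311

# Genes without experimental expression evidence
without_evidence = putative + hypothetical + conserved_hypothetical
pct_without = round(100 * without_evidence / total_genes, 1)

# Genes with transcription-based expression evidence
pct_with = round(100 * with_transcript_evidence / total_genes, 1)

print(without_evidence, pct_without, pct_with)
```

The three categories sum to 34,236 genes, or 60.3% of the annotation, while the 6,311 genes with transcript-level evidence make up 11.1%.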
To perform genome annotation refinement of rice by means of a proteogenomics approach, we first need accurate and high-throughput proteome information. Thus, we generated the rice MS/MS-based proteome using our highly accurate nanoLC-MS/MS proteomics facility (see additional file 1). We started with undifferentiated cultured rice cells to generate data for the construction of our bioinformatics pipeline and data repository system, because a relatively unbiased expression profile of the rice proteome was expected, based on the report that an Arabidopsis thaliana proteomics study using cultured cells covered over 70% of the differentiated organ proteome. We plan to generate similar datasets for all vegetative organs throughout the rice life cycle (see future work).
The generated data were compared against four databases (protein, cDNA, transcript and genome) for peptide/protein identification, and the resultant peptides were filtered using Mascot score and peptide length to select peptides with high identification confidence and high specificity (p value < 0.001 and false-positive rate (FPR) ≤ 1%). Then, the peptides identified from the protein database, together with the novel peptides identified uniquely from the other three databases, were used to create a list of peptides identified from non-redundant product ion spectra.
We utilized these peptides to perform proteogenomic analysis for the rice genes within our sample coverage. Our analysis revealed novel genomic features in 40 genes. In addition, 112 of the peptides identified from the genome database were mapped to intergenic regions, indicating the possible existence of non-annotated genes.
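The filtering and merging logic described above can be sketched as follows. The score and length cut-offs actually used are not given in this section, so the thresholds below are placeholders, and the record type, function names and example peptides are likewise assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    sequence: str
    score: float    # search-engine score, e.g. a Mascot ion score
    database: str   # "protein", "cdna", "transcript" or "genome"


def filter_hits(hits, min_score, min_length):
    """Keep hits that meet both the score and the peptide-length cut-offs."""
    return [h for h in hits
            if h.score >= min_score and len(h.sequence) >= min_length]


def non_redundant_peptides(hits):
    """Return protein-database peptides and peptides found only in the other databases."""
    from_protein = {h.sequence for h in hits if h.database == "protein"}
    novel = {h.sequence for h in hits if h.database != "protein"} - from_protein
    return sorted(from_protein), sorted(novel)


# Hypothetical hits and placeholder thresholds
hits = [
    PeptideHit("AAAGGGKLLR", 45.0, "protein"),
    PeptideHit("AAAGGGKLLR", 50.0, "genome"),      # same peptide, second database
    PeptideHit("TTTSSSPPPK", 38.0, "genome"),      # absent from the protein database
    PeptideHit("MMK", 60.0, "cdna"),               # too short
    PeptideHit("QQQEEEDDDR", 12.0, "transcript"),  # score too low
]
kept = filter_hits(hits, min_score=30.0, min_length=7)
known, novel = non_redundant_peptides(kept)
print(known, novel)
```

Using sets keeps each peptide sequence once, so a peptide identified from several databases is counted as known whenever the protein database also contains it.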
Plans for further development of OryzaPG-DB are mainly focused on the content and consequently also the interface. We plan to extend the data to include rice root, stem, leaf blade and other organs as soon as we generate those proteomes. In addition, proteogenomic analysis will be available for all genes covered by the new samples. The interface, therefore, will be updated to allow browsing the data by sample, organ, etc., and we will also add advanced search parameters, enabling auto-generation of updated FASTA sequences using experimentally based genome re-annotation.
The rapid growth of available sequenced genomes requires novel approaches to identify genes and their functions, as well as sustainable data repository systems to store the accumulated data and make it publicly available for researchers. Proteogenomics is a novel approach combining MS/MS-based proteomics with genomic information and bioinformatics to enhance genome annotation. In this report, we present OryzaPG-DB, the data repository system of the Rice Proteogenomics Project. OryzaPG-DB provides interested rice biologists with the MS/MS-based proteome and the results of proteogenomic analysis, together with all the genomic information within our coverage. The database currently contains the results for cells from undifferentiated culture, and it is planned to be updated periodically with the results of analysis of samples from all vegetative organs of rice. We believe OryzaPG-DB will be an important resource and data-serving tool for rice biologists.
Availability and Requirements
OryzaPG-DB is freely available at http://oryzapg.iab.keio.ac.jp. In the development of OryzaPG-DB, we followed the usual standards of web application development, and the JavaScript code employed is cross-browser. We have confirmed that OryzaPG-DB is fully functional on four web browsers (Google Chrome, Mozilla Firefox, Microsoft Internet Explorer and Safari) in five operating systems (Windows XP, Vista and 7, Linux Ubuntu and Mac OS 10.5), with no need for any plug-ins or special system requirements.
We would like to thank Professor Naoto Shibuya (Meiji University - Japan) and Dr. Hirofumi Nakagami (RIKEN Plant Science Center - Japan) for providing the rice cells, Dr. Nozomu Yachie (Harvard Medical School - USA), Dr. Naoyuki Sugiyama and Kalesh Sasidharan (Keio University - Japan) for valuable discussions, Yasuyuki Igarashi for technical assistance and other members of our institute for their contributions. This study is funded by Yamagata Prefecture and Tsuruoka City grants to Keio University, as well as by the Egyptian Bureau of Culture, Science and Education - Tokyo and The Global-Center of Excellence (G-COE) of Keio University to M. H. and JSPS Grants-in-Aid for Scientific Research (No. 21310129) and Science and Technology Incubation Program in Advanced Regions from Japan Science and Technology Agency to Y.I.
- Fullwood MJ, Wei CL, Liu ET, Ruan Y: Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Research. 2009, 19: 521-532. 10.1101/gr.074906.107.
- Siezen RJ, Hijum SAFTV: Genome (re-)annotation and open-source annotation pipelines. Microbial Biotechnology. 2010, 3: 8-
- Armengaud J: A perfect genome annotation is within reach with the proteomics and genomics alliance. Current Opinion in Microbiology. 2009, 12: 292-300. 10.1016/j.mib.2009.03.005.
- Koonin E, Galperin M: Sequence-Evolution-Function: Computational Approaches in Comparative Genomics. Kluwer Academic Publishers, USA; 2003.
- Reed JL, Famili I, Thiele I, Palsson BO: Towards multidimensional genome annotation. Nature Reviews Genetics. 2006, 7: 130-141. 10.1038/nrg1769.
- Wright JC, Sugden D, Francis-McIntyre S, Riba-Garcia I, Gaskell SJ, Grigoriev IV, Baker SE, Beynon RJ, Hubbard SJ: Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger. BMC Genomics. 2009, 10: 61-10.1186/1471-2164-10-61.
- Castellana NE, Payne SH, Shen Z, Stanke M, Bafna V, Briggs SP: Discovery and revision of Arabidopsis genes by proteogenomics. Proceedings of the National Academy of Sciences of the United States of America. 2008, 105: 21034-21038. 10.1073/pnas.0811066106.
- Ansong C, Purvine SO, Adkins JN, Lipton MS, Smith RD: Proteogenomics: needs and roles to be filled by proteomics in genome annotation. Briefings in Functional Genomics & Proteomics. 2008, 7: 50-62. 10.1093/bfgp/eln010.
- Brent MR: Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nature Reviews Genetics. 2008, 9: 62-73. 10.1038/nrg2220.
- Coghlan A, Fiedler TJ, McKay SJ, Flicek P, Harris TW, Blasiar D, Stein LD: nGASP--the nematode genome annotation assessment project. BMC Bioinformatics. 2008, 9: 549-10.1186/1471-2105-9-549.
- Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, et al: EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biology. 2006, 7 (Suppl 1): S2 1-31. 10.1186/gb-2006-7-s1-s2.
- Allen JE, Pertea M, Salzberg SL: Computational gene prediction using multiple sources of evidence. Genome Research. 2004, 14: 142-148. 10.1101/gr.1562804.
- Castellana N, Bafna V: Proteogenomics to discover the full coding content of genomes: A computational perspective. Journal of Proteomics. 2010, 73: 2124-2135. 10.1016/j.jprot.2010.06.007.
- Armengaud J: Proteogenomics and systems biology: quest for the ultimate missing parts. Expert Review of Proteomics. 2010, 7: 65-77. 10.1586/epr.09.104.
- Jaffe JD, Berg HC, Church GM: Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 2004, 4: 59-77. 10.1002/pmic.200300511.
- Wang R, Prince JT, Marcotte EM: Mass spectrometry of the M. smegmatis proteome: protein expression levels correlate with function, operons, and codon bias. Genome Research. 2005, 15: 1118-1126. 10.1101/gr.3994105.
- Tanner S, Shen Z, Ng J, Florea L, Guigo R, Briggs SP, Bafna V: Improving gene annotation using peptide mass spectrometry. Genome Research. 2007, 17: 231-239. 10.1101/gr.5646507.
- Power KA, McRedmond JP, de Stefani A, Gallagher WM, Gaora PO: High-throughput proteomics detection of novel splice isoforms in human platelets. PLoS One. 2009, 4: e5001-10.1371/journal.pone.0005001.
- Ansong C, Yoon H, Norbeck AD, Gustin JK, McDermott JE, Mottaz HM, Rue J, Adkins JN, Heffron F, Smith RD: Proteomics analysis of the causative agent of typhoid fever. Journal of Proteome Research. 2008, 7: 546-557. 10.1021/pr070434u.
- McCarthy FM, Bridges SM, Wang N, Magee GB, Williams WP, Luthe DS, Burgess SC: AgBase: a unified resource for functional analysis in agriculture. Nucleic Acids Research. 2007, 35: D599-603. 10.1093/nar/gkl936.
- Baerenfaller K, Grossmann J, Grobei MA, Hull R, Hirsch-Hoffmann M, Yalovsky S, Zimmermann P, Grossniklaus U, Gruissem W, Baginsky S: Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science. 2008, 320: 938-941. 10.1126/science.1157956.
- Sasaki T: Current status of and future prospects for genome analysis in rice. Springer-Verlag, Japan; 1999.
- Matsumoto T, Wu J, Antonio BA, Sasaki T: Development in rice genome research based on accurate genome sequence. International Journal of Plant Genomics. 2008, 2008: 348621-10.1155/2008/348621.
- IRGSP: The map-based sequence of the rice genome. Nature. 2005, 436: 793-800. 10.1038/nature03895.
- Itoh T, Tanaka T, Barrero RA, Yamasaki C, Fujii Y, Hilton PB, Antonio BA, Aono H, Apweiler R, Bruskiewich R, et al: Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana. Genome Research. 2007, 17: 175-183. 10.1101/gr.5509507.
- Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek RL, Lee Y, Zheng L, et al: The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Research. 2007, 35: D883-887. 10.1093/nar/gkl976.
- Komatsu S, Tanaka N: Rice proteome analysis: a step toward functional analysis of the rice genome. Proteomics. 2005, 5: 938-949. 10.1002/pmic.200401040.
- Wu CC, MacCoss MJ, Howell KE, Yates JR: A method for the comprehensive proteomic analysis of membrane proteins. Nature Biotechnology. 2003, 21: 532-538. 10.1038/nbt819.
- Cargile BJ, Bundy JL, Freeman TW, Stephenson JL: Gel based isoelectric focusing of peptides and the utility of isoelectric point in protein identification. Journal of Proteome Research. 2004, 3: 112-119. 10.1021/pr0340431.
- Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999, 20: 3551-3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
- Mo F, Hong X, Gao F, Du L, Wang J, Omenn GS, Lin B: A compatible exon-exon junction database for the identification of exon skipping events using tandem mass spectrum data. BMC Bioinformatics. 2008, 9: 537-10.1186/1471-2105-9-537.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology. 1990, 215: 403-410.
- Tatusova TA, Madden TL: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiology Letters. 1999, 174: 247-250. 10.1111/j.1574-6968.1999.tb13575.x.
- Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The generic genome browser: a building block for a model organism system database. Genome Research. 2002, 12: 1599-1610. 10.1101/gr.403602.
- Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, et al: The UCSC Genome Browser Database. Nucleic Acids Research. 2003, 31: 51-54. 10.1093/nar/gkg129.
- Bitton DA, Smith DL, Connolly Y, Scutt PJ, Miller CJ: An integrated mass-spectrometry pipeline identifies novel protein coding-regions in the human genome. PLoS One. 2010, 5: e8949-10.1371/journal.pone.0008949.
- Fermin D, Allen BB, Blackwell TW, Menon R, Adamski M, Xu Y, Ulintz P, Omenn GS, States DJ: Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biology. 2006, 7: R35-10.1186/gb-2006-7-4-r35.
- Tulach J: Practical API Design: Confessions of a Java Framework Architect. New York: Apress; 2008.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.