PLANEX: the plant co-expression database
© Yim et al.; licensee BioMed Central Ltd. 2013
Received: 21 January 2013
Accepted: 16 May 2013
Published: 20 May 2013
The PLAnt co-EXpression database (PLANEX) is a new internet-based database for plant gene analysis. PLANEX (http://planex.plantbioinformatics.org) contains publicly available GeneChip data obtained from the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (NCBI). PLANEX is a genome-wide co-expression database, which allows for the functional identification of genes from a wide variety of experimental designs. It can be used to characterize genes for functional identification and to analyze a gene's dependency on other genes. Gene co-expression databases have been developed for other species, but gene co-expression information for plants is currently limited.
We constructed PLANEX as a list of co-expressed genes and functional annotations for Arabidopsis thaliana, Glycine max, Hordeum vulgare, Oryza sativa, Solanum lycopersicum, Triticum aestivum, Vitis vinifera and Zea mays. PLANEX reports Pearson’s correlation coefficients (PCCs; r-values) between a gene of interest and the other genes on the microarray platform set corresponding to a particular organism. To support the PCCs, PLANEX performs an enrichment test of Gene Ontology terms and computes Cohen’s Kappa values to compare functional similarity for all genes in the co-expression database. PLANEX draws a cluster network of co-expressed genes, which is estimated using the k-means method. To construct PLANEX, a variety of datasets were processed on an IBM supercomputer running Advanced Interactive eXecutive (AIX) in a supercomputing center.
PLANEX provides a correlation database, a cluster network and an interpretation of enrichment test results for eight plant species. A typical query gene yields a co-expression list containing hundreds of genes that can be passed on to enrichment analysis. Also, co-expressed genes can be identified and cataloged in terms of comparative genomics by using the ‘Co-expression gene compare’ feature. This type of analysis will help interpret experimental data and determine whether there is a common term among genes of interest.
A combination of methodologies from the fields of genomics, proteomics and bioinformatics provides a powerful approach to investigating biological processes. Biological functions of genes are usually determined by the interactions of proteins or gene products, and the expression of genes involved in the same biological process is frequently related. Therefore, co-expressed genes might be related in a biological pathway and may provide information critical for understanding complex biological systems [1, 2]. Many technical approaches have been used in genome-wide experiments, and the ability to measure the regulation of several thousand genes simultaneously has revolutionized the way biological processes are analyzed. To understand biological systems, co-expression data have been used in a wide variety of experimental designs, including gene targeting, regulatory investigations and identification of potential partners in protein-protein interactions.
Substantial amounts of such expression data are required to estimate co-expressed gene dependency. Unfortunately, these experiments are costly and time consuming. However, a vast number of gene expression data sets have recently become available for several plant species. The most popular public microarray databases are ArrayExpress, Gene Expression Omnibus (GEO), NASCArrays and Genevestigator. Still, it is difficult for biological researchers to manage this large amount of gene expression data without a background in bioinformatics. To this end, the field of bioinformatics has accelerated co-expression analysis of biological processes. In addition, the completion of the genome sequences of the model plants Arabidopsis thaliana, Glycine max, Oryza sativa, Solanum lycopersicum, Vitis vinifera and Zea mays has advanced genome and gene expression analysis. For other species with poorly resolved gene expression data, such as Hordeum vulgare and Triticum aestivum, genome resources are improving with The Gene Index Project by the Dana-Farber Cancer Institute (DFCI). The annotated genome sequences have stimulated the development of a number of functional genomic approaches. These materials are valuable for gene expression analysis with genome-scale microarrays.
During the co-expression data set construction, the gene expression data were normalized with summarization methods, including RMA, GCRMA and MAS5. One method of identifying co-expressed gene sets is through the estimation of gene expression similarity. The most convenient way to estimate gene expression similarities is to use Pearson's correlation coefficients (PCCs) [1, 18]. If similarity is determined by a correlation metric (e.g. PCCs), a comprehensive pairwise matrix of correlation values is generated that represents expression similarity.
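The pairwise correlation matrix described above can be illustrated with a minimal Python sketch. It is not the PLANEX pipeline: the probe-set IDs and expression values are hypothetical, and the function names `pearson` and `pairwise_pcc` are introduced here only for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant profile has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def pairwise_pcc(expression):
    """Return {(probe_a, probe_b): r} for every unordered pair of probe sets."""
    return {
        (a, b): pearson(expression[a], expression[b])
        for a, b in combinations(sorted(expression), 2)
    }

# Hypothetical probe-set IDs with expression values across four arrays.
expression = {
    "probe_A": [1.0, 2.0, 3.0, 4.0],
    "probe_B": [2.0, 4.0, 6.0, 8.0],
    "probe_C": [4.0, 3.0, 2.0, 1.0],
}
pcc = pairwise_pcc(expression)
```

Here `probe_A` and `probe_B` rise together (r close to 1), while `probe_A` and `probe_C` move in opposite directions (r close to -1). For a full GeneChip platform the number of pairs grows quadratically with the number of probe sets, which is why precomputed values are useful.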
Based on co-expression data set analysis, we focused on improving the construction of gene networks. Principal components analysis (PCA) is a popular technique used to find the major components of a multivariate dataset. In DNA microarray analysis, it is used to find the gene groups whose expression changes cooperatively over several experiments, and PCA is done in gene space. The k-means clustering algorithm is then combined with PCA to reveal samples with large contributions.
Plant co-expression databases have previously been constructed for Arabidopsis thaliana, Oryza sativa and Hordeum vulgare. These databases, the Arabidopsis Co-expression Toolkit (ACT), STARNET 2, RiceArrayNet, ATTED-II, the Co-expressed biological Processes (CoP) database and PlaNet, are used for searching co-expression relationships and incorporating functional data. Given the recent rapid growth of high performance computers with the ability to perform rapid calculations, co-expression database construction is possible using large-scale gene expression data.
In this report, we describe the construction and use of the PLAnt co-EXpression database (PLANEX; Additional file 1: Table S1) and discuss the output produced by user query. PLANEX mines already-computed gene pair correlations across eight species of plants. With PLANEX, we provide Arabidopsis thaliana, Glycine max, Hordeum vulgare, Oryza sativa, Solanum lycopersicum, Triticum aestivum, Vitis vinifera and Zea mays co-expression data sets with a user-friendly web interface for retrieving co-expressed gene lists and functional enrichment data of interest. A central motivation for constructing PLANEX was to leverage massive resources of microarray data for biological interactions, expression diversity and the discovery of putative gene regulatory relationships prior to conducting additional costly wet lab experiments. This database provides details that may aid in understanding expression similarity and functional enrichment of input genes.
Table. Co-expression data information contained in PLANEX (columns include the number of microarray slides and the source database of coding sequences).
All of the raw data (in CEL file format) were downloaded through programmatic access to GEO (http://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html). We excluded GEO Series (GSEs) that included truncated GEO Samples (GSMs). Cross-platform GSMs were also excluded, including GSE13641 (Rorippa amphibia expression profile on the Arabidopsis thaliana Affymetrix GeneChip platform; GPL198). When collecting raw data, we likewise excluded expression data from related species and subspecies hybridized to a given platform, including Glycine soja on the Glycine max platform (GPL4592; e.g. GSE20323) and Arabidopsis lyrata subsp. petraea and Arabidopsis halleri on the Arabidopsis thaliana Affymetrix GeneChip platform (GPL198; e.g. GSE5738).
The CEL files, which hold the intensities calculated from the chip pixel values, were used to summarize probe sets. All expression levels were processed by background subtraction, normalization and probe set summarization. We applied the RMA algorithm, which performs background correction followed by quantile normalization. Probe sets were summarized for all microarrays of each of the eight species using Affymetrix Power Tools.
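To make the quantile normalization step more concrete, the following Python sketch shows the idea on toy data. It is only an illustration under stated assumptions: PLANEX used Affymetrix Power Tools for this, the function name `quantile_normalize` and the intensity values are hypothetical, and the background correction and probe set summarization steps of RMA are not shown.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity lists (one per array).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank, so every array ends up with the same distribution.
    Ties within an array are broken by probe order in this simple sketch.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean intensity at each rank, taken across arrays.
    rank_means = [sum(col) / len(arrays) for col in zip(*sorted_arrays)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Two hypothetical arrays with three probes each.
arrays = [
    [5.0, 2.0, 3.0],
    [4.0, 1.0, 6.0],
]
normalized = quantile_normalize(arrays)
```

After normalization both toy arrays contain the same set of values (1.5, 3.5 and 5.5), and each probe keeps its original rank within its own array.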
Table. The thresholds for co-expression values (columns include the number of probes).
For clustering, the gene expression values were used for analysis. We applied the k-means clustering method to the expression data, which assigns each point to the cluster whose center is nearest. We used PCA to determine the number of clusters, k. The PCA was conducted using CLUSTER, so that the clusters were ordered and chosen to maximally explain the remaining variance in the data vectors. Consequently, k-means clustering was run with the number of clusters determined for each species. The large amount of expression data required long clustering times. Therefore, we compiled the Parallel K-means Data Clustering code, which was executed on the AIX supercomputer system with MPI. The k-means algorithm provided the nodes of the co-expression network in PLANEX.
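The nearest-center assignment used by k-means can be sketched in a few lines of serial Python. This is a toy illustration, not the parallel MPI code run for PLANEX. The initialization (a farthest-first choice of starting centers) is an assumption made here to keep the example deterministic, and the expression profiles are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_centers(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from its nearest already-chosen center (an assumed strategy)."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(list(nxt))
    return centers

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; returns (centers, labels)."""
    centers = farthest_first_centers(points, k)
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, labels

# Six hypothetical two-sample expression profiles forming two obvious groups.
profiles = [
    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]
centers, labels = kmeans(profiles, k=2)
```

The first three profiles end up in one cluster and the last three in the other. On genome-scale data the assignment step dominates the run time, which is the part a parallel implementation distributes across processes.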
The genome sequence and annotation project Phytozome was recently completed and released. We clarified annotations and sequences of the species by downloading all Affymetrix GeneChip probe sequences, and we mapped the probes against the nucleotide sequences of the genomes of six sequenced plants: Arabidopsis thaliana, Glycine max, Oryza sativa, Vitis vinifera, Solanum lycopersicum and Zea mays (Phytozome v9.0). In contrast, species whose genome sequences are still unfinished, such as Hordeum vulgare and Triticum aestivum, were mapped with Tentative Consensus sequences from DFCI. The probe matches were made using our own Perl script. The script string-matched nucleotide sequences (including the reverse complement) against each individual GeneChip probe of a given species and returned a list of probe set affinities that corresponded to the sequences of each species. Specifically, Zea mays had 15 probe pairs per probe set, and all other plant species had 11 pairs per probe set.
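The exact string matching with reverse complement described above can be sketched as follows. The original script was written in Perl and is not reproduced here; this Python version, its function names and the short toy sequences are assumptions for illustration only. "Affinity" is taken here to mean the number of probes in a probe set that match the target perfectly.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G and T only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def probe_matches(probe, target):
    """True if the probe, or its reverse complement, occurs in the target
    without any mismatch or gap."""
    probe, target = probe.upper(), target.upper()
    return probe in target or reverse_complement(probe) in target

def probe_set_affinity(probes, target):
    """Return (number of perfectly matching probes, total probes)."""
    matched = sum(probe_matches(p, target) for p in probes)
    return matched, len(probes)

# Hypothetical target sequence and probe set (toy lengths, not real probes).
target = "ATGGCGTACGTTAGCCTAGGATCC"
probes = [
    "GCGTACGTTAGC",  # matches the forward strand
    "GGATCCTAGG",    # matches only as a reverse complement
    "GCGTTCGTTAGC",  # one mismatch, so no match
]
affinity = probe_set_affinity(probes, target)
```

For this toy input two of the three probes match, so the affinity is reported as 2 out of 3. Because the comparison is a plain substring test, a single mismatched base is enough to reject a probe, which is the stricter behavior that distinguishes string matching from a BLAST-style alignment.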
Due to the hierarchical tree of the gene ontology (GO) terms and redundancy of the terms, we mapped GO terms against representative gene function. The DFCI provided GO mapping annotation. Phytozome sequence annotation did not support GO mapping annotation, but it did provide Pfam IDs; we mapped the representative Pfam IDs against GO terms using the mapping of external classification systems to GO. GO-TermFinder was used to estimate the enrichment of GO terms. GO-TermFinder was integrated into PLANEX using a web interface, which evaluated the enrichment of the principal GO categories (cellular component, biological process and molecular function) with the hypergeometric distribution and the False Discovery Rate (FDR) described by Benjamini and Hochberg.
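The two statistical ingredients named above, a hypergeometric test per GO term and the Benjamini-Hochberg adjustment, can be sketched in plain Python. This is not GO-TermFinder code; the function names, term labels and counts below are hypothetical and serve only to show the arithmetic.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical counts per term:
# (genes with the term in the co-expressed list, genes with the term overall).
term_counts = {"term_a": (8, 40), "term_b": (2, 300), "term_c": (5, 60)}
list_size, genome_size = 50, 5000

terms = sorted(term_counts)
pvals = [
    hypergeom_pvalue(term_counts[t][0], list_size, term_counts[t][1], genome_size)
    for t in terms
]
fdr = benjamini_hochberg(pvals)
```

In this toy input `term_a` is strongly over-represented (8 annotated genes where fewer than one is expected by chance), whereas `term_b` is not. The adjusted values in `fdr` are never smaller than the raw p-values.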
Cohen’s Kappa statistics were used to compare co-expression data between species. An in-house module similar to the online DAVID tool [35, 36] was used to evaluate co-expression similarity using Kappa statistics and was integrated using a web interface. Protein sequences were used to select two genes from among the species Arabidopsis thaliana, Glycine max, Oryza sativa, Vitis vinifera and Zea mays. After two query genes were submitted, the module compared the co-expression data sets of the two query genes, which were converted to Pfam IDs. Kappa measures the proportion of data values on the main diagonal of the agreement table and then adjusts that value for the amount of agreement that could be expected due to chance alone.
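As a worked illustration of the Kappa calculation, the sketch below compares two sets of Pfam IDs over a shared universe of IDs. It is an assumption-laden toy, not the in-house module: the function name `cohen_kappa`, the choice of universe and the placeholder Pfam-style identifiers are all hypothetical.

```python
def cohen_kappa(set_a, set_b, universe):
    """Cohen's Kappa for the presence or absence of each ID in two sets."""
    universe = set(universe)
    a = set(set_a) & universe
    b = set(set_b) & universe
    n = len(universe)
    if n == 0:
        raise ValueError("universe must not be empty")
    both = len(a & b)        # ID present in both co-expression sets
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n - both - only_a - only_b
    # Observed agreement: the main diagonal of the 2 x 2 table.
    observed = (both + neither) / n
    # Agreement expected by chance, from the row and column totals.
    expected = (
        (both + only_a) * (both + only_b)
        + (neither + only_b) * (neither + only_a)
    ) / (n * n)
    if expected == 1.0:
        return float("nan")  # Kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)

# Placeholder identifiers in Pfam style; the memberships are made up.
universe = ["PF%05d" % i for i in range(1, 11)]
gene1_pfams = {"PF00001", "PF00002", "PF00003", "PF00004"}
gene2_pfams = {"PF00003", "PF00004", "PF00005"}
kappa = cohen_kappa(gene1_pfams, gene2_pfams, universe)
```

With these toy sets the observed agreement is 0.70 and the chance agreement is 0.54, giving a Kappa of about 0.35. A value near 1 would indicate nearly identical Pfam content in the two co-expression sets, and a value near 0 would indicate agreement no better than chance.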
After a query is submitted to ‘Search by IDs’ or ‘Search with BLAST’, the probe match results page is shown. The probe match page indicates the number of probes matching the query over the total number of probes, as well as their affinity, shown as ‘match’ (Figure 3B). This probe match page helps discard probes that are redundant for a gene. PLANEX finds many co-expressed genes within the cut-off values (Figure 3C). The duplicated Affymetrix IDs are indicated in the ‘Duplicated’ section of the results page (Figure 3D). The co-expressed gene set can be downloaded in CSV format for analysis by GO-TermFinder. GO-TermFinder provides three GO term enrichment analyses with a hypergeometric p-value < 0.05 at FDR ≤ 10⁻⁶ (Figure 3E). After submitting a query to ‘Retrieve PCCs with gene list’, the gene list will show the correlation in pairwise format (Figure 3F). For this search, PLANEX does not provide a probe match page; instead, it provides all potentially matching probe sets for the gene list, together with their PCCs and affinity. The data are supported by GO-TermFinder, as in the other searches.
PLANEX is a novel database that helps researchers study complex biological processes through co-expressed gene sets overlaid onto k-means clusters. ATTED-II, STARNET 2, RiceArrayNet and CoP provide co-expression relationships, but they contain only one to three sets of co-expression data. Therefore, an advantage of PLANEX is that it combines sets of co-expression data from eight different species. Additionally, it clusters and compares members of co-expressed genes. As far as we know, PLANEX is the only system that combines cluster and PCC data.
Another advantage of PLANEX is that probes were mapped against representative genes by string match instead of BLAST. Our probe match script produced positive results if each base in a probe sequence matched perfectly with the representative gene sequence without any gap.
One potential application in PLANEX is GO-TermFinder. We generated a Saccharomyces Genome Database (SGD) file format for each species. Model species like Arabidopsis thaliana and Oryza sativa have a large set of functionally annotated genes with GO terms supported by various experimentally-derived evidence codes. In contrast, other organisms only have annotations inferred through electronic annotation (e.g., Vitis vinifera and Zea mays) or completely lack functional annotation. Since we initially lacked functional GO data, we converted Pfam to GO IDs and built an SGD file for functional enrichment analysis. However, this mapping should be used only as a guide.
Our previous report on Oryza sativa genome duplication gave 0.545 as the positive (top 1% of PCCs) threshold, but we used 0.646 as the positive PCC threshold for Oryza sativa in this report. We adopted this different criterion because we included more microarrays than in the previous report, and we believed that more microarrays would lend more significance to the expression study. Also, Aoki et al. specified a minimum PCC value (0.55-0.66) for co-expressed gene retrieval to minimize false gene function relationships. We provided a specific threshold for retrieving co-expressed genes for each species whose PCCs showed a normal distribution (Figure 1).
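A threshold defined as the top 1% of PCCs can be derived from the empirical distribution of correlation values. The sketch below uses a simple nearest-rank rule, which is an assumption: the text does not state how the percentile was computed, and the function name and the synthetic PCC values are hypothetical.

```python
import math

def top_fraction_threshold(values, fraction=0.01):
    """Smallest value that still belongs to the top `fraction` of values
    (nearest-rank rule, assumed here for illustration)."""
    if not values or not 0 < fraction <= 1:
        raise ValueError("need values and a fraction in (0, 1]")
    ordered = sorted(values, reverse=True)
    count = max(1, math.floor(len(ordered) * fraction))
    return ordered[count - 1]

# Synthetic PCC values from -1.00 to 0.99 in steps of 0.01 (200 values).
pcc_values = [i / 100 for i in range(-100, 100)]
threshold = top_fraction_threshold(pcc_values, 0.01)

# Keep only the values that reach the positive threshold.
positive = [r for r in pcc_values if r >= threshold]
```

With 200 synthetic values the top 1% holds two values, so the threshold is 0.98. On real data the threshold depends on the number and composition of the microarrays included, which is consistent with the different Oryza sativa values reported above.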
The ‘Co-expression gene compare’ tab on the PLANEX menu provides data for comparative genomics. The Arabidopsis genome is believed to contain a similar number of genes to the rice genome, and both have undergone a whole genome duplication event [47, 48]. The Kappa coefficients are expected to reflect the degree of expression divergence in the data. Previously, we reported that the rice gene families evidenced a similarly high degree of expression diversity between members using rice public microarrays. The comparison of co-expressed genes may support the understanding of specialization in the direction of complex biological processes between members of a gene family over evolutionary time.
The small, but important, function of comparing co-expressed genes may provide clues to the molecular functional conservation or diversity between orthologous genes, particularly Poaceae family genes. PLANEX can be used to interpret results of co-expressed genes and also to perform delicate analyses in comparative genomics. PLANEX complements existing databases and tools such as ATTED-II, CoP and STARNET 2.
Project name: PLANEX
Operating system(s): Platform Independent (tested on Windows, i386 Linux and Mac)
Programming Languages: Perl
Other requirements: Web browser (tested on Chrome, Safari and Explorer)
License: Creative Commons Attribution License
The server is freely available at http://planex.plantbioinformatics.org
ACT: Arabidopsis Co-expression Toolkit
AIX: Advanced Interactive eXecutive
DFCI: Dana-Farber Cancer Institute
GEO: Gene Expression Omnibus
KEGG: Kyoto Encyclopedia of Genes and Genomes
PCA: Principal components analysis
PCCs: Pearson’s correlation coefficients
PLANEX: The PLAnt co-EXpression database
SGD: Saccharomyces Genome Database
TAIR: The Arabidopsis Information Resource.
The authors would like to thank Kunho Kim for contributing to the PLANEX project and, thereby, making bioinformatics investigations possible. We thank Silex and Jongjin Lee for building the web interface. We would also like to thank Hojung Yun for his contribution, and various anonymous referees for improvements in Perl and this manuscript.
This research was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea funded by the Ministry of Education, Science and Technology (NRF-2011-0011643).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.