PLANEX: the plant co-expression database

Background The PLAnt co-EXpression database (PLANEX) is a new internet-based database for plant gene analysis. PLANEX (http://planex.plantbioinformatics.org) contains publicly available GeneChip data obtained from the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (NCBI). PLANEX is a genome-wide co-expression database, which allows for the functional identification of genes from a wide variety of experimental designs. It can be used to characterize genes for functional identification and to analyze a gene’s dependency on other genes. Gene co-expression databases have been developed for other species, but gene co-expression information for plants is currently limited.
Description We constructed PLANEX as a list of co-expressed genes and functional annotations for Arabidopsis thaliana, Glycine max, Hordeum vulgare, Oryza sativa, Solanum lycopersicum, Triticum aestivum, Vitis vinifera and Zea mays. PLANEX reports Pearson’s correlation coefficients (PCCs; r-values) between a gene of interest and the other genes on a given microarray platform for a particular organism. To support the PCCs, PLANEX performs an enrichment test of Gene Ontology terms and uses Cohen’s Kappa value to compare functional similarity for all genes in the co-expression database. PLANEX draws a cluster network of co-expressed genes, which is estimated using the k-means method. To construct PLANEX, a variety of datasets were processed on an IBM supercomputer running Advanced Interactive eXecutive (AIX) in a supercomputing center.
Conclusion PLANEX provides a correlation database, a cluster network and an interpretation of enrichment test results for eight plant species. A typical query gene yields a co-expression list containing hundreds of genes, which can be used for enrichment analysis. Also, co-expressed genes can be identified and cataloged in terms of comparative genomics by using the ‘Co-expression gene compare’ feature. This type of analysis will help interpret experimental data and determine whether there is a common term among genes of interest.


Background
A combination of methodologies from the fields of genomics, proteomics and bioinformatics provides a powerful approach to investigating biological processes. Biological functions of genes are usually determined by the interaction of a protein or gene product, and gene expressions are frequently related in biological processes. Therefore, co-expressed genes might be related in a biological pathway and may provide information critical for understanding complex biological systems [1,2]. Many technical approaches have been used in genome-wide experiments, and the ability to measure the regulation of several thousand genes simultaneously has revolutionized the way biological processes are analyzed. To understand biological systems, co-expression data have been used in a wide variety of experimental designs, including gene targeting, regulatory investigations and identification of potential partners in protein-protein interactions [3].
Substantial amounts of such expression data are required to estimate co-expressed gene dependency. Unfortunately, these experiments are costly and time-consuming. However, a vast number of gene expression data sets have recently become available for several plant species. The most popular public microarray databases are ArrayExpress [4], Gene Expression Omnibus (GEO) [5], NASCArrays [6] and Genevestigator [7]. Still, it is difficult for biological researchers to manage this large amount of gene expression data without a background in bioinformatics. To this end, the field of bioinformatics has accelerated co-expression analysis of biological processes. In addition, the completion of the genome sequences of the model plants Arabidopsis thaliana [8], Glycine max [9], Oryza sativa [10], Solanum lycopersicum [11], Vitis vinifera [12] and Zea mays [13] has advanced genome and gene expression analysis. For other species with poorly resolved gene expression data, such as Hordeum vulgare and Triticum aestivum, genome resources are improving with The Gene Index Project by the Dana-Farber Cancer Institute (DFCI) [14]. The annotated genome sequences have stimulated the development of a number of functional genomic approaches. These materials are valuable for gene expression analysis with genome-scale microarrays.
During the co-expression data set construction, the gene expression data were normalized with summarization methods, including RMA [15], GCRMA [16] and MAS5 [17]. One method of identifying co-expressed gene sets is through the estimation of gene expression similarity. The most convenient way to estimate gene expression similarities is to use Pearson's correlation coefficients (PCCs) [1,18]. If similarity is determined by a correlation metric (e.g. PCCs), a comprehensive pairwise matrix of correlation values is generated that represents expression similarity.
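The pairwise correlation matrix described above can be illustrated with a minimal Python sketch. It is not the code used to build PLANEX; the function name `pcc_matrix` and the toy expression values are assumptions made for illustration, with rows taken as probe sets and columns as arrays.

```python
import numpy as np

def pcc_matrix(expr):
    """Return the probe-set-by-probe-set Pearson correlation matrix.

    expr: 2-D array, rows = probe sets (genes), columns = arrays (samples).
    Assumes no row has constant expression (zero variance).
    """
    expr = np.asarray(expr, dtype=float)
    # Center each row on its mean, then scale it to unit length.
    centered = expr - expr.mean(axis=1, keepdims=True)
    norms = np.sqrt((centered ** 2).sum(axis=1, keepdims=True))
    unit = centered / norms
    # The dot product of two unit-length centered rows is their PCC.
    return unit @ unit.T

# Hypothetical toy data: three probe sets measured on four arrays.
toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # rises with the first row
    [4.0, 3.0, 2.0, 1.0],   # falls as the first row rises
])
r = pcc_matrix(toy)
```

For a platform with many thousands of probe sets the full matrix is large, which is one reason such calculations are usually distributed over many processors.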
Based on co-expression data set analysis, we focused on improving the construction of gene networks. Principal components analysis (PCA) is a popular technique used to find the major components of a multivariate dataset. In DNA microarray analysis, it is used to find the gene groups that cooperatively change expression over several experiments [19], and PCA is done in gene space. Then, the k-means clustering algorithm is combined with it to reveal samples with large contributions.
Plant co-expression databases have previously been constructed for Arabidopsis thaliana, Oryza sativa and Hordeum vulgare. These databases, the Arabidopsis Coexpression Toolkit (ACT) [20], STARNET 2 [21], RiceArrayNet [22], ATTED-II [23], Co-expressed biological Processes (CoP) database [24] and PlaNet [25], are used for searching co-expression relationships and incorporating functional data. Given the recent rapid growth of high performance computers with the ability to perform rapid calculations, co-expression database construction is possible using large-scale gene expression data.
In this report, we describe the construction and use of the PLAnt co-EXpression database (PLANEX; Additional file 1: Table S1) and discuss the output produced by user query. PLANEX mines already-computed gene pair correlations across eight species of plants. With PLANEX, we provide Arabidopsis thaliana, Glycine max, Hordeum vulgare, Oryza sativa, Solanum lycopersicum, Triticum aestivum, Vitis vinifera and Zea mays co-expression data sets with a user-friendly web interface for retrieving coexpressed gene lists and functional enrichment data of interest. A central motivation for constructing PLANEX was to leverage massive resources of microarray data for biological interactions, expression diversity and the discovery of putative gene regulatory relationships prior to conducting additional costly wet lab experiments. This database provides details that may aid in understanding expression similarity and functional enrichment of input genes.

Expression data
Raw microarray data were obtained from the GEO of the National Center for Biotechnology Information (NCBI) through April 2011. We selected data from the Arabidopsis thaliana, Glycine max, Hordeum vulgare, Oryza sativa, Solanum lycopersicum, Triticum aestivum, Vitis vinifera and Zea mays Affymetrix GeneChip Genome Arrays, which are among the most frequently used and publicly deposited platforms for plants (Table 1).
All of the raw data (in CEL file format) were downloaded through programmatic access to GEO (http://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html). We excluded GEO Series (GSEs) that included truncated GEO Sample (GSM) files. Cross-platform GSMs were also excluded, including GSE13641 (Rorippa amphibia expression profile on the Arabidopsis thaliana Affymetrix GeneChip platform; GPL198). When collecting raw data, we also excluded expression data from related species and subspecies, including Glycine soja on the Glycine max platform (GPL4592; e.g. GSE20323) and Arabidopsis lyrata subsp. petraea and Arabidopsis halleri on the Arabidopsis thaliana Affymetrix GeneChip platform (GPL198; e.g. GSE5738). The CEL files, which hold the intensities calculated from the chip pixel values, were used for summarizing probe sets. All expression levels were processed by background subtraction, normalization and probe set summarization. We performed background correction and quantile normalization using the RMA algorithm. Probe sets were summarized across all microarrays for each of the eight species using Affymetrix Power Tools [26].
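As an illustration of the quantile normalization step only, the following Python sketch forces every array to share the same intensity distribution. It is a simplified stand-in, not Affymetrix Power Tools and not the full RMA procedure (background correction and probe set summarization are omitted); the function name `quantile_normalize` and the toy matrix are assumptions, and ties are broken by sort order rather than averaged.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns of x (rows = probes, columns = arrays)."""
    x = np.asarray(x, dtype=float)
    # Sort order of the probes within each array.
    order = np.argsort(x, axis=0)
    sorted_vals = np.take_along_axis(x, order, axis=0)
    # Reference distribution: mean intensity at each rank across arrays.
    reference = sorted_vals.mean(axis=1)
    # Put the reference value for each rank back at the probe's position.
    out = np.empty_like(x)
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out

# Hypothetical toy data: four probes on three arrays.
toy = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 6.0, 6.0],
    [4.0, 2.0, 8.0],
])
qn = quantile_normalize(toy)
```

After normalization every column contains the same set of values, while the rank of each probe within its own array is unchanged.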

Implementation
The gene co-expression data were precomputed and then loaded into the PLANEX system. They were computed from the summarized probe set expression data. We used PCCs to assess the extent of gene co-expression, and we developed new C++ code to generate the co-expression data. The pairwise co-expression calculations did not require heavy CPU power, but numerous CPUs helped reduce the calculation time. We used the GAIA system at the Supercomputing Center of the Korea Institute of Science and Technology Information [27], which contained 1536 CPU cores. The GAIA system runs Advanced Interactive eXecutive (AIX) by IBM, which supports the Message Passing Interface (MPI) [28]. Our C++ code supported MPI, and the co-expression data were computed on 512 CPU cores. To retrieve co-expression data, we set thresholds for co-expression values. To specify positive (top 1% of PCCs) and negative (bottom 1% of PCCs) thresholds for co-expressed gene sets, the distribution of PCCs among random gene pairs was assessed (Figure 1). The number of random gene pairs corresponded to the number of probes on the array (Table 2).
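The threshold selection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the C++/MPI implementation: the function name `pcc_thresholds`, the random seed and the simulated expression matrix are all hypothetical, and the top and bottom 1% are taken as the 99th and 1st percentiles of PCCs over randomly drawn probe set pairs.

```python
import numpy as np

def pcc_thresholds(expr, n_pairs, seed=0):
    """Estimate (negative, positive) PCC cut-offs from random probe set pairs.

    expr: 2-D array, rows = probe sets, columns = arrays.
    Returns the 1st and 99th percentiles of the sampled PCCs.
    """
    rng = np.random.default_rng(seed)
    n_genes = expr.shape[0]
    i = rng.integers(0, n_genes, size=n_pairs)
    j = rng.integers(0, n_genes, size=n_pairs)
    keep = i != j                      # drop self-pairs (PCC is trivially 1)
    i, j = i[keep], j[keep]
    # Standardize rows so that a row-wise dot product equals the PCC.
    z = expr - expr.mean(axis=1, keepdims=True)
    z = z / np.sqrt((z ** 2).sum(axis=1, keepdims=True))
    r = (z[i] * z[j]).sum(axis=1)
    return np.percentile(r, 1), np.percentile(r, 99)

# Simulated data standing in for a summarized expression matrix.
sim = np.random.default_rng(1).normal(size=(200, 30))
neg_cut, pos_cut = pcc_thresholds(sim, n_pairs=5000)
```

With real expression data the two cut-offs need not be symmetric around zero, which is why a separate threshold is derived for each species and platform.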

Clustering
For clustering, the gene expression values were used for analysis. We applied the k-means clustering method to the expression data, which assigned each point to the cluster whose center was nearest [29]. We used PCA to determine the number of clusters, k. The PCA was conducted using CLUSTER, so that the clusters were ordered and chosen to maximally explain the remaining variance in the data vectors [1]. Consequently, k-means clustering was run with the number of clusters determined for each species. The large amount of expression data required long clustering times. Therefore, we compiled the Parallel K-means Data Clustering code [30], which was executed on the AIX supercomputer system with MPI. The k-means algorithm provided the nodes of the co-expression network in PLANEX.
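The nearest-center assignment at the core of k-means can be shown with a short serial Python sketch. It is a plain Lloyd-style iteration, not the parallel MPI code cited above, and it does not reproduce the PCA-based choice of k; the function name `kmeans`, the seed and the toy points are assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Cluster points (rows) into k groups; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Assign each point to the cluster whose center is nearest.
        labels = d.argmin(axis=1)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two well-separated groups of expression profiles.
toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centers = kmeans(toy, k=2)
```

Each pass over the data is independent across points, which is what makes the assignment step straightforward to distribute over many processors.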

Mapping gene identifiers onto probe set IDs
The genome sequence and annotation project Phytozome was recently completed and released [31]. We clarified annotations and sequences of the species by downloading all Affymetrix GeneChip probe sequences [26] and mapping each probe against the nucleotide sequences of the genomes of six sequenced plants: Arabidopsis thaliana, Glycine max, Oryza sativa, Vitis vinifera, Solanum lycopersicum and Zea mays (Phytozome V9.0). In contrast, species whose genome sequences are still unfinished, such as Hordeum vulgare and Triticum aestivum, were mapped with Tentative Consensus sequences from DFCI. The probe matches were made using our own Perl script. The script string-matched each individual GeneChip probe of a given species (including its reverse complement) against the nucleotide sequences and returned a list of probe set affinities that corresponded to the sequences of each species. Specifically, Zea mays had 15 probe pairs per probe set, and all other plant species had 11 probe pairs per probe set.
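The exact string matching described above can be sketched in Python (the original script was written in Perl and is not reproduced here). The function names, the toy sequences and the short probe length are assumptions for illustration; real GeneChip probes are longer. A probe counts as matching only if it occurs in the target sequence, on either strand, with no mismatch or gap.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]

def count_matching_probes(probes, target):
    """Count probes that match the target exactly on either strand."""
    target = target.upper()
    hits = 0
    for probe in probes:
        probe = probe.upper()
        if probe in target or reverse_complement(probe) in target:
            hits += 1
    return hits

# Hypothetical toy data: one target sequence and a three-probe set.
target = "ATGGCGTACGTTAGC"
probe_set = ["GGCGTA", "CTAACG", "TTTTTT"]   # forward hit, reverse hit, no hit
hits = count_matching_probes(probe_set, target)
affinity = f"{hits}/{len(probe_set)}"        # matching probes over total probes
```

Reporting the match as a fraction of the probe set mirrors the idea of a probe set affinity: a set in which every probe matches is a more reliable representative of the gene than one in which only a few do.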

Gene ontology term assignment
Because GO terms form a hierarchical tree and are redundant, we mapped gene ontology (GO) terms against representative gene functions. The DFCI provided GO mapping annotation. Phytozome sequence annotation did not provide GO mapping annotation, but it did provide Pfam IDs; we therefore mapped the representative Pfam IDs to GO terms using the mapping of external classification systems to GO [32]. GO-TermFinder was used to estimate the enrichment of GO terms [33]. GO-TermFinder was integrated into PLANEX through a web interface, which evaluated enrichment in the principal GO categories (cellular component, biological process and molecular function) using the hypergeometric distribution and the False Discovery Rate (FDR) described by Benjamini and Hochberg.
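The two statistical ingredients named above can be illustrated with a small Python sketch. It is not GO-TermFinder and makes no claim about that tool's internals; the function names and the example numbers are assumptions. The first function gives the hypergeometric probability of seeing at least k annotated genes in a list, and the second applies the Benjamini-Hochberg step-up adjustment to a set of p-values.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when a list of n genes is drawn from N background genes,
    K of which carry the GO term, and k of the listed genes carry it."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical example: 2 of 4 listed genes carry a term held by 3 of 10 genes.
p = hypergeom_pvalue(2, 4, 3, 10)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In an enrichment test one such p-value is computed per GO term, and the adjustment is applied across all tested terms.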

Comparative analysis of co-expressed gene sets
Cohen's Kappa statistics were used to compare co-expression data between species [34]. An in-house module similar to the online DAVID tool [35,36] was used to evaluate co-expression similarity using Kappa statistics and was integrated through a web interface. A protein sequence was used to select two genes from among the species Arabidopsis thaliana, Glycine max, Oryza sativa, Vitis vinifera and Zea mays. After two query genes were submitted, the module compared the co-expression data sets of the query genes, which were converted to Pfam IDs [37]. Kappa measured the percentage of data values on the main diagonal of the agreement table and then adjusted those values for the amount of agreement that could be expected due to chance alone.
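A minimal Python sketch of Cohen's Kappa on a binary scale follows. It is not the in-house module; the function name `cohen_kappa`, the choice of a shared universe of Pfam IDs and the toy identifiers are assumptions. Each Pfam ID is scored as present or absent in each of the two co-expressed gene sets, observed agreement is the fraction on the main diagonal of the resulting 2x2 table, and chance agreement is computed from the marginal totals.

```python
def cohen_kappa(set_a, set_b, universe):
    """Cohen's Kappa for two present/absent annotations over a shared universe."""
    n = len(universe)
    both = sum(1 for x in universe if x in set_a and x in set_b)
    neither = sum(1 for x in universe if x not in set_a and x not in set_b)
    a_yes = sum(1 for x in universe if x in set_a)
    b_yes = sum(1 for x in universe if x in set_b)
    # Observed agreement: main diagonal of the 2x2 table.
    observed = (both + neither) / n
    # Agreement expected by chance, from the marginal totals.
    expected = (a_yes * b_yes + (n - a_yes) * (n - b_yes)) / (n * n)
    if expected == 1.0:
        return 1.0   # degenerate case: both sets are all-present or all-absent
    return (observed - expected) / (1.0 - expected)

# Hypothetical toy data: ten Pfam IDs, two co-expressed gene sets.
universe = [f"PF{i:05d}" for i in range(1, 11)]
set_a = set(universe[:5])                    # IDs 1 to 5
set_b = set(universe[:4]) | {universe[5]}    # IDs 1 to 4 and 6
kappa = cohen_kappa(set_a, set_b, universe)
```

A value of 1 indicates identical sets, a value near 0 indicates agreement no better than chance, and the result depends on which universe of IDs is taken as the background.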

System development
The web application of PLANEX was developed with Dancer (Perl web application framework) [38] for the server side and jQuery (JavaScript framework) [39] for the client side. The co-expression database combined MongoDB (document-oriented database) [40] and TokyoCabinet (database management library) [41]. MongoDB stored co-expression data as documents, making the integration of data in pairwise co-expression applications easier and faster. TokyoCabinet stored gene ID data under a single key and used hashing techniques to enable fast retrieval of the co-expression data of the query gene. This combination markedly improved the processing and access speeds of searches. We used Cytoscape Web [42] to display the network in internet browsers. Cytoscape Web does not require the installation of a plugin and works quickly in all kinds of browsers. PLANEX operates on an Ubuntu 10.04 [43] server equipped with a 2.66 GHz dual CPU and 8 GB RAM.

Utility and discussion
Web interface
PLANEX can be accessed through a user-friendly web interface (http://planex.plantbioinformatics.org/; see the Availability and requirements section) that provides three search menus: 'Co-expression search', 'Cluster network' and 'Co-expression gene compare' (Figure 2). The 'Co-expression search' can be used to retrieve co-expressed gene sets and PCC values. To search the database, an Affymetrix GeneChip ID or a representative gene ID is used with 'Search by IDs', or a pasted sequence is used with 'Search with BLAST' [44]; two or more representative gene IDs are used with 'Retrieve PCC with gene list' (Figure 3A). As shown in Figure 3A, a PLANEX search depends on the selection of options such as species, target, cut-off, BLAST program and e-value. The cut-off values for each species were determined from the distributions of random gene pairs. After a query is submitted to 'Search by IDs' or 'Search with BLAST', the probe match results page is shown. The probe match page indicates the number of probes matching the query over the total number of probes, as well as their affinity, shown as 'match' (Figure 3B). This probe match page helps discard probes that map redundantly to genes. PLANEX finds many co-expressed genes within the cut-off values (Figure 3C). Duplicated Affymetrix IDs are indicated in the 'Duplicated' section of the results page (Figure 3D). The co-expressed gene set can be downloaded in CSV format for analysis by GO-TermFinder. GO-TermFinder provides three GO term enrichment analyses with a hypergeometric p-value < 0.05 at FDR ≤ 10^-6 (Figure 3E). After submitting a query to 'Retrieve PCCs with gene list', the gene list will show the correlation in pairwise format (Figure 3F). For this search, PLANEX does not provide a probe match page; instead, it provides all potentially matching probe sets for the gene list, with their PCCs and affinity. The data are supported by GO-TermFinder, as in the other searches.
PLANEX allows co-expression network data to be displayed in a browser. The 'Cluster network' is based on k-means cluster analysis and PCCs and supports the 'Search by IDs' and 'Search with BLAST' functions (Figure 4A). In the network, nodes represent the results of the k-means cluster analysis, node size represents the number of edges, and edges represent PCCs (Figure 4B).
The Kappa statistics analysis tools in PLANEX can be used to compare co-expressed genes with other species, using the 'Co-expression gene compare' feature (Figure 5). It accepts only protein-annotated plant gene IDs from Arabidopsis thaliana, Glycine max, Oryza sativa, Vitis vinifera and Zea mays. Any two species can be compared using their representative gene IDs from Phytozome. The simple Kappa statistics coefficients show the agreement between two co-expressed gene sets, which is measured on a binary scale. This analysis is useful in comparative genomics to determine the similarity of co-expressed gene sets or the functional similarity of family genes. This approach provides a comparative analysis with a measurement commonly reported in the medical literature.
Figure 2 The homepage of PLANEX.

Discussion
PLANEX is a novel database that helps researchers study complex biological processes through co-expressed gene sets overlaid onto k-means clusters. ATTED-II, STARNET 2, RiceArrayNet and CoP provide co-expression relationships, but they contain only one to three sets of co-expression data. Therefore, an advantage of PLANEX is that it combines sets of co-expression data from eight different species. Additionally, it clusters and compares members of co-expressed gene sets. As far as we know, PLANEX is the only system that combines cluster and PCC data.
Another advantage of PLANEX is that probes were mapped against representative genes by string match instead of BLAST. Our probe match script produced positive results if each base in a probe sequence matched perfectly with the representative gene sequence without any gap.
One potential application in PLANEX is GO-TermFinder. We generated a Saccharomyces Genome Database (SGD) file format for each species. Model species like Arabidopsis thaliana and Oryza sativa have a large set of functionally annotated genes with GO terms supported by various experimentally-derived evidence codes. In contrast, other organisms only have annotations inferred through electronic annotation (e.g., Vitis vinifera and Zea mays) or completely lack functional annotation. Since we initially lacked functional GO data, we converted Pfam to GO IDs and built an SGD file for functional enrichment analysis. However, this mapping should be used only as a guide.
Our previous report on Oryza sativa genome duplication [45] gave the positive (top 1% of PCCs) threshold as 0.545, but we used 0.646 as the positive PCC threshold in Oryza sativa for this report. The criterion differs because we included more microarrays here than in the previous study, since we believed that more microarrays lend more significance to the expression study. Also, Aoki et al. [46] specified a minimum PCC value (0.55-0.66) for co-expressed gene retrieval to minimize false gene function relationships. We provided a specific threshold to retrieve co-expressed genes for each species, each of which showed a normal distribution of PCCs (Figure 1).
The 'Co-expression gene compare' tab on the PLANEX menu provides data for comparative genomics. The Arabidopsis genome is believed to contain similar gene numbers to the rice genome, and both have undergone a whole genome duplication event [47,48]. The use of Kappa statistics coefficients is expected to be in accordance with the degree of expression divergence of the data. Previously, we reported that the rice gene families evidenced a similar high degree of expression diversity between members using rice public microarrays [45]. The comparison of co-expressed genes may support the understanding of specialization in the direction of complex biological processes between members of a gene family over evolutionary time [49].

Conclusions
The small, but important, function of comparing co-expressed genes may provide clues to the molecular functional conservation or diversity between orthologous genes, particularly Poaceae family genes. PLANEX can be used to interpret results of co-expressed genes and also to perform detailed analyses in comparative genomics. PLANEX complements existing databases and tools such as ATTED-II, CoP and STARNET 2.

Availability and requirements
Project name: PLANEX
Operating system(s): Platform independent (tested on Windows, i386 Linux and Mac)