Data collection and processing
The normalized transcriptomic datasets for Arabidopsis thaliana were downloaded from AraGenNet , which contains genome-scale gene-expression data collected under 351 non-redundant conditions. The original datasets are Affymetrix ATH1 Arabidopsis microarray datasets (22,810 probe sets × 1,428 ATH1 microarrays) in TAIR (http://www.Arabidopsis.org). The probe sets in this dataset represent 21,031 Arabidopsis genes among which (a) 1,558 are annotated transcription factors by the DATF database (Database of Arabidopsis Transcription Factors)  and (b) 810 matched biunique known PCW genes according to the Purdue database  except for four GT family 43 genes. The genome sequences of Arabidopsis (version 9), Populus (version 2.0) and Rice (version 6.1) and associated annotations, including protein-encoding sequences and intergenic regions, were obtained from TAIR, Phytozome (http://www.phytozome.net/poplar) and RGAP (rice.plantbiology.msu.edu), respectively. The basic data processing was done using in-house Perl scripts; and statistical analyses were performed using the R package (http://www.r-project.org).
Bi-clustering analysis of gene expression data
To identify genes that are co-expressed with known PCW genes, we adopted a two-step bi-clustering approach to analyze the aforementioned microarray dataset, which is represented as a 21,031 × 351 matrix, a required format by the QUBIC program . The key algorithmic idea of the QUBIC program is based on the graph representation of a microarray dataset, converting the bi-clustering problem into a graph problem .
A seed-containing matrix (810 × 351) was extracted from this matrix, where 810 is the number of the known PCW genes, called seeds, and 351 is the number of experimental conditions. In first step, we run QUBIC on the seed-containing matrix to identify co-expression bi-clusters among the seed genes. In the second step, we run QUBIC on the large matrix (21,031 × 351) to grow the identified bi-clusters on the seed matrix, i.e. to recruit additional genes that are co-expressed with the seed bi-clusters under the same conditions.
Most microarray analysis programs take discretized data matrix to reduce the computation complexity. We have also discretized all the expression values into three levels, -1, 0, 1, representing down-, no- and up-regulation, respectively. QUBIC provides the flexibility in discretizing expression levels ranging from –K to + K, for any fixed positive integer K . We found that K = 1 works well for our study. QUBIC uses a parameter c within [0, 1] as a threshold for controlling the consistency level of the expression patterns among the co-expressed genes within a bi-cluster. To find an appropriate c value, we performed a simulation study, which suggests that the c value between 0.7 and 0.98 should give the best performance result for our bi-clustering analysis; hence we have carried out a grid-based search for an optimal c value within this range using 0.05 as the increment. Specifically we have searched for a two-value (c
) combination that gives the best AUC (area under curve) value for the receiver operating characteristic (ROC) curve analysis [54, 55] (See Additional file 1: Table S13, S14, and support information for details).
Construction of co-expression networks and modules
Genes in a bi-cluster are co-expressed under a sub-set of the 351 experimental conditions. To assess the similarity level of a detected co-expression bi-cluster, we have examined the correlation between the expression patterns of each pair of genes in the same bi-cluster. Specifically, for each bi-cluster we calculated the Spearman’s correlation coefficient rho between the expression patterns of each pair of genes under the conditions associated with the bi-cluster. Note here we used the actual expression values instead of the discretized data (i.e. -1, 0 and 1). Gene pairs with rho > 0.7 (positive co-expression) or < −0.7 (negative co-expression) were considered as significantly co-expressed. This cutoff has been used by numerous published papers [11, 56, 57]. A bi-cluster was removed from further consideration if none of its gene pairs satisfy this cutoff.
For each bi-cluster passing this test, we constructed a co-expression network using Cytoscape  as follows: each node in the network represents a unique gene and each edge represents two genes with similar gene-expression patterns above the rho threshold under the conditions of the current bi-cluster. It should be noted that not all genes are equally co-expressed within a network; and each network generally consists of multiple clusters of highly co-expressed genes while inter-cluster co-expression relationships tend to be substantially weaker, hence having sparse edges. To identify all clusters of highly co-expressed genes within a network, we have applied a popular graph-based clustering algorithm "Molecular complex detection" (MCODE) , a plug-in of Cytoscape, to identify all (non-overlapping) clusters of highly co-expressed genes, each called a co-expression module. Specifically, each module is a connected sub-network with a substantially higher density of edges within the sub-network compared to the density between the sub-network and the rest of the network. The default scoring parameters in MCODE have been optimized to fit the average network well and hence we used them (see the manual of MCODE for details). Note that not all genes in a network are assigned to a co-expression module. It is the specified density level that determines which genes are selected or not. Actually we used this strategy to get rid of accidental predictions of co-expressed genes. When setting the density threshold, we intentionally set it high enough to rule out as many such accidental predictions as possible, which could also exclude some real co-expressed genes.
The final set of co-expression modules are derived from all the networks representing the bi-clusters identified above. Since some of the bi-clusters may have overlaps, i.e., some genes may be co-expressed with different sets of genes under different conditions. Hence the final set of co-expression modules may have overlaps. Such information allows us to infer the cellular-level functional relationship among co-expression modules containing overlapping genes.
Prediction of conserved motifs
To determine if co-expressed genes in the same module are transcriptionally co-regulated, we have examined if they share conserved cis regulatory elements in their promoters. To this end, we have implemented a new pipeline, co-expression gene motif discovery (CGMD), to identify conserved sequence motifs in the promoter sequences of the relevant genes through integration of the prediction results by multiple algorithms, detailed as follows.
To acquire the promoter sequence of each gene in a co-expression module, we extracted an upstream region of 2,000 bps from the translation start site; we did not use the transcription start for this purpose since the current prediction of transcription start sites tends to be not very accurate. In addition, we used a 2,000 bps sequence as the core promoter because the length of a plant promoter is typically about 1,000 bps, plus the length of a 5’ un-translated region in Arabidopsis could be as long as 1,000 bps as our data showed (Additional file 2: Figure S3a).
For motif prediction, we used the following three prediction programs: WeederTFBS 1.4.2 , MotifSampler 3a [61, 62] and PhyloCon 3.2 . These programs were selected because of their recognized strong performance as well as the complementary nature among the programs . WeederTFBS allows the motif length to be 6, 8, 10, or 12 bps long, and it outputs the 15 highest scoring motifs for each run; to-be-identified motifs were assumed to appear in all the underlying sequences; and each motif was allowed to appear more than once in a sequence. MotifSampler uses a prior probability in finding a motif, and sets the default length of the predicted motif at 8 bps. PhyloCon requires phylogenetic information for its motif prediction (the other two do not) so we need to provide orthologs of each concerned Arabidopsis gene in Populus and Rice, which we did using the bi-directional best hit approach  and predicted each motif that is conserved across the three orthologous sequences. For promoter sequences in the other two genomes, we extracted an upstream sequence of 2,000 bps for each Populus gene and an upstream sequence of 4,000 bps for each Rice gene from the translation start site of the gene. The reason is that for the Rice genome, a 5’ un-translated region could be as long as 3,000 bps while for Populus, its 5’ un-translated region is no more than 1,000 bps (Additional file 2: Figure S3b-c).
We have used CompariMotif  to integrate all the predicted motifs by the three programs, particularly highly similar predictions among the co-expression modules. Specifically, a similarity score for each pair of predicted motifs was calculated as the number of matched positions divided by the length of its maximum align-able positions between the two motifs. Based on this score, we then used MCL v10-201  to cluster all the predicted motifs into groups, each of which has a similarity score above a predetermined threshold (the granularity parameter of MCL set at 4). We then aligned the motifs within each group (or cluster) using MAFFT v6.603b , and calculated a consensus sequence from the gapless multiple-sequence alignment of the motifs using the cons program of EMBOSS v6.2.0  and used such consensus sequence as the representative of each motif group.
To annotate the function of such motifs, we have compared the resulting motifs from the above analysis with the known motifs in the two plant motif databases: AGRIS  and PLACE  by using CompariMotif. For motifs in the two databases, we also performed an integration of the best representative from each cluster as done above. For each pair of compared motifs, if their similarity score is > 4 and the percentage of their matched positions >80%, they were considered as essentially the same motif.
To assess the statistical significance of a predicted consensus motif, we have compared the numbers of the known motifs in AGRIS and PLACE matched by the predicted motifs using two different methods, which are separately based on co-expression genes and groups of arbitrarily selected genes from the whole genome of Arabidopsis. Specifically, we created 1,000 arbitrary gene groups with the same size as the average size of all the co-expression modules under consideration. For each such gene group, we predicted motifs using the above procedure (WeederTFBS only). To be consistent, we did motif prediction for the co-expressed genes using WeederTFBS only for this comparison purpose. Our null hypothesis is that the proportion of the known motifs matching the predicted motifs among the co-expressed genes is the same for that of the arbitrarily selected genes. A Chi-square test was employed to test this hypothesis . Based on the Chi-square test p-value on the given datasets, the hypothesis can be rejected or accepted.