Development of a novel data mining tool to find cis-elements in rice gene promoter regions

Background: Information on more than 35 000 full-length Oryza sativa cDNAs, together with associated microarray gene expression data collected under various treatment conditions, has made it feasible to identify motifs that are conserved in gene promoters and may act as cis-regulatory elements with key roles under the various conditions.
Results: We have developed a novel tool that searches for cis-element candidates in the upstream, downstream, or coding regions of differentially regulated genes. The tool first lists cis-element candidates by motif searching, based on the supposition that if cis-elements play important roles in the regulation of a given set of genes, they will be statistically overrepresented and conserved. It then evaluates the likelihood scores of the listed candidate motifs by association rule analysis. This strategy depends on the idea that motifs overrepresented in the promoter region could play specific roles in the regulation of expression of these genes. The tool is designed so that any biological researcher can use it easily at a publicly accessible Internet site. We evaluated the accuracy and utility of the tool by using a dataset of auxin-inducible genes that have well-studied cis-elements. The test showed the effectiveness of the tool in identifying significant relationships between cis-element candidates and related sets of genes.
Conclusion: The tool lists possible cis-element motifs corresponding to genes of interest, and it will contribute to the deeper understanding of gene regulatory mechanisms in plants.


Background
With the completion of rice genome sequencing by the International Rice Genome Sequencing Project [1], the Beijing Genomics Institute (BGI) [2], and Syngenta [3], many rice functional genomic resources have become available, including whole genome sequences from ssp. japonica 'Nipponbare' and ssp. indica line 93-11; a set of rice full-length cDNA clones and their complete and partial end sequences [4,5]; microarray gene expression systems based on full-length cDNA sequences, ESTs (Expressed Sequence Tags), MPSS (Massively Parallel Signature Sequencing), SAGE (Serial Analysis of Gene Expression), and predicted genes in the genome sequences; and many kinds of insertion mutants with Tos17, Ac-Ds, and T-DNAs [6]. As analytical technology progresses, the database continues to be upgraded and serves as a useful resource for studying mechanisms that regulate gene expression.
Cis-elements in the promoter regions of genes and trans-acting transcription factors are major biological features to be characterized if we are to achieve an understanding of the systems that regulate gene expression. Identification of candidate cis-elements corresponding to genes is now practicable through the use of available sequence and genome mapping information, combined with information about the responses of genes to specific experimental conditions; such responses have been elucidated by using gene expression profiles now publicly available.
Exhaustive sequence analysis by using available public databases can identify cis-element candidate motifs for further examination, but such approaches are not very efficient. One confounding factor is that public databases are independently constructed and not generally optimized to facilitate integration of information from many sources with local experimental data. A more perplexing issue for experimental researchers who are not very familiar with bioinformatics techniques is the challenge of finding unknown but biologically notable relationships among genes, cis-elements, and experimental conditions from the huge number of possible combinations generated by large experimental data sets.
To resolve some of these issues, we developed a novel data mining tool to identify cis-elements in the rice genome. It performs the complex bioinformatics analysis mentioned above, then lists cis-element candidates for genes. The genes can be grouped by similarity of expression profiles and other criteria for assessment by researchers, then the tool annotates them with related public database information.
Similar tools have been developed previously. Van Helden released RSAT, which includes a program that can detect over-represented motifs in upstream regions of co-regulated genes [7]. Holt et al. established CoReg, which links the hierarchical clustering of co-expressed gene sets with frequency tables of promoter elements [8]. Zhao et al. established TRED, which integrates a database and a system for predicting cis- and trans-elements in mammals [9]. Galuschka et al. developed AthaMAP, which includes a program for comparative analysis of cis-elements in sets of co-transcribed genes of Arabidopsis thaliana [10].
Our tool is distinguished by several points: (i) It focuses on the rice genome, being based on full-length cDNAs, and is designed to pick up cis-element candidates associated with genes that users designate. (ii) It evaluates the likelihood score of cis-element candidates by comparing frequency counts in the user-selected gene set and a reference gene set. (iii) It can evaluate previously known cis-element sequences as well as user-specified sequences prepared by other analysis tools, and it can examine several cis-elements together.
The tool carries out both ab initio motif searches of promoter sequences and searches against known plant cis-elements, then performs a likelihood analysis of identified cis-elements on the basis of their presence in a significant proportion of the promoters of a given set of genes. This evaluation is achieved by an association rule analysis.
Here, we present technical details of the tool and demonstrate the practical assessment of its utility with a biologically relevant sample data set.

Implementation
The tool, called Rice Cis-Element Searcher (RiCES), consists of a cis-element searching pipeline, controlled via a Web-based user interface. Fig. 1 summarizes the procedure. The pipeline first reads a list of gene identifiers from the user, which it uses to retrieve the promoter sequences corresponding to the listed genes. Then a preliminary list of cis-element candidates is built by aligning information from the built-in list of plausible motifs, or by ab initio motif searching of the sequence data. Association rule analysis is carried out and reported to support the candidacy of the resulting cis-element list.

Gene list
RiCES assumes that a user has already identified genes of interest from experimental analysis (e.g. clusters of coordinately regulated genes). The list of identifiers is input into a Web-based data entry form. RiCES recognizes GenBank accession numbers, identifiers of transcription units (TUs) as defined in the TIGR pseudomolecular assemblies [11], and several other major gene identification systems. Using the list, it retrieves the set of associated upstream, downstream, or coding region sequences flanking the specified genes from available genomic sequence data.
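The retrieval of an upstream region from genomic sequence can be sketched as follows. This is a minimal illustration, not the RiCES implementation: the function names, the 1-based inclusive coordinate convention, and the default length of 1000 bp are assumptions made for the example.

```python
# Hypothetical helpers for illustration; not part of RiCES.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chromosome: str, start: int, end: int,
                    strand: str, length: int = 1000) -> str:
    """Return up to `length` bases immediately upstream of a gene.

    `start` and `end` are 1-based inclusive coordinates on the forward
    strand (an assumed convention). The result is written 5'->3' on the
    strand of the gene and is truncated at the chromosome ends.
    """
    if strand == "+":
        lo = max(0, start - 1 - length)
        return chromosome[lo:start - 1]
    if strand == "-":
        hi = min(len(chromosome), end + length)
        return reverse_complement(chromosome[end:hi])
    raise ValueError("strand must be '+' or '-'")
```

For a gene on the reverse strand, the upstream region lies at higher forward-strand coordinates, so the sketch takes the bases after `end` and reverse-complements them.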

Preliminary cis-element candidate list
The second step of the analysis is the compilation of a list of motifs as candidate cis-elements. At present RiCES supports two methods to achieve this.
The first method depends on ab initio motif searching based on the supposition that if there are cis-elements playing important roles in the regulation of a given set of genes, they will be statistically overrepresented in the associated promoter sequences as conserved motifs that can be identified by using a suitable motif search program. Several such programs, implementing different algorithms, are available. We have chosen to use MEME, a publicly available motif discovery program [12] based on an expectation maximization algorithm. In our analysis, MEME is invoked to identify motifs 6 to 8 bp long that appear highly conserved among promoter sequences of the selected genes. Users can modify some of the search parameters of the MEME program via the Web form.
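The supposition behind this step can be illustrated with a much simpler stand-in for MEME. The sketch below does not implement expectation maximization; it only enumerates exact 6 to 8 bp words and counts how many promoters contain each at least once. The function names and the `min_fraction` cutoff are assumptions made for the example.

```python
from collections import Counter


def kmer_sequence_counts(promoters, min_width=6, max_width=8):
    """Count, for each k-mer, the number of promoters containing it."""
    counts = Counter()
    for seq in promoters:
        seq = seq.upper()
        seen = set()  # each k-mer is counted once per promoter
        for k in range(min_width, max_width + 1):
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        counts.update(seen)
    return counts


def shared_kmers(promoters, min_fraction=0.5, min_width=6, max_width=8):
    """List k-mers present in at least `min_fraction` of the promoters."""
    counts = kmer_sequence_counts(promoters, min_width, max_width)
    n = len(promoters)
    shared = [kmer for kmer, c in counts.items() if c / n >= min_fraction]
    # Most widely shared first, ties broken alphabetically.
    return sorted(shared, key=lambda kmer: (-counts[kmer], kmer))
```

Unlike a probabilistic motif model, exact word counting cannot tolerate mismatches, so it would miss degenerate motifs that a program such as MEME can recover.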
The second method relies on the hypothesis that common, known cis-elements play important roles under the experimental conditions that gave rise to the list of genes specified by the user. Therefore, RiCES searches for matches to a pre-compiled list of known cis-elements.
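A search for a known element can be sketched as an exact scan of both strands. The function names and the choice to report 0-based forward-strand positions are assumptions made for this illustration; degenerate (IUPAC) motifs are not handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif(promoter: str, motif: str):
    """Return sorted (position, strand) pairs for exact matches.

    Positions are 0-based on the forward strand. A match to the reverse
    complement of the motif is reported with strand '-'.
    """
    promoter = promoter.upper()
    motif = motif.upper()
    rc = reverse_complement(motif)
    hits = []
    for strand, pattern in (("+", motif), ("-", rc)):
        if strand == "-" and rc == motif:
            continue  # palindromic motif: avoid reporting each site twice
        start = promoter.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = promoter.find(pattern, start + 1)  # allow overlaps
    return sorted(hits)
```

For example, scanning for TGTCTC also reports occurrences of GAGACA, its reverse complement, as minus-strand sites.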
Several databases of plant cis-elements are publicly available. PLACE [13] is one of the most popular databases of known cis-elements in plant genomes. AtcisDB, a part of AGRIS [14], includes information on cis-elements involved in gene regulation in Arabidopsis thaliana.
Although these databases are extremely useful resources, it is not straightforward to cross-link information from them directly to the researcher's own data. Current databases are not exhaustive enough to distinguish 'core' motifs, which decide the function of cis-elements, from co-existing sequences in neighboring regions. As a result, many cis-element sequence data in these databases include superficial core motifs for which no evidence of functionality has been obtained. The use of such data hinders effective informatic analysis.
[...] gel shift assays and footprint analyses, categorized by transcription factor, and documented with respect to known activity in the plant genome. Some cis-elements known only in organisms other than plants are also listed, in consideration of their possible, albeit unknown, roles in plants. The database includes four types of cis-elements: (1) G-box and E-box, which bind to common sequences such as bHLH or bZIP in many organisms; (2) A-box, T-box, and GGTTTAG repeats, which bind to common sequences in many organisms, such as homeodomain and Myb; (3) CArG boxes and GCC-box, which bind to plant MADS, zinc finger, and AP2/EREBP elements; and (4) other cis-elements, binding only in animals, such as HSF, PcG, and HMG.

Association rule analysis
The third step of the analysis is the likelihood evaluation of the cis-element candidates by association rule analysis, which is a data mining method designed to discover significant relationships between pairs of characteristics observed in data sets. Candidates showing the highest likelihood (specificity) are retained in the final cis-element candidate list.
Association rule analysis has been applied to mechanisms that regulate gene expression (e.g. [15,16]). We used it to find relationships between identified cis-elements and gene expression profiles. The strategy depends on the idea that motifs overrepresented in the promoter region of the genes of interest could play specific roles in regulation of the expression of those genes.
Implied cause-and-effect relationships documented as 'rules' are evaluated by using several well-known indices of likelihood, including support, confidence, and lift [15]. On the basis of sample data sets, the lift index appeared to best discriminate significant relationships between experimental conditions and cis-element candidates.
For a rule stating that the presence of motif X in a gene implies that the gene is a member of group Y, lift is the ratio of the posterior probability (the probability that the gene is in group Y given that it possesses motif X) to the prior probability (the probability that a gene is in group Y, irrespective of whether it possesses X). When lift > 1.0, X and Y co-occur more often than would be expected if they were independent, which suggests some relationship, possibly causal, between them. When lift ≤ 1.0, the co-occurrence is not considered probabilistically significant. Consequently, we set the default threshold of lift to 1.0, and cis-element candidates are included in the final candidate list only if their lift value is higher than this threshold.
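The three indices can be computed from simple set counts. The sketch below assumes the standard association-rule definitions, with lift taken as P(Y|X) / P(Y); the function name and the representation of genes as sets of identifiers are assumptions made for the example, not the RiCES code.

```python
def association_metrics(genes_with_motif, genes_in_group, all_genes):
    """Return support, confidence, and lift for the rule X -> Y.

    X: the gene possesses the motif. Y: the gene belongs to the group.
    All probabilities are estimated as frequencies over `all_genes`
    (the reference gene set).
    """
    universe = set(all_genes)
    x = set(genes_with_motif) & universe
    y = set(genes_in_group) & universe
    if not universe or not x or not y:
        raise ValueError("reference set, motif set, and group must be non-empty")
    both = len(x & y)
    support = both / len(universe)   # P(X and Y)
    confidence = both / len(x)       # P(Y | X), the posterior
    prior = len(y) / len(universe)   # P(Y), the prior
    return {"support": support, "confidence": confidence,
            "lift": confidence / prior}
```

As a worked case, with 100 reference genes, 20 genes in the group, and a motif found in 25 genes of which 10 are in the group, confidence is 10/25 = 0.4, the prior is 20/100 = 0.2, and lift is 2.0, so the candidate would pass the default threshold of 1.0.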
RiCES also evaluates pairwise combinations of motifs in the preliminary candidate list (upper right-hand box in Fig. 1), in consideration of possible protein-protein interactions of multiple transcription factors binding cis-elements, as illustrated by experimental evidence [17,18].
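One way to picture the pairwise evaluation is to treat "the gene possesses both motifs" as the antecedent of the rule and to compute lift as for a single motif. The sketch below follows that assumption; the function name, the dictionary layout, and the treatment of co-occurrence without any distance constraint are illustrative choices rather than a description of RiCES.

```python
from itertools import combinations


def pair_lifts(motif_to_genes, genes_in_group, all_genes, threshold=1.0):
    """Return {(motif_a, motif_b): lift} for pairs with lift > threshold.

    `motif_to_genes` maps each motif to the genes whose promoters
    contain it. A gene supports a pair only if it contains both motifs.
    """
    universe = set(all_genes)
    group = set(genes_in_group) & universe
    if not universe or not group:
        raise ValueError("reference set and group must be non-empty")
    prior = len(group) / len(universe)
    results = {}
    for a, b in combinations(sorted(motif_to_genes), 2):
        both = set(motif_to_genes[a]) & set(motif_to_genes[b]) & universe
        if not both:
            continue  # the pair never co-occurs, so no rule can be scored
        lift = (len(both & group) / len(both)) / prior
        if lift > threshold:
            results[(a, b)] = lift
    return results
```

Because the number of pairs grows quadratically with the length of the candidate list, restricting the evaluation to the preliminary candidates keeps the search tractable.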

Output
The final cis-element candidate list is presented as an association table with the identifier of the submitted genes (TU identifiers based on TIGR gene model annotation are used in the current version) annotated with any available corresponding information from RiceCyc [19] and Gene Ontology [20]. RiCES also provides information on candidate motifs, including the positions of the element in the promoter regions of corresponding TUs, the sequence, and related information from AtcisDB [14]. The position of the cis-element candidates is also presented in both text and graphics.

Validation
To test whether or not the output of RiCES was meaningful, we validated it with a list of auxin-inducible genes with known characteristics, compiled from RiceTFDB 2.0 [21]. First, Aux/IAA genes stored in RiceTFDB were applied as queries in a BLASTN search [22] of GenBank, returning a list containing 28 rice TUs [See Additional file 2]. These genes were fed into the pipeline. When the MEME program was called, the length of target motifs was set to 6, 7, or 8 bases, the number of occurrences of each motif was set to 7, 14, or 21, and the search algorithm was set to 'zoops' to check zero or one occurrence per sequence. The outputs of each option setting were merged but not otherwise filtered.
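The nine option settings described above can be enumerated programmatically. The sketch below only builds command lines and does not run them. It assumes the MEME command-line options `-dna`, `-mod`, `-w`, `-nsites`, and `-oc`; whether the validation used exactly these options is not stated in the text, and the FASTA file name and output directory names are hypothetical.

```python
from itertools import product


def meme_commands(fasta_path, widths=(6, 7, 8),
                  site_counts=(7, 14, 21), model="zoops"):
    """Build one MEME argument list per (width, site count) setting."""
    commands = []
    for width, nsites in product(widths, site_counts):
        out_dir = f"meme_w{width}_n{nsites}"  # hypothetical naming scheme
        commands.append([
            "meme", fasta_path, "-dna", "-mod", model,
            "-w", str(width), "-nsites", str(nsites), "-oc", out_dir,
        ])
    return commands
```

Each argument list could then be passed to `subprocess.run` on a machine where MEME is installed, and the motifs from the nine output directories merged afterwards.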

Results and Discussion
Many Aux/IAA genes are auxin-inducible [23] and contain the TGTCTC element [24]. This element is commonly found in the upstream region of auxin-responsive genes. Thus, the detection of all instances of the motif by the pipeline could serve as a validation of the pipeline algorithm. The auxin-responsive element (AuxRE) containing the TGTCTC motif in some cases requires another proximal AuxRE for biological activity [17,25]. In other contexts, AuxRE functions only when it occurs with its palindromic components separated by 7 or 8 nucleotides [26].
[...] (Table 1), which is derived from the report of Plesch et al. [27], describing auxin-induced expression of the Arabidopsis prha homeobox gene. Another 4 motifs contained the TGTCTC element. The result was consistent with previous work, as TGTCTC was listed as a candidate in the single motif search of Aux/IAA genes.
Table 2 shows the result of the validation test with a precompiled cis-element list generated by the test gene list. The analysis returned 22 cis-element candidates with lift > 1.0 [See Additional file 5 and 6]. Some of these candidates were suggested by previous studies to have some kind of relationship to auxin response. For example, RAV1 was found in the promoter region of ABP, which encodes an auxin-binding protein [28]. Expression of LEAFY (LFY) is affected by the auxin gradient in Arabidopsis [29]. ETT is another auxin response factor [30], and LFY and ETT expression are closely correlated [18,31].
The position of a cis-element is important information to consider in relation to the function of the cis-element. For biological activity to occur, the distance of some cis-elements from the coding region or other collaborating elements is constrained. To this end, RiCES highlights the distribution of cis-element candidates. It provides tables of identified cis-element motifs and graphical motif maps to help researchers grasp positional relationships among the candidate elements.
The positions of the listed elements, some of which include TGTCTC, varied among upstream regions of genes (Fig. 2), and it was hard to detect any skewed distribution of motifs. Goda et al. [32] studied the distribution of TGTCTC motifs in the genome of A. thaliana, and pointed out that 25% of investigated genes had TGTCTC motifs in the upstream region within 1000 bp of the start codon, and 14% within 500 bp. Our results do not appear to conflict with theirs.
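A positional summary of this kind can be sketched with two small helpers: one that counts motif occurrences in fixed-width bins of the upstream region, and one that reports the fraction of genes with at least one occurrence inside a window. The function names and the convention that positions are distances upstream of the start codon (1 = adjacent base) are assumptions made for the example.

```python
def position_histogram(positions, region_length=1000, bin_size=10):
    """Count motif positions in consecutive bins of `bin_size` bp.

    Positions outside 1..region_length are ignored.
    """
    n_bins = -(-region_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for p in positions:
        if 1 <= p <= region_length:
            counts[(p - 1) // bin_size] += 1
    return counts


def fraction_of_genes_within(hits_by_gene, window):
    """Fraction of genes with at least one motif within `window` bp."""
    if not hits_by_gene:
        raise ValueError("no genes supplied")
    n = sum(1 for positions in hits_by_gene.values()
            if any(1 <= p <= window for p in positions))
    return n / len(hits_by_gene)
```

Calling the second helper with windows of 1000 and 500 bp gives figures that can be set beside the percentages reported for A. thaliana.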
TGTCTC motifs are scattered over wide regions of many plant species (Table 3). It is possible that the variety of the roles of genes reflects the variety of mechanisms regulating gene expression and positions of cis-elements, even if the genes in question can be classified as 'auxin-responsive genes' in a larger sense.
A major research concern is how to pick up cis-element candidates worthy of further experimentation. Computational and manual selection of cis-element candidates should play complementary roles to resolve this issue. It should be emphasized that cis-element candidates listed by RiCES are rated according to the likelihood provided by association rule analysis. On the other hand, researchers can check the significance of candidates in detail by using related information derived from several databases. The supported databases include AGRIS, Gene Ontology, and RiceCyc, as well as the map information described above. Fig. 3 is an example of the output for the TGTCTC motif. The outputs are not only easily accessible in a Web browser, but are also usable in further statistical or bioinformatics analysis, as they are also provided in XML format (Fig. 3A), which is a tagged plain-text format compatible with various computer programs.
[Figure caption fragment: ... Table 1) was searched in the 1000-bp upstream region of genes, and frequency was counted in segmented regions at an interval of 10 bp. The X-axis represents the position in the upstream region, and the bars designate frequency of motifs (counted after distribution of multiple regions was merged).]
In some cases, the results of the analysis from the precompiled list of elements will be easily comparable with prior knowledge. In other cases involving solely ab initio evidence from MEME, the results of motif searches should be interpreted carefully, because the result will change considerably in accordance with the options selected. An appropriate set of motif search options should be determined each time, by trial and error. However, as described above, a motif search can find cis-element candidates of which the sequences do not exactly match those of known cis-elements.
Although RiCES is focused on the role of cis-elements in Oryza sativa ssp. japonica, the methodology can be applied easily to studies of other plant species, or of other genome sequence motifs involving gene expression regulation, such as motifs in coding regions of genes or downstream of the gene sequence. Such work can be made possible by replacing the reference data set containing whole genes of rice with other data sets.

Conclusion
We presented here a newly developed tool to search for cis-element candidates in a list of genes. A case study showed the applicability of the tool. The tool is easy to use and publicly available. We expect that its use will deepen understanding of the mechanisms that regulate gene expression in plants.
[Table fragment: [37] Arabidopsis amino acid transporters (AAPs); AAP8 is probably responsible for import of organic nitrogen into developing seeds. [45]]
*) Numbers are equivalent to those shown in the main text.
License: Freely available for use
[...] study and participated in its design and coordination. All authors read and approved the manuscript.