We have identified a large number (over 100) of AP2/ERF genes from diverse land plant lineages that are orthologous to a set of genes from Arabidopsis known as Cytokinin Response Factors or CRFs.
Sequence analysis and alignment of CRF genes, or genes containing a CRF domain as designated below, was initiated with the six previously identified CRF genes from Arabidopsis, CRF1-6, and two closely related Arabidopsis genes, At1g71130 and At1g22985, that we now designate CRF7 and CRF8, respectively. These sequences, together with previously identified CRF gene homologs from Rice (Nakano et al., 2006 [8]), allowed us to generate a basic consensus sequence that we used to gather further CRF sequences through a series of BLAST analyses. Additional searches with a representative species sequence within a genus or within specific genome sequencing efforts allowed us to generate a broader CRF protein consensus sequence, the CRF domain of which is presented in Figure 1. Sequences are identified simply by their generic name, followed by a number if more than one was identified per genus, unless a name was previously designated. The exception is the Solanum lycopersicum sequences, or SlCRFs (SlCRF1, also previously designated PTI6, is from SGN-U314347; SlCRF2 is from SGN-U329134; SlCRF3 is from SGN-U344182; and SlCRF4 is from SGN-U331355). Full gene names are included in Additional File 1. These results show that CRF genes are present throughout land plants and always contain a novel N-terminal CRF motif/domain and an AP2 DNA-binding domain near the middle of the protein, with a putative kinase phosphorylation site in the C-terminal region in roughly half of the sequences.
CRF Protein Domains
The AP2 domain, as has been previously detailed for CRF protein members, is centered on the base amino acid sequence AAEIRD**RR*R*WLGT*DTAEEAA, where the WLG amino acids are absolutely required for AP2/ERF domain binding to DNA and each * represents a non-specifically conserved amino acid [2, 8].
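As an illustration only, the base sequence above can be read as a simple pattern in which each * matches any single residue. The following Python sketch makes that reading concrete. The function names and the toy sequence are hypothetical and are not taken from the study, which gathered sequences by BLAST rather than by exact pattern matching. Real CRF proteins may deviate from the base sequence at individual positions, so an exact match of this kind is stricter than any search actually used.

```python
import re

# Base sequence quoted in the text; '*' marks a non-specifically conserved residue.
AP2_CORE_CONSENSUS = "AAEIRD**RR*R*WLGT*DTAEEAA"


def consensus_to_regex(consensus):
    """Compile a consensus string into a regex, mapping '*' to any one residue."""
    return re.compile("".join("[A-Z]" if c == "*" else c for c in consensus))


AP2_CORE_RE = consensus_to_regex(AP2_CORE_CONSENSUS)


def find_ap2_core(protein):
    """Return the (start, end) span of the first exact consensus match, or None."""
    match = AP2_CORE_RE.search(protein.upper())
    return match.span() if match else None


# Hypothetical toy sequence (not a real protein) with the consensus embedded.
toy_protein = "MKKS" + "AAEIRDPSRRGRVWLGTFDTAEEAA" + "LAYD"
print(find_ap2_core(toy_protein))
```

Because the pattern requires every fixed position to match, it is best treated as a quick sanity check on an alignment rather than as a search tool.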
While the specific sequence found in CRF protein members for this domain is quite similar to previously described alignments using ERF proteins from Arabidopsis and Rice, as seen in Nakano et al., 2006 [8], we have shown it to be present far beyond these two species, in fact occurring throughout land plants. While AP2 domains in general maintain a number of conserved amino acids that are required for their function, there is also specificity within a domain that can in some cases determine DNA sequence binding specificity. A prime example is the difference between the conserved amino acid sequence VAEIRE from the CBF/DREB subfamily of ERF proteins and AAEIRD from the ERF subfamily, which results in binding to DRE/CRT or GCC box cis-elements, respectively [2]. The sequence of the AP2 domain found in this study of CRF proteins indicates that they belong to the ERF protein subfamily, but also indicates a higher level of specificity within this group.
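The VAEIRE versus AAEIRD distinction can be expressed as a small lookup. This is a minimal sketch that assumes the two hexapeptides are the only signatures of interest. The function name and the toy fragments are hypothetical, and the sketch is not a substitute for the alignment-based classification described in the text.

```python
# Hexapeptide signatures and their cis-element targets, as stated in the text [2].
SUBFAMILY_SIGNATURES = {
    "VAEIRE": ("CBF/DREB", "DRE/CRT"),
    "AAEIRD": ("ERF", "GCC box"),
}


def classify_ap2_domain(ap2_domain):
    """Return (subfamily, cis-element) for the first signature found, else None."""
    sequence = ap2_domain.upper()
    for signature, assignment in SUBFAMILY_SIGNATURES.items():
        if signature in sequence:
            return assignment
    return None


# Hypothetical toy fragments, not real protein sequences.
print(classify_ap2_domain("GKWAAEIRDPS"))
print(classify_ap2_domain("GKWVAEIREPN"))
```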
We have also identified a novel domain of approximately 65 amino acids, present in all CRF proteins throughout all land plants, that we have designated the CRF domain. The consensus sequence of this domain, along with a sequence alignment from representative species, is shown in Figure 1, broken into two parts: the core CRF domain of about 40 AA (Figure 1B) and the TEH region of about 13 AA that precedes the core domain in nearly all sequences belonging to the TEH clade of CRF genes (Figure 1C, Figure 2).
The CRF domain is found in the N-terminal region of the protein and is always accompanied by an AP2 DNA-binding domain, roughly 60 AA C-terminal to the CRF domain position (Figure 1A). Therefore, CRF domain-containing proteins, or CRF proteins, are a subset of AP2/ERF proteins. We have identified CRF domain proteins in liverworts, mosses, lycopods, ferns, conifers and all major lineages of flowering plants. CRF domain-containing genes were not found in any species of green algae, including the completely sequenced genomes of Chlamydomonas, Micromonas (2 spp.) and Ostreococcus, despite the presence of clearly identifiable AP2/ERF domain proteins in these genomes. Additionally, while highly divergent AP2/ERF-like domains have been detected in some bacteria, no recognizable CRF domains were found in any sequence searches outside of the land plants mentioned above, suggesting that CRF domains are unique to this group.
In an attempt to ascribe a specific function to the CRF domain, we performed a motif analysis of the CRF domain sequence. This revealed no similarity to any motifs or domains of known function. The best similarity identified in BLAST analysis, which is very weak, is to the C-terminal region of potassium voltage-gated channel subfamily S member 3 (Kv9.3) proteins, such as KCNS3 from humans. However, the region of similarity on the potassium channel protein resides at the very end of the C-terminus, which is not involved in channel structure, protein-protein interaction, or potassium movement, but is a variable region of unknown function [11]. Within the C-terminal region of the CRF domain there is a stretch of amino acids rich in lysines and arginines, residues that are often involved in nuclear localization of proteins. However, there is no apparent alignment of amino acids in the CRF domain that corresponds to any known nuclear localization signal. A best guess at CRF domain function, from a basic analysis of the eight members with a CRF domain that have any ascribed function, would suggest a role in cytokinin regulation, since the six CRFs from Arabidopsis appear to be regulated by that hormone [10]. It is also possible that the CRF domain may be connected to pathogen resistance, as two non-Arabidopsis CRF domain genes, Pti6 from tomato and Tsi1 from tobacco, have been linked to pathogen resistance in gene overexpression studies [12–14]. Another possibility is that the CRF domain functions as a protein-protein interaction domain, allowing CRF domain-containing proteins to form hetero- or homodimers.
One other small motif, SP(T/V)SVL, was identified in roughly half of the CRF proteins for which we have identified full-length sequences. While a part of this motif has been previously noted for a few species, we found that this conserved six AA motif occurs in CRF genes across a broad range of land plants, including Selaginella [8, 9]. This motif is predicted to function as a putative MAP kinase phosphorylation site [8, 9]. Unlike the CRF domain, the SP(T/V)SVL motif is not specifically linked to either the AP2 or CRF domains (Figure 1). This SP(T/V)SVL motif can be found in 33 other non-CRF proteins in Arabidopsis alone, with a variety of functions, including several different types of transcription factors. Interestingly, about half of the genes whose protein contains this motif have also been shown to have altered expression through cytokinin treatment or in a cytokinin mutant background, suggesting that CRF proteins in general may have a role in cytokinin response (Additional File 2).
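The SP(T/V)SVL notation describes a six-residue pattern with T or V at the third position. As a hedged sketch of how such a motif could be located in a protein sequence, the Python snippet below translates it into a regular expression. The function name and the toy sequence are hypothetical, and this is not the search procedure used in the study.

```python
import re

# SP(T/V)SVL from the text: third position is either T or V.
SP_TV_SVL_RE = re.compile(r"SP[TV]SVL")


def find_sp_tv_svl(protein):
    """Return a list of (0-based start, matched residues) for each motif occurrence."""
    return [(m.start(), m.group()) for m in SP_TV_SVL_RE.finditer(protein.upper())]


# Hypothetical toy sequence (not a real protein) carrying both motif variants.
toy_protein = "MAAASPTSVLGGGSPVSVLK"
print(find_sp_tv_svl(toy_protein))
```

A match only shows that the residues are present; whether the site is actually phosphorylated is a separate experimental question.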
Phylogenetic Analysis of CRF proteins
CRF proteins from a wide range of land plant lineages can be readily aligned at the protein sequence level and were submitted to various phylogenetic analyses. The result is shown in the neighbor joining (NJ) tree in Figure 2, with species denoted as in Figure 1 and full gene names included in Additional File 1. In this tree there are two distinct clades that we have denoted as A and B, each of which contains sequences from diverse flowering plant lineages, with sequences from the relatively earlier branching land plant lineages at the base of the tree. This division of CRF genes into A and B clades is coincident with the presence or absence of a specific set of amino acids at the beginning of the CRF domain, here referred to as the TEH region. The TEH region of the CRF domain is well conserved and unique, found only in the CRF domains of clade A proteins, with some variability in size and sequence (Figure 1).
Within individual plant species, A clade (TEH) proteins appear to be about twice as numerous as B clade proteins: Arabidopsis (8 A clade members: 4 B clade members), Rice (6 A clade: 3 B clade), Vitis (6 A clade: 4 B clade), Populus (8 A clade: 3 B clade). Not surprisingly, A clade CRF sequences are identified in BLAST analyses roughly twice as frequently as B clade members, which lack a TEH region. A somewhat similar distinction of clades was observed in a cluster analysis using only the AP2 domain of all ERF proteins in Arabidopsis and, separately, in Rice (Nakano et al., 2006 [8]). In these studies the 'A clade' members in each species were identified as part of one of the major subgroups of ERF proteins (group VI), with the 'B clade' proteins being relegated to a smaller, related or 'like' group (group VI-L). The within-clade similarity of amino acid sequence across either the AP2 or CRF domain is quite marked and easily discerned by eye.
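The "about twice as numerous" statement can be checked directly from the counts quoted above. The short Python sketch below is only a worked arithmetic example using those four species. The variable and function names are hypothetical.

```python
# (A clade members, B clade members) per species, as quoted in the text.
CLADE_COUNTS = {
    "Arabidopsis": (8, 4),
    "Rice": (6, 3),
    "Vitis": (6, 4),
    "Populus": (8, 3),
}


def clade_ratios(counts):
    """Return the A:B ratio for each species."""
    return {species: a / b for species, (a, b) in counts.items()}


def pooled_ratio(counts):
    """Return the A:B ratio after summing members across all species."""
    total_a = sum(a for a, _ in counts.values())
    total_b = sum(b for _, b in counts.values())
    return total_a / total_b


for species, ratio in clade_ratios(CLADE_COUNTS).items():
    print(f"{species}: {ratio:.2f}")
print(f"Pooled: {pooled_ratio(CLADE_COUNTS):.2f}")
```

Per-species ratios range from 1.5 (Vitis) to about 2.67 (Populus), and pooling the four species gives 28 A clade to 14 B clade members, a ratio of exactly 2.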
As there has not been a previous analysis of CRF domain proteins including both monocots and eudicots, it is interesting to note that within the distinct A and B clades there is an additional, clear division of sequences between monocots and eudicots, which is particularly true for the CRF proteins in clade A (Figure 2). In clade B this appears to also be the general rule, with the one exception of a 'misplaced' Vitis CRF sequence.
The clustering analyses suggest that a number of duplication events accompanied the phylogenetic history of CRF proteins. The particularly striking A/B clade duplication appears to predate the divergence of monocots, magnoliids and eudicots, but happened at some point after the origin of flowering plants. Within each of the A and B clades there are additional duplication events that occurred prior to the branching off of the eudicots, leaving multiple clades of rosid, asterid and caryophyllid sequences in each. Finally, there are a number of species-specific duplications that have generated only slightly differentiated copies of CRF-domain loci. Interestingly, within sequenced genomes there are no known examples of CRF genes having arisen from tandem duplications, but this may change with the increasing number of plant genomes being sequenced. The possibility of some subfunctionalization or specialization of any of these CRF subclades will be an interesting area of future research.
Cytokinin regulation of previously unexamined CRFs
We attempted to make use of the phylogeny of so many newly identified CRF genes to further our understanding of potential CRF gene function. We examined four previously unexamined CRF genes from Tomato that we predicted, based on their phylogenetic placement, to be regulated by cytokinin.
Specific primers were generated for each gene such that RT-PCR could be performed on cDNA made from RNA transcripts from Tomato leaves with and without a cytokinin treatment. For simplicity we have further designated each of the four Tomato unigene constructs representing these genes as Solanum lycopersicum CRFs, or SlCRFs (SlCRF1, also previously designated PTI6, is from SGN-U579886; SlCRF3 is from SGN-U573201; SlCRF4 is from SGN-U574151; and SlCRF5 is from SGN-U583231). We were able to detect transcript from each of the four SlCRFs in a range of tissues (data not shown), but decided to focus on cytokinin-regulated expression in leaves. Transcript levels for all four SlCRFs were found to be induced in Tomato leaves treated for 2 hours with 5 μM cytokinin (benzyladenine) versus a carrier control (DMSO) (Figure 3). This induction is similar to that of some of the previously examined Arabidopsis CRFs and suggests that each of these SlCRF genes is regulated by cytokinin [10]. Not only is this a novel function not previously ascribed to any of these genes, but it is the first ascribed gene function for SlCRF3, 4, and 5. This highlights the potential power of a broad phylogenetic framework for determining the function of previously uncharacterized genes.