Phylogenetic analysis and classification of the Brassica rapa SET-domain protein family
© Huang et al; licensee BioMed Central Ltd. 2011
Received: 4 November 2011
Accepted: 14 December 2011
Published: 14 December 2011
The SET (Su(var)3-9, Enhancer-of-zeste, Trithorax) domain is an evolutionarily conserved sequence of approximately 130-150 amino acids, and constitutes the catalytic site of lysine methyltransferases (KMTs). KMTs perform many crucial biological functions via histone methylation of chromatin. Histone methylation marks are interpreted differently depending on the histone type (i.e. H3 or H4), the lysine position (e.g. H3K4, H3K9, H3K27, H3K36 or H4K20) and the number of added methyl groups (i.e. me1, me2 or me3). For example, H3K4me3 and H3K36me3 are associated with transcriptional activation, but H3K9me2 and H3K27me3 are associated with gene silencing. The substrate specificity and activity of KMTs are determined by sequences within the SET domain and other regions of the protein.
Here we identified 49 SET-domain proteins from the recently sequenced Brassica rapa genome. We performed sequence similarity and protein domain organization analysis of these proteins, along with the SET-domain proteins from the dicot Arabidopsis thaliana, the monocots Oryza sativa and Brachypodium distachyon, and the green alga Ostreococcus tauri. We showed that plant SET-domain proteins can be grouped into 6 distinct classes, namely KMT1, KMT2, KMT3, KMT6, KMT7 and S-ET. Apart from the S-ET class, which has an interrupted SET domain and may be involved in methylation of nonhistone proteins, the other classes have characteristics of histone methyltransferases exhibiting different substrate specificities: KMT1 for H3K9, KMT2 for H3K4, KMT3 for H3K36, KMT6 for H3K27 and KMT7 also for H3K4. We also propose a coherent and rational nomenclature for plant SET-domain proteins. Comparisons of sequence similarity and synteny of B. rapa and A. thaliana SET-domain proteins revealed recent gene duplication events for some KMTs.
This study provides the first characterization of the SET-domain KMT proteins of B. rapa. Phylogenetic analysis data allowed the development of a coherent and rational nomenclature of this important family of proteins in plants, as in animals. The results obtained in this study will provide a base for nomenclature of KMTs in other plant species and facilitate the functional characterization of these important epigenetic regulatory genes in Brassica crops.
Epigenetic regulation acts through heritable changes in genome function that occur without a change in DNA sequence. One well-known epigenetic mechanism is posttranslational covalent modification of histones; these modifications include acetylation, methylation, ubiquitylation and others, and form the basis of the 'histone code' for gene regulation. Histone lysine methylation plays a pivotal role in a wide range of cellular processes including heterochromatin formation, transcriptional regulation, parental imprinting, and cell fate determination. At least six lysine residues, five on histone H3 (K4, K9, K27, K36, K79) and one on H4 (K20), are subject to methylation. Each lysine can carry one, two or three methyl groups, known as mono-, di- and tri-methylation, respectively. In general, di-/tri-methylation of H3K4 and H3K36 correlates with transcriptional activation, whereas di-methylation of H3K9 and tri-methylation of H3K27 correlate with gene silencing in plants and animals [2, 3].
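The general associations described above can be written down as a small lookup table. The following Python sketch is only an illustration: the names MARK_ASSOCIATION and describe_mark are hypothetical, and the table records the broad trends stated in the text, not rules that hold at every locus.

```python
# General trends stated in the text; keys are (residue, methylation state).
MARK_ASSOCIATION = {
    ("H3K4", "me2"): "activation",
    ("H3K4", "me3"): "activation",
    ("H3K36", "me2"): "activation",
    ("H3K36", "me3"): "activation",
    ("H3K9", "me2"): "silencing",
    ("H3K27", "me3"): "silencing",
}


def describe_mark(mark: str) -> str:
    """Split a mark such as 'H3K27me3' and look up its general association."""
    residue, _, level = mark.partition("me")
    state = "me" + level
    return MARK_ASSOCIATION.get((residue, state), "not covered by this summary")


for mark in ("H3K4me3", "H3K27me3", "H3K79me2"):
    print(mark, "->", describe_mark(mark))
```

Marks that the text does not assign to either category, such as H3K79 methylation, fall through to the default string and are not guessed.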
All known lysine methylation modifications, with the exception of H3K79 methylation, are carried out by methyltransferases that contain an evolutionarily conserved SET domain, named after three Drosophila genes (Su(var)3-9, E(z), and Trithorax). The SET domain encompasses approximately 130-150 amino acids that form a knot-like structure and constitute the enzyme catalytic site for lysine methylation. In addition to the SET domain, flanking sequences, more distant protein domains, and possibly some cofactors are also important for enzyme activity and specificity. The genes encoding SET-domain proteins are ancient, existing in prokaryotes and eukaryotes, but have proliferated and evolved novel functions connected with the appearance of eukaryotes.
The first plant genes encoding SET-domain proteins to be genetically characterized were CURLY LEAF (CLF) and MEDEA (MEA) in Arabidopsis thaliana [7, 8]. Chromatin-binding properties and histone methylation activity of plant SET-domain proteins were first reported for tobacco NtSET1 and Arabidopsis KRYPTONITE (KYP) [9, 10]. Phylogenetic analysis of plant SET domain proteins has proven helpful as a guide for genetic and molecular studies of this large family of proteins [11, 12]. To date, some of the Arabidopsis SET-domain family members have been characterized and shown to play crucial functions in diverse processes including flowering time control, cell fate determination, leaf morphogenesis, floral organogenesis, parental imprinting and seed development [3, 13–15].
Genome sequences of an increasing number of plant species, in addition to the model plants (Arabidopsis thaliana, Oryza sativa, and Brachypodium distachyon), have now been completed. Brassica species are of particular interest because of their agro-economical importance and their close relationship with Arabidopsis, which makes them informative about recent SET-domain gene amplification during the evolution of Brassica species. Here, we identified and analyzed 49 SET-domain proteins from the recently completed Brassica rapa whole genome sequence. Our data provide a platform for future functional characterization of these important epigenetic regulatory genes in Brassica species.
List of green lineage SET-domain proteins analyzed in this study
The plant KMT1A subclass proteins show high sequence similarity to the animal KMT1 proteins both within the SET domain and in the surrounding regions known as the Pre-SET and Post-SET domains. Additionally, most of the plant proteins contain a specific domain named SRA (SET and RING associated). Similar to previously studied Arabidopsis proteins [12, 20], most of the BrKMT1A proteins also contain SRA, Pre-SET, SET and Post-SET domains (Figure 2). Some of these domains are missing in some of the Group-4 proteins; for example, BrKMT1A;4e, BrKMT1A;4f and BrKMT1A;4g lack a Post-SET domain, and BrKMT1A;4h lacks SRA, Pre-SET and Post-SET domains (Figure 2). Several functions have been reported for SRA domains, including binding to the N-terminal tail of histone H3 and to DNA carrying cytosine methylation. The crystal structure of AtKMT1A;3a/SDG9/SUVH5 revealed that SRA recognizes the methylation status of CG and CHH sequences. The Pre-SET domain contains 9 conserved cysteines. The Post-SET domain is a small cysteine-rich region often found at the C-terminal side of SET domains. Both Pre-SET and Post-SET domains have been shown to affect histone methyltransferase activity of the SET domain [23, 24].
Members of the plant KMT1A subclass, like animal KMT1 proteins, are likely to be responsible for H3K9 methylation, an epigenetic mark involved in heterochromatin formation and gene silencing. Consistent with this, analysis of AtKMT1A;1/SDG33/SUVH4/KYP, AtKMT1A;2a/SDG3/SUVH2, AtKMT1A;3a/SDG9/SUVH5 and AtKMT1A;3b/SDG23/SUVH6 has revealed their important roles in H3K9 methylation, in heterochromatic gene silencing and in cross-talk between H3K9 and DNA methylation [9, 21, 22, 25–28]. Work in rice also confirmed that several members of this subclass are involved in H3K9 methylation and in transposon silencing [29–31]. Some of the BrKMT1A genes might also have similar functions.
The KMT1B subclass differs from the KMT1A subclass in protein domain organization; specifically, these proteins lack the SRA domain (Figure 3). A recent study demonstrated that AtKMT1B;1/SDG31/SUVR4 possesses H3K9-methyltransferase activities and its binding with ubiquitin converts H3K9me1 to H3K9me3 deposition on transposon chromatin . Notably, the WIYLD domain, which binds ubiquitin, is conserved in BrKMT1B;1a, BrKMT1B;1b, BrKMT1B;2a and BrKMT1B;2b (Figure 3). It was reported that AtKMT1B;3/SDG6/SUVR5/AtCZS is involved in regulation of flowering time, possibly through deposition of H3K9 methylation at the flowering time repressor FLC . The functions of other members of the KMT1B subclass remain uncharacterized so far.
KMT2 Group-1 members contain one PWWP, one FYR and two PHD domains. Only one member belonging to Group-1 is found in O. sativa or B. distachyon, but two members are found in B. rapa and A. thaliana. Our examination revealed that BrKMT2;1a has synteny with AtKMT2;1a/SDG27/ATX1, and BrKMT2;1b with AtKMT2;1b/SDG30/ATX2, suggesting that they are derived from two different ancestral copies before the B. rapa/A. thaliana separation. Consistent with this, the atx1 mutant plants exhibit strong and pleiotropic defects , but the atx2 mutant plants have a normal phenotype . The atx2 mutation can enhance atx1 in reduction of expression of the flowering repressor gene FLC through reduced levels of H3K4me3 at the FLC locus .
The PWWP and FYR domains are absent from the Group-2 members and the PHD domain is found only in some monocot proteins (Figure 4). Only one Group-2 representative member is found in the dicot species B. rapa or A. thaliana, but the monocot O. sativa has two members and B. distachyon has three members. The B. rapa and A. thaliana proteins, as well as one member each from O. sativa and B. distachyon, contain a GYF domain in the N-terminal part of the protein (Figure 4). The fact that this domain is conserved in KMT2;2 proteins from all four higher plant species suggests that the acquisition of the GYF domain occurred before the monocot/dicot separation and may have a conserved function in higher plants. Genetic analysis demonstrated that AtKMT2;2/SDG25/ATXR7 is necessary in preventing early flowering [43, 44]. The recombinant AtKMT2;2/SDG25/ATXR7 protein was shown to methylate histone H3 in vitro and the depletion of AtKMT2;2/SDG25/ATXR7 in planta slightly reduced H3K4 and H3K36 methylation at FLC chromatin [43, 44]. OsKMT2;2b/SDG732 and BdKMT2;2c contain two PHD domains, but BdKMT2;2b, like the yeast protein ScKMT2/Set1, does not contain a recognizable PHD domain. Future study of these monocot proteins will likely provide a deeper understanding of the domain evolution of KMT2 proteins.
The Group-3 KMT2 proteins have a domain organization more similar to Group-1 except that they lack the FYR domain (Figure 4). B. rapa and A. thaliana both have three Group-3 members, but each of the other two higher plant species has only two Group-3 members. Synteny was observed for BrKMT2;3c with AtKMT2;3c/SDG29/ATX5 but not with AtKMT2;3b/SDG16/ATX4, suggesting that AtKMT2;3b/SDG16/ATX4 was derived from a relatively recent duplication event. This is in agreement with a previous study revealing that AtKMT2;3b/SDG16/ATX4 and AtKMT2;3c/SDG29/ATX5 are collinearly duplicated with AtKMT2;1a/SDG27/ATX1 and AtKMT2;1b/SDG30/ATX2 . To date, none of the Group-3 proteins has been functionally characterized.
The KMT3 class plant proteins share high sequence similarity and the AWS (a subdomain of Pre-SET), SET and Post-SET domain organization with the yeast H3K36-methyltransferase ScKMT3/Set2 (Figure 5). The Group-1 proteins have a long sequence and contain an additional CW domain specific to this group. The CW domain of AtKMT3;1/SDG8/ASHH2/EFS/CCR1 was recently shown to bind H3K4me1/me2, suggesting a novel link between H3K4 and H3K36 methylation in plants. AtKMT3;1/SDG8/ASHH2/EFS/CCR1 is the major H3K36-methyltransferase specifically required for H3K36me2 and H3K36me3 deposition, and activates expression of hundreds of genes including FLC and MAFs. Depletion of AtKMT3;1/SDG8/ASHH2/EFS/CCR1 causes pleiotropic phenotypes, including early flowering, reduced organ size, increased shoot branching, perturbed fertility and carotenoid composition, and impaired plant defenses against pathogens [47–54]. The other KMT3 plant proteins have a shorter sequence and do not contain the CW domain; interestingly, the depletion of AtKMT3;2/SDG26/ASHH1 resulted in a late-flowering phenotype associated with elevated levels of FLC expression. The Group-3 KMT3 proteins, with the exception of BdKMT3;3b and OsKMT3;3b/SDG707, contain a PHD domain, and AtKMT3;3/SDG4/ASHR3 was reported to be involved in pollen and stamen development, possibly through mediating H3K4me2 and H3K36me3 deposition [55, 56]. The functions of the Group-4 proteins remain unexamined so far. Examination of this group in B. rapa could be a challenge because of gene multiplication and more diverged sequences (Figure 5).
The SANT (SWI3, ADA2, N-CoR, and TFIIIB DNA-binding) domain is found in most of the plant KMT6A subclass proteins. This domain is also found in a number of other chromatin remodeling proteins with multiple activities such as DNA-binding, histone tail binding, and protein-protein interactions . Nevertheless, the precise role of the SANT domain in KMT6A proteins is currently unknown. Notably, BrKMT6A;3b does not contain a SANT domain. As expected from restricted AtKMT6A;3/SDG5/MEA expression in only a small number of cells during reproduction, expression of both BrKMT6A;3a and BrKMT6A;3b is barely detectable in the examined tissues (Additional File 3: Figure S2). It will be interesting to investigate BrKMT6A;3a and BrKMT6A;3b expression during reproduction and to examine whether both genes are functionally important.
The KMT6B subclass includes two members each in A. thaliana, B. distachyon and O. sativa, and four members in B. rapa, which together can be divided into two distinct groups (Figure 6b). The two members from A. thaliana, AtKMT6B;1/SDG15/ATXR5 and AtKMT6B;2/SDG34/ATXR6, were classified as trithorax-related in the first genome analysis of Arabidopsis SET-domain proteins . However, our study as well as two previous studies that included a more complete set of plant SET-domain proteins clearly show that AtKMT6B;1/SDG15/ATXR5 and AtKMT6B;2/SDG34/ATXR6 belong to the KMT6B subclass (Figure 1) [12, 20]. Consistent with this, functional analysis revealed that AtKMT6B;1/SDG15/ATXR5 and AtKMT6B;2/SDG34/ATXR6 are involved in monomethylation of H3K27 . They appear to act redundantly, because depletion of H3K27 monomethylation is only detectable in the atxr5 atxr6 double mutant . KMT6A-mediated H3K27me3 is mainly present in euchromatic regions and is important for gene silencing , but KMT6B-mediated H3K27me1 is found in heterochromatic chromocenters and is important for heterochromatin condensation and replication in Arabidopsis .
Distinct from KMT6A proteins containing a SANT domain, many plant KMT6B subclass proteins contain a PHD domain (Figure 6). The PHD domain of both AtKMT6B;1/SDG15/ATXR5 and AtKMT6B;2/SDG34/ATXR6 strongly bind unmethylated H3 tail peptides (amino acids 1-21), and this binding is negatively affected by methylation on H3K4 . This binding preference may help to assure that these KMT6B proteins are not targeted to euchromatin and active genes enriched in H3K4 methylation. Remarkably, both Group-1 and Group-2 members are duplicated in B. rapa. Both BrKMT6B;1a and BrKMT6B;1b are syntenic with AtKMT6B;1/SDG15/ATXR5, and both BrKMT6B;2a and BrKMT6B;2b with AtKMT6B;2/SDG34/ATXR6. Expression of BrKMT6B;1a, BrKMT6B;2a and BrKMT6B;2b was detected in different tissues, but we failed to detect BrKMT6B;1b expression (Additional File 3: Figure S2). It is reasonable to speculate that BrKMT6B;1a, BrKMT6B;2a and BrKMT6B;2b might have redundant functions.
Over the last 10 years, a number of SET-domain genes in Arabidopsis and in rice have been characterized and shown to exert crucial chromatin-based functions via histone methylation during plant growth and development [3, 15]. However, the nomenclature of plant SET-domain proteins remains complex, and multiple synonyms exist for many Arabidopsis proteins (Table 1), which could cause considerable confusion in this important field. Nomenclature based on sequence similarity has several advantages: it informs the prediction of the histone lysine residue targeted by a KMT enzyme, and it provides a global view of KMT types in an organism once its whole genome sequence becomes available. The SDG nomenclature, however, provides no information about enzyme substrate specificity, and the number following "SDG" can be long and difficult to remember [12, 63]; for example, the first SDG from B. rapa (ID = 197) would have been named SDG19701. While the nomenclature by Baumbusch and colleagues provided information about homology to animal proteins, an incomplete list of Arabidopsis SET-domain proteins and the limitation (at that time) of having only one plant species with a genome-wide analysis restricted the precision and correctness of phylogenetic grouping in that study. In addition, animal KMT nomenclature had also not been coherent; a rational nomenclature was proposed only recently. Therefore, the nomenclature we propose here is in line with the latest advances in the field.
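To illustrate how the proposed names encode classification, the following Python sketch splits a name such as AtKMT1A;3a/SDG9/SUVH5 into species prefix, class, subclass, group, member letter and synonyms. The regular expression is an assumption inferred from the names that appear in this article; it is not an official grammar, and the function name parse_kmt_name is hypothetical.

```python
import re

# Hypothetical pattern inferred from names used in this article,
# e.g. "BrKMT1A;4e" or "AtKMT2;2"; it is not an official grammar.
NAME_RE = re.compile(
    r"^(?P<species>[A-Z][a-z])"   # two-letter species prefix, e.g. Br, At
    r"KMT(?P<kmt_class>\d+)"      # class number, e.g. 1, 2, 3, 6, 7
    r"(?P<subclass>[A-Z])?"       # optional subclass letter, e.g. A or B
    r";(?P<group>\d+)"            # group number after the semicolon
    r"(?P<member>[a-z])?$"        # optional member letter for duplicates
)


def parse_kmt_name(full_name: str) -> dict:
    """Split 'AtKMT1A;3a/SDG9/SUVH5' into its KMT fields and synonyms."""
    primary, *synonyms = full_name.split("/")
    match = NAME_RE.match(primary)
    if match is None:
        raise ValueError(f"not a recognised KMT-style name: {primary!r}")
    fields = match.groupdict()      # unmatched optional parts are None
    fields["synonyms"] = synonyms
    return fields


print(parse_kmt_name("AtKMT1A;3a/SDG9/SUVH5"))
print(parse_kmt_name("BrKMT2;1b"))
```

Under this reading, the class and subclass fields point to the predicted substrate (for example, KMT1 for H3K9), which is the information the SDG numbering does not carry.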
We identified 49 SET-domain proteins from the recently completed whole genome sequence of B. rapa. Among them, 5 proteins belong to the S-ET class likely involved in nonhistone protein methylation, 20 proteins belong to the KMT1 class potentially involved in H3K9 methylation, 6 proteins belong to the KMT2 class potentially involved in H3K4 methylation, 7 proteins belong to the KMT3 class potentially involved in H3K36 methylation, 8 proteins belong to the KMT6 class potentially involved in H3K27 methylation, and 3 belong to the KMT7 class also potentially involved in H3K4 methylation. This in silico survey is useful for future functional analysis of this important family of epigenetic regulators in Brassica. H4K20 methylation was detected in Arabidopsis using antibodies [28, 65], but the catalyzing enzyme(s) involved is(are) not yet known and the current phylogenetic analysis did not allow prediction of a specific KMT class involved in H4K20 methylation. It is possible that some members of the aforementioned KMT classes catalyze H4K20 methylation. The total number of KMTs in B. rapa (49) is slightly higher than that identified in A. thaliana (37), O. sativa (36) and B. distachyon (41). Nevertheless, we could not exclude the possibility that a few more KMTs may be missing from the currently available genome sequence of B. rapa.
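As a quick consistency check, the class sizes listed above can be tallied against the stated total. The short Python sketch below uses only the counts and predicted substrates given in the text; the variable names are illustrative.

```python
# Class sizes for B. rapa as reported in the text, with the predicted substrate.
BR_CLASS_COUNTS = {
    "S-ET": (5, "nonhistone proteins"),
    "KMT1": (20, "H3K9"),
    "KMT2": (6, "H3K4"),
    "KMT3": (7, "H3K36"),
    "KMT6": (8, "H3K27"),
    "KMT7": (3, "H3K4"),
}

total = sum(count for count, _ in BR_CLASS_COUNTS.values())
for name, (count, substrate) in BR_CLASS_COUNTS.items():
    share = count / total
    print(f"{name}: {count} proteins ({share:.0%}), predicted substrate {substrate}")
print("total:", total)  # 5 + 20 + 6 + 7 + 8 + 3 = 49
```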
Gene duplication is one of the primary driving forces in the evolution of genomes and genetic systems, and is considered to be a major mechanism for the establishment of new gene functions and the generation of evolutionary novelty [66, 67]. Contrary to what would be expected from the chromosome number duplication in B. rapa compared to A. thaliana, the number of KMT genes in B. rapa (49) is much less than double the number of A. thaliana KMTs (2 × 37 = 74). Many duplicated genes show synteny with their A. thaliana homologues, suggesting that they are derived from chromosome/genome segment duplications. Three alternative outcomes can occur in the evolution of duplicated genes: (i) one copy may simply become silenced by degenerative mutations (nonfunctionalization); (ii) one copy may acquire a novel, beneficial function and become preserved by natural selection (neofunctionalization); (iii) both copies may become partially compromised by mutation so that their total capacity adds up to the capacity of the single-copy ancestral gene (subfunctionalization). These different outcomes likely apply to different duplicated KMT genes, judging from their expression patterns (Additional File 3: Figure S2). Expression of BrKMT1A;4c and BrKMT6B;1b was undetectable, suggesting that they might have been nonfunctionalized. The duplicated pairs BrKMT1B;2a and BrKMT1B;2b, BrKMT3;4b and BrKMT3;4d, or BrKMT7;1a and BrKMT7;1b are differentially expressed in plant organs, suggesting that they might have acquired distinct tissue-specific functions. Finally, some duplicated genes, e.g. BrKMT1A;2a and BrKMT1A;2c, BrKMT1B;1a and BrKMT1B;1b, or BrKMT6A;3a and BrKMT6A;3b, showed similar expression patterns, suggesting that they might be subfunctionalized and/or have redundant functions.
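The comparison and the expression-based reasoning above can be sketched in a few lines of Python. The gene totals come from the text; the function suggest_fate is a hypothetical heuristic that only mirrors the interpretations offered in this paragraph, and expression data alone cannot establish the evolutionary fate of a duplicate.

```python
AT_KMTS = 37   # A. thaliana total given in the text
BR_KMTS = 49   # B. rapa total given in the text

expected_if_doubled = 2 * AT_KMTS       # 74 genes if every copy were retained
retention_ratio = BR_KMTS / AT_KMTS     # roughly 1.3 B. rapa genes per A. thaliana gene


def suggest_fate(tissues_a: set, tissues_b: set) -> str:
    """Hypothetical heuristic comparing the tissues in which each copy is detected."""
    if not tissues_a or not tissues_b:
        return "possible nonfunctionalization of the undetected copy"
    if tissues_a == tissues_b:
        return "possible subfunctionalization and/or redundancy"
    return "possible distinct tissue-specific functions"


print(expected_if_doubled, round(retention_ratio, 2))
# The tissue sets below are placeholders, not the measured expression data.
print(suggest_fate({"leaf", "root"}, set()))
print(suggest_fate({"leaf"}, {"flower bud"}))
print(suggest_fate({"leaf", "stem"}, {"leaf", "stem"}))
```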
Among the groups showing gene duplications in B. rapa, it is worth noting that in Arabidopsis, AtKMT6A;3/SDG5/MEA is critical for parental gene imprinting and seed development [8, 68], AtKMT6B;1/SDG15/ATXR5 and AtKMT6B;2/SDG34/ATXR6 are important for heterochromatin condensation and replication [59, 60], and AtKMT7;1/SDG2/ATXR3 is essential for both sporophyte and gametophyte development [61, 62]. It will be of great interest to investigate these groups of genes for their regulation and function in chromatin organization, plant growth and development in B. rapa.
Our study shows that the plant SET-domain KMT proteins can be phylogenetically grouped into distinct classes and that the classes involved in histone methylation can be named in accordance with the nomenclature proposed for animal and yeast SET-domain KMTs. Such a coherent and rational nomenclature in different organisms will help avoid confusion caused by the existence of multiple names for the same protein or gene. The information provided on the B. rapa KMTs will also be beneficial for future research to unravel the mechanisms of epigenetic regulation in Brassica crops.
Sequences of SET-domain proteins from A. thaliana, O. sativa, O. tauri and S. cerevisiae were retrieved from the Chromatin Database (ChromDB, http://www.chromdb.org) using the keyword SDG in the respective species databases. These sequences, primarily those from A. thaliana and O. sativa, were used as queries to search the B. distachyon genome (http://www.brachypodium.org) and the B. rapa genome (http://brassicadb.org/brad/index.php) with the BLASTp and tBLASTn tools (http://blast.ncbi.nlm.nih.gov). The Expect threshold was set at 1.0 and other parameters were left at default values. We did not use a strict E-value threshold; rather, we examined each of the resulting hits for the presence of the SET or S-ET domain to collect previously unidentified sequences. The synteny analysis was performed using the online viewer tool (http://brassicadb.org/brad/index.php). ESTs of the B. rapa SET-domain protein genes were retrieved from the Brassica Database (http://brassicadb.org/brad/index.php) and from NCBI (http://blast.ncbi.nlm.nih.gov), using an Expect threshold of 1 and a minimum sequence length of 50 bp.
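The permissive first pass described above (keep every hit with an Expect value up to 1.0, then inspect each candidate for a SET or S-ET domain) could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used in the study, which relied on web tools: it assumes the hits were exported in the standard 12-column BLAST tabular layout, and the function name and sequence identifiers are hypothetical.

```python
def candidate_subjects(tabular_text: str, max_evalue: float = 1.0) -> list:
    """Return unique subject IDs with at least one hit at or below max_evalue.

    Assumes the 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    seen = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        subject, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue and subject not in seen:
            seen.append(subject)
    return seen


# Made-up example rows; the identifiers are placeholders, not real gene IDs.
example = "\n".join([
    "query_1\tsubject_1\t45.0\t130\t70\t2\t1\t130\t10\t139\t2e-30\t110",
    "query_1\tsubject_2\t25.0\t60\t45\t0\t5\t64\t3\t62\t0.8\t30.1",
    "query_2\tsubject_1\t40.0\t128\t75\t1\t3\t130\t12\t139\t5e-22\t95.0",
    "query_2\tsubject_3\t22.0\t50\t39\t0\t1\t50\t7\t56\t3.5\t25.0",
])
print(candidate_subjects(example))
```

Each subject returned by such a filter would still need to be checked by domain search, because a loose Expect threshold deliberately admits weak hits.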
The protein sequences were analyzed for domain organization using NCBI-CD searches (http://ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). The low-complexity filter was turned off, and the Expect value was set at 1.0 to detect short domains or regions of less conservation in this analysis. Domains were also verified and named according to the SMART database (http://smart.embl-heidelberg.de/).
Multiple sequence alignments of SET-domain sequences were performed using the ClustalW program. The resulting file was subjected to phylogenetic analysis using the MEGA4.0 program. The trees were constructed with the following settings: Tree Inference as Neighbor-Joining; Include Sites as pairwise deletion option for total sequences analysis and complete deletion option for each class analysis; Substitution Model: Poisson correction; and Bootstrap test of 1,000 replicates for internal branch reliability.
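For readers unfamiliar with the distance measure, the Poisson correction converts the proportion p of differing amino acid sites between two aligned sequences into an estimated number of substitutions per site, d = -ln(1 - p). The Python sketch below applies this standard formula with pairwise deletion of gapped sites. It is an illustration only and does not reproduce MEGA's implementation; the function name and the choice of "-" as the only gap symbol are assumptions.

```python
import math

GAP = "-"  # assumption: only this symbol marks a gap in the alignment


def poisson_distance(seq_a: str, seq_b: str) -> float:
    """Poisson-corrected distance between two aligned protein sequences.

    Sites where either sequence has a gap are skipped (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == GAP or b == GAP:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped sites in common")
    p = differences / compared
    if p >= 1.0:
        raise ValueError("distance undefined when every compared site differs")
    return -math.log(1.0 - p)


# One difference over four compared sites: p = 0.25, d = -ln(0.75)
print(round(poisson_distance("ACDE", "ACDF"), 4))
```

A matrix of such pairwise distances is the input that a Neighbor-Joining algorithm clusters into a tree; bootstrap support is then obtained by resampling alignment columns and repeating the procedure.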
B. rapa plants were grown at 18-22°C under a 12 h light (10,000 Lx)/12 h dark photoperiod. Leaves were collected from 2-, 4-, 6-, 8- or 10-week-old plants; roots and stems were collected from 6-week-old plants; flower buds were collected from 10-week-old plants. Total RNA was extracted using Trizol reagent (Invitrogen, USA) from about 100 mg of collected plant tissue. The RNA preparation was then treated with DNaseI (Promega, USA) for 30 min at 37°C, followed by enzyme inactivation by incubation at 65°C for 5 min. First strand cDNA was made using an RT-PCR Kit (RevertAid™ First Strand cDNA Synthesis Kit, Fermentas, CA). The RT-solution with first strand cDNA was stored at -80°C. Primers used for the RT-PCR reactions are listed in Additional File 4: Table S2. Conditions for the PCR reactions were as follows: 94°C for 3 min; then 30 cycles of 94°C for 30 s, 50-63°C for 30 s, and 72°C for 1 min; and finally 72°C for 8 min. PCR products were separated in a 1.5% (w/v) agarose Tris-borate/EDTA buffer gel and visualized by ethidium bromide staining.
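As a rough guide to run time, the programmed hold times of the cycling protocol above can be summed as shown in this short Python sketch. The figure ignores temperature ramping between steps, which depends on the thermocycler, so it is a lower bound only; the variable names are illustrative.

```python
# Cycling program from the text, in seconds; ramp times between steps are ignored.
INITIAL_DENATURATION = 3 * 60     # 94°C for 3 min
CYCLES = 30
CYCLE_STEPS = [30, 30, 60]        # 94°C, 50-63°C annealing, 72°C extension
FINAL_EXTENSION = 8 * 60          # 72°C for 8 min

total_seconds = INITIAL_DENATURATION + CYCLES * sum(CYCLE_STEPS) + FINAL_EXTENSION
print(f"programmed hold time: {total_seconds / 60:.0f} min")  # 180 + 3600 + 480 s = 71 min
```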
We thank Dr Xiaowu Wang, Jian Wu and Fen Chen, from the Institute of Vegetables and Flowers of the Chinese Academy of Agricultural Sciences, for providing sequences and helping in synteny analysis. This work was supported in part by National Basic Research Program of China (973 Program, 2012CB910500) and National Natural Science Foundation of China (NSFC31071129 and NSFC31071455).
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.