The inconsistency in the origin of methylation-related genes
The methylation-related genes have been comprehensively characterized in A. thaliana. In this study, all these genes were divided into five categories including eight groups. The five categories included genes primarily functioning in de novo DNA methylation, methylation maintenance, nucleosome remodeling, histone modification and demethylation (Fig. 1). The canonical RdDM process of de novo DNA methylation includes three steps: Pol IV-dependent siRNA biogenesis, Pol V-mediated de novo methylation and chromatin alteration. We identified methylation-related genes for all these categories in 77 species involving of algae and plants (Additional file 4: Table S1). These algae include five species, and plants include species from bryophyte, fern, gymnosperm, basal angiosperm, monocot and dicot. We summarized the content of methylation-related genes in each species (Additional file 5: Table S2).
In each of the five categories, we calculated the ratio of the number of species containing methylation-related genes to the total number of species, and explored the evolution of these genes from algae to higher flowering plants (Fig. 1). In the RdDM process, we investigated 5 genes required for Pol IV-dependent siRNA biogenesis. Of them, only HEN1 is found in algae, and some genes in the bryophytes, ferns, monocots and dicots are missing. There are 9 genes involved in Pol V-mediated de novo methylation. SUVH2 is absent in basal angiosperm, and is also absent in 71% of dicots, whereas it is found in all monocots. There are 2 genes involved in the chromatin alteration process, they are absent in the algae. For RNA polymerases, we explored the NUCLEAR RNA POLYMERASE D1 (NRPD1), NRPE1, NRPD2/E2, NRPD4/E4, and NRPE5. Interestingly, all the five subunits are not found in algae. The methylation of the CHH site is found to be re-established by the process of de novo DNA methylation during each cell cycle, while de novo DNA methylation of the CHH site includes RdDM and another RdDM-independent way [12]. The latter requires the participation of CMT2 which is absent in algae. Collectively, we conclude that only a few genes involved in de novo DNA methylation appeared in algae, however, these genes became widespread from bryophytes to higher plants. Therefore, the de novo DNA methylation process may appear after divergence of chlorophyta and embryophyta.
In the process of maintenance of DNA methylation, different enzymes work in different ways. The methylation of CG requires the participation of MET1 which is basically present in all these species, including four of the five lower algae species. The methylation of CHG requires the participation of CMT3 which is also basically present in these studied species. Therefore, the maintenance of CG and CHG methylation may derive from the common ancestors of chlorophyta and embryophyta. The nucleosome remodeler DDM1 which is essential for mediating methylation in the heterochromatin region, is basically present in all the studied species. The time of its occurrence is consistent with that of DNA methyltransferases (MET1 and CMT3). There are 4 genes (JMJ14, UBP26, HDA6 and SUVH4) involved in the histone modification process, they all are found in the algae. These genes are important components involved in histone modications, thereby they influence DNA methylation and can affect phenotypic traits such as flowering time and drought tolerance of plants [47,48,49,50].
For the DNA demethylation process, active demethylation is mediated by four types of DNA glycosylases (DME, ROS1, DML2 and DML3). The previous studies showed that DME was found only in dicots [42], but our results indicated that DME and ROS1 were present from algae to higher flowering plants. DML2 can be seen only in dicots with a relatively small proportion (17%). An analysis of the distribution of DML2 in different species showed that it was only present in cruciferous plants (Additional file 5: Table S2). DML3 was present in dicots and monocots. These results showed that DNA demethylation might have originated from the common ancestor of chlorophyta and embryophyta, while DML2 is limited in cruciferous plants and DML3 is limited in dicots and monocots.
The abundance and duplication of methylation-related genes
To explore how methylation-related genes evolved, we investigated the abundance and duplication patterns of methylation-related genes in different species. Firstly, we summarized the abundance of methylation-related genes in different species (Additional file 5: Table S2). A large number of gene expansion events were observed. In 28 diploid species, such as Setaria italica, Brachypodium distachyon, Fraxinus excelsior, Malus domestica, Lotus japonicus, Glycine max, Populus trichocarpa and Citrus sinensis, the number of genes containing multiple copies was significantly larger than that in the other 43 diploid species (ANOVA, p < 0.01) (Additional file 5: Table S2). Among the six polyploid species, including five tetraploids (Panicum virgatum, Nicotiana tabacum, Gossypium hirsutum, Gossypium barbadense, Brassica napus) and one hexaploid (Triticum aestivum), the number of methylation-related genes was normalized by dividing them at the diploid level. The result showed that the number of genes containing multiple copies was relatively larger in P. virgatum, T. aestivum and B. napus (ANOVA, p < 0.01) (Additional file 6: Table S3).
Gene absence is also observed for some methylation-related genes. In algae, there is a notable absence of methylation-related genes (Additional file 5: Table S2). There are 23 methylation-related genes that are not present in all five algae species. It is only found that 4 methylated-related genes, JMJ14, UBP26, DDM1 and DME, were present in all 5 algae. The absence rate of methylation-related genes is much larger in advanced plants. In monocots, 1 to 10 genes were missing. In dicots, there are 48 species that had gene absence events. In the three Arabidopsis species, the methylation-related genes in A. thaliana and A. lyrata were conserved, while in the A. halleri, 4 genes (RDM1, SUVH9, HDA6 and NRPD/E2) were absent. In Solanum tuberosum, the absence of genes was greatly pronounced, with 10 of the 33 genes were missing.
The expansion of methylation-related genes may be related to whole genome duplication (WGD) and tandem duplication. For example, the genome duplication event of G. max that occurred approximately 13–59 million years ago, contributed to the formation of multiple copies for nearly 75% of genes [51]. About 50 million years ago, there was a WGD event in M. domestica, which increased gene number significantly [52]. Two IDN2 genes in Gossypium raimondii, Gorai.010G201400 and Gorai.010G201300, were evolved from tandem duplication. Here, we analyzed the evolutionary patterns of methylation-related genes in Oryza sativa, Solanum lycopersicum, Vitis vinifera, Amborella trichopoda, Cucumis sativus and Phaseolus vulgaris (Additional file 7: Table S4). Through the analysis of WGD and tandem duplication information of these five species [53, 54], we found that both duplication patterns contributed to gene expansion in these species. Two copies of the MET1 gene in O. sativa were evolved from WGD. In three copies of AGO4 gene in P. vulgaris, Phvul.008G206600 and Phvul.006G021200 were evolved from WGD, Phvul.008G206600 and Phvul.008G206500 were evolved from tandem duplication. The similar phenomenon can also be found in S. lycopersicum, V. vinifera, A. trichopoda and C. sativus.
We used 17 of 77 species for a gene collinearity analysis, including four early plants, four monocots and nine dicots. In each methylation process, we carried out a collinear analysis of some genes (Fig. 2 Additional file 1: Figure S1). A total of 10 methylation-related genes were analyzed. The collinearity of chromosomal regions which contained methylation-related genes indicated that these genes were the products of WGD or segment duplication. For AGO4, DMS3 and other genes, some chromosomal segments had better collinearity in the same species. We also compared the collinearity between different species. There is no collinearity between the early plants and the monocots or dicots for most genes, and there is also no collinearity between monocots and dicots for most genes. In different early plant species, chromosomal collinearity is not observed. In monocots, the collinearity of Zea mays, Sorghum bicolor, and Oryza sativa is found for most genes. In dicots, the collinearity of most genes is observed, however, some genes in A. thaliana, Citrullus lanatus, S. lycopersicum and Carica papaya are in poor collinearity. These data suggest that genome evolution has an effect on disrupting chromosomal collinearity in different species in terms of methylation-related genes.
The conservation and change of gene structures of methylation-related genes
To explore how gene structure has changed during evolution, we used 18 representative species for an exon-intron structure analysis. We started this analysis by examining exon-intron structure of these methylation-related genes at the species level (Additional file 8: Table S5). Generally, gene length and intron length varied largely (Additional file 8: Table S5A-B, S5G-H), but the exon number and exon length of most genes were conserved (Additional file 8: Table S5C-F). The correlation analysis indicated that the change of gene length was primarily due to the change of intron length (r = 0.79, p < 0.01). For example, in A. trichopoda, Z. mays, A. thaliana and V. vinifera, gene length and intron length of most methylation-related genes changed a lot, but the exon length varied little. Interestingly, in C. reinhardtii, P. abies and C. papaya, the gene length and intron length of most methylation-related genes varied a lot, but the exon length and exon number of some methylation-related genes also had some changes.
Then, the exon-intron structure was studied at the functional domain level of protein sequences (Fig. 3 and Additional file 2: Figure S2). First, five genes were studied which were involved in de novo DNA methylation (Additional file 2: Figure S2A-E), including RDR2, CLSY1, DCL3, HEN1 and AGO4 in the process of Pol IV-dependent siRNA biogenesis. The exon-intron structure of AGO4 was conserved, and the changes of exon-intron structure of the other four genes were large, mostly because of intron changes. The domain structure of these genes was basically conserved. In the Pol V-mediated de novo methylation (Fig. 3a, Additional file 2: Figure S2F-M), the changes in exon-intron structure of these genes were notable except for KTF1. The domain structure of RDM1, DRD1, MORC6 and KTF1 was highly conserved from early plants. It is found that DMS3 only contains low complexity regions and coiled coil regions. SUVH2/9 had SRA, PreSET and SET domains in spermatophyte, but both genes had only one of these three domains in early plants. In the chromatin alteration process (Additional file 2: Figure S2 N-O), the gene and intron length of IDN2 and IDP1 changed a lot, but their protein domains were highly conserved. In the subunits of RNA polymerases (Additional file 2: Figure S2 P-T), the changes of exon-intron structure for NRPD1, NRPD/E4 and NRPE5 were mainly because of intron changes. Their functional domains were conserved with key domains existing in the most ancestral species. We also found that in early plants (P. abies and A. trichopoda), the domain structure of some key genes changed greatly. A similar trend was observed in monocots O. sativa and H. vulgare and dicot C. papaya. In the process of de novo CHH methylation (Additional file 2: Figure S2U), the exon-intron structure of CMT2 changed a lot. The CMT2 in C. papaya was short and contained only a partial domain, and CMT2 in P. abies and H. vulgare were short and did not contain domains that existed in other species.
Second, we investigated genes responsible for maintenance of DNA methylation (Fig. 3b, Additional file 2: Figure S2V). The exon-intron structure and domains of two genes (MET1 and CMT3) were generally conserved. However, for two duplicated MET1 genes in rice, only the DNA_methylase domain was found in each copy, and the protein sequences were short. In C. reinhardtii, CMT3 contained some peptide segments with low-complexity, but lacked the BAH and CHROMO domains. In P. abies, these domains of CMT3 were absent. The domains of CMT3 in rice were also different from that in Arabidopsis. As for nucleosome remodeling factor DDM1 (Additional file 2: Figure S2W), the exon-intron and domain structures were highly conserved in higher plants. In the histone modification process (Additional file 2: Figure S2X-Z, S2AA), the exon-intron structure of genes changed a lot, but their protein domains were highly conserved from early plants.
Finally, we investigated genes responsible for DNA demethylation (Additional file 2: Figure S2BB-DD). The exon-intron and domain structures of DME, ROS1 and DML3 were conserved. There were only a few exceptions. The exon length of DME in C. reinhardtii was long. For domain structures, DME in C. reinhardtii contained a large number of low-complexity regions in addition to the partial domain of A. thaliana. The protein sequences of DME and ROS1 in P. abies were short and contained only a partial domain. ROS1 in C. papaya was short and did not contain any domains.
In summary, the exon-intron structure and functional domains varied greatly in the early plants, but they were conserved for most genes in monocots and dicots. Of note is the observation that a few domains of some genes in monocots H. vulgare and O. sativa varied greatly, and the dicot species C. papaya had considerable changes in the domains of most genes. Genes with domain changes had notable changes in exons, which is consistent with the overall changes of the exon-intron structure at the species level (Additional file 8: Table S5).
The phylogenetic relationship and purifying selection of methylation-related genes
The identification of genes in more than 77 species has led to investigate the origin of methylation pathways. Through exon-intron and domain structure analysis, we have known that although the length of genes varied greatly, but the functional structures were relatively conserved. Next, we sought to explore the phylogenetic relationship of methylation-related genes (Fig. 4 and Additional file 3: Figure S3).
At first, we studied de novo DNA methylation-related genes (Fig. 4a-b, Additional file 3: Figure S3A-O). The de novo DNA methylation process involves many genes, and a phylogenetic analysis categorized them into three groups, including early plants, dicots and monocots. A detailed analysis of these data showed that most methylation-related genes with multiple homologs tended to cluster in different branches in monocots, however most of these genes tended to gather in a single branch in dicots. We hypothesize that methylation-related genes in monocots may have been duplicated prior to the divergence of monocots and dicots, and possibly most of the genes in dicots tend to duplicate after this divergence event. The MORC6 genes in monocots are clustered into three branches, while the MORC6 genes in dicots C. lanatus, P. trichocarpa, C. papaya and S. lycopersicum were clustered in one branch. A similar observation was obtained for AGO4, DCL3, DRD1 and IDP1. Because both NRPD1 and NRPE1 came from NRPB1 and the sequence similarity between NRPD1 and NRPE1 was very high, we combined NRPD1 and NRPE1 subunits together to explore their evolutionary relationship. In early plants, NRPD1 and NRPE1 were clustered together, but they were clustered into two branches in angiosperms, which suggested that NRPD1 and NRPE1 might be divergent from the ancestors of monocots and dicots.
We then studied those genes responsible for maintenance of DNA methylation (Fig. 4c). In order to explore a full phylogenetic relationship of these methyltransferases, we constructed an evolutionary tree by combining them together with those responsible for de novo methylation. It can be seen that MET1, CMT2 and CMT3 were well clustered together, and the DRM1 and DRM2 were clustered into two branches in monocots but clustered together in dicots and early plants. As for the nucleosome remodeling factor DDM1, we found that all the DDM1 genes in different species were clustered into three branches (Fig. 4d). In the histone modification process (Additional file 3: Figure S3P-S), with some exceptions, JMJ14, UBP26, HDA6 and SUVH4 are basically clustered into three branches.
Finally, we analyzed demethylation-related genes. In the monocots and dicots, most DNA glycosylases were clustered separately into the DME branch, the ROS1 branch or the DML3 branch, but they were clustered together in the early plants (Additional file 3: Figure S3T). This result suggests that the ancestral demethylation-related genes might be differentiated into different demethylation-related genes after the divergence of monocots and dicots.
In addition, we investigated the selection pressure of methylation-related genes by calculating the ratio of nonsynonymous substitutions per nonsynonymous site (Ka) to synonymous substitutions per synonymous site (Ks) (omega) for each homologous gene pair in nine dicots, with reference to genes in A. thaliana. We found that all genes were under negative selection (Fig. 5). This indicated that these genes were highly conserved during the evolution of dicotyledonous plants, which also reflected the importance of DNA methylation for dicotyledonous plants.
Divergent expression patterns of methylation-related genes in different species
On the basis of tracing evolution history of methylation-related genes from early plants to higher plants, we next aim to explore how the methylation-related genes have evolved at the expression level in different species. We selected 10 species for analysis of expression patterns, including algae, bryophyte, fern, basal angiosperm, dicot and monocot. In three early plants - C. reinhardtii, P. patens and S. moellendorffii, samples from the whole plant were used; in basal angiosperm A.trichopoda, 3 dicots and 3 monots, data from the leaf tissue were used. The normalized expression level of gene (TPM) was shown in the form of heatmap (Fig. 6).
First of all, we compared the expression patterns of methylation-related genes at the species level. We can see that the early plants, monocots and dicots were clustered into one branch, respectively. This indicated that the expression patterns of methylation-related genes might have changed during the evolution from early plants to higher plants, as well as during the differentiation of monocots and dicots. In the evolution of these genes, there were changes not only in duplication pattern and gene structure, but also in gene expression patterns, which provides an idea for us to study the origin and evolution of other genes. From the gene level, we can see that the expression patterns of these genes can be clustered into four categories. Here, we mainly analyzed the expression pattern of angiosperms. In cluster III and IV, the expression levels were high, which may reflect the conservation of gene expression in species evolution, and also explain the importance of these genes in plant growth and development. UBP26 and HDA6 are important components involved in histone modications, thereby influence DNA methylation and affect phenotypic traits [47,48,49,50]. DRM2 is the key gene in RdDM pathway [5, 12,13,14,15,16]. In cluster I and cluster II, the expression levels of these genes were low.