Data set assembly and identification of new AGOs in plant genomes
We began with the isolation of 11 unique, full length AGO gene homologs (Additional file 1) from N. attenuata. Putative NaAGOs showed high identity to 8 types of AtAGOs and were thus annotated accordingly as NaAGO1 (identity of >78% to AtAGO1), NaAGO2 (50.55%), NaAGO4 (>74%), NaAGO5 (59.98 %), NaAGO7 (68.95%), NaAGO8 (52.69 %), NaAGO9 (68.04%) and NaAGO10 (80.71%). For NaAGO1, three gene sequences shared >78% peptide identity with AtAGO1 and >87% peptide identity amongst each other; these were thus annotated as NaAGO1a, NaAGO1b and NaAGO1c. Similarly, two gene sequences of the NaAGO4 share 86.86 % peptide identity with each other and >74% identity with AtAGO4; these were named NaAGO4a and NaAGO4b. However, we were not able to identify AtAGO3 and 6 homologs in N. attenuata. Further, we mined the sequence data of 17 plant species to identify and similarly annotate 133 full length AGOs. These had not been previously annotated as AGOs (Additional file 1). Altogether, 263 protein sequences were used from 37 plant species (Additional files 1 and 2). Additionally, 5 AGO sequences from Tribolium castaneum, 32 AGO sequences (including AGO1 and AGO2) from insects and early branching animals (e.g. sponges, cnideria), and one each of HsAGO2 (PDB code: 4F3T) and KpAGO (PDB code: 4F1N) (for a total of 302 sequences from 66 species; Additional files 1; detailed in methods section) were used as the out-group in this analysis.
Phylogenetic classification and evolutionary expansion of plant AGOs
During evolution, AGO genes have formed an expanding family across different lineages [1,7]. To determine the evolutionary relatedness of plant AGOs, we reconstructed their phylogeny to evaluate their evolutionary patterns (Figure 1). In order to increase the confidence in the root we included 39 non-plant AGO sequences in the phylogenetic analysis. Plant AGO family proved monophyletic and the phylogenetic tree continued to consist of four major classes/clades (Figure 2, Additional files 1). Both the Neighbor Joining (NJ) and the Maximum Likelihood (ML) approaches were used to reconstruct the phylogeny of plant AGOs and both produced similar tree topologies and phylogenetic distributions into four classes/clades (Additional file 3). Homologs of AGO1 and AGO10 were clustered together (Clade I); similarly homologs of AGOs 2, 3 and 7 formed a clade (Clade III). Likewise, homologs of AGOs 4, 6, 8 and 9 formed the largest cluster (Clade IV), whereas AGO5 homologs formed an independent group (Clade II; Figure 2).
From the analysis of AGO gene expansion and loss (detailed in method section), it was observed that AGOs might have undergone between 133-143 duplication and 272-299 loss events (Figure 3, Additional file 4). We altered the alignment and alignment processing parameters to test the robustness of our analysis. When L-INSI in MAFFT and ‘Automated I’ in TrimAl were used, 140 duplication and 299 loss events were obtained; when the parameters were changed to L-INSI (MAFFT) and user defined parameters in TrimAl (detailed in methods section), 133 duplication and 294 loss events were recorded. Similarly, when Auto options were used for both MAFFT and TrimAl, 143 and 294 duplication and loss events were recorded respectively, whereas 137 duplication and 279 loss events were recorded when ‘Auto’ option in MAFFT and user-defined parameters for TrimAl were used. The reconciliation of species tree and AGO gene family tree (GFT) revealed that the AGO ancestor underwent at least five major duplication events early in its evolution, after the divergence of unicellular green algae, such as, Chlamydomonas and Volvox, but before the divergence of the Bryophytes. This probably gave rise to four distinct phylogenetic clades of AGOs (with strong statistical support with bootstrap values >90%; Figure 2, Additional file 3).
The AGO5 clade may have diverged before the divergence of higher plants, but after the evolution of multicellularity, suggesting a physiological role, possibly different from the ones regulating developmental processes (Figure 2, Additional file 3). Reconciliation of AGO GFT with the species tree showed that an ancestral AGO may have undergone >50 rounds of duplications by the time of the dicot-monocot divergence. (Figure 3, Additional file 4). Thus, diversification and duplication of AGOs could have coincided with the evolution of multicellularity, suggesting the relevance of AGOs and their associated smRNA pathways for developmental and adaptive programs.
The nodes of divergence between dicots and monocots apparent in all four AGO phylogenetic classes (Additional file 3) indicate that duplications were followed by speciation events (Additional file 4). For example, the relatively large number of AGO genes (containing all the four domains) in the Poaceae lineage, such as the 17 in O. sativa and the 14 in Brachypodium distachyon were noted (Additional file 1). These duplication events may have occurred in parallel with events leading to the loss of AGO family members during the evolution of Rosids and Lamiids (Additional file 4). Few such losses appeared to have occurred in the Brassicaceae and Solanaceae, for example, in which 10-11 members are found in A. thaliana and 11 AGOs in N. attenuata respectively (Figure 3, Additional files 1 and 4). In N. attenuata, homologs of AtAGO3 and AtAGO6 might have been lost while AGO1 and AGO4 were duplicated (Additional file 1). Duplicated copies of AGO4 are found in other Solanaceae taxa as well, such as in N. benthamiana [33] and Solanum lycopersicum (this study; Additional file 1).
The molecular clock test was performed to gain further insight into the relative timing of duplication and divergence events (Figure 3, Additional files 5 and 6). This analysis indicates that ancestral AGO gene may have required around 2 million years to duplicate four times after divergence from the unicellular green algae (Additional file 5). Clade IV may have been the first to diverge, followed by Clade III, Clade II and Clade I, respectively. It may have taken 0.5 million years for Clade I to evolve that now includes AGO1 and AGO10 homologs, while Clade IV may have required around 1.5 million years to evolve to include AGO4, AGO6, AGO8 and AGO9 homologs; AGO8 and AGO9 as its more recent descendants. Clade III most likely evolved around 1.25 million years and sub-diverged into two clusters, one comprising AGO7 and the other AGO2 and AGO3.
The phylogenetic tree (Figure 2, Additional file 3) reveals that AGO1 and AGO10 have orthologs in Selaginella and Physcomitrella. Interestingly, we found that of the 6 AGOs in Physcomitrella, the 3 previously unannotated AGO-like genes form a separate cluster (bootstrap value 100%). These AGOs may have diverged from the Clade IV lineage at a time comparable to the duplication of the ancestral AGO (Additional file 5), and thus may be orthologs of Class IV AGOs. Furthermore, homologs in unicellular forms, such as Chlamydomonas and Volvox, may have evolved independently from the multicellular lineages (Figure 2, Additional file 3). We observed that Chlamydomonas and Volvox AGOs harbor rudimentary forms of the PAZ domain but do not contain a distinct MID domain (Additional file 7). These results indicate that AGOs of higher plants are intricate and have substantially diverged from the lower, unicellular forms, potentially to facilitate the complex functions known to be regulated by smRNA pathways.
Variability in signature residues of plant AGOs
Phylogenetic analysis indicates the presence of four clades/classes of AGOs and that these have been evolving differently. In addition, in plants, different AGOs are known to interact with different types of smRNAs (as described in the Background), wherein each residue of the 7nt region of smRNA, ‘the seed region’, sits in a narrow groove to interact with different residues of the MID-PIWI lobe of AGO proteins [10]. It is hypothesized that the sorting of different species of smRNAs to various AGOs [22,23] may depend on the conservation of these residues across various AGOs. Such functionally important residues may also be regarded as signatures of specific domains. Therefore, we attempted to define class-wise signature residues for each of the four classes as well as to re-examine the overarching architecture of AGO sequences in plant genomes. The N-terminal domain of AGOs is the most variable domain, whereas, 'R/K-F-Y', 'Y-N-K-K', 'D-E-D-H/D' have been regarded as the signatures of PAZ, MID and PIWI domains, respectively [7]. Upon examining the MSA of all the plant AGOs, we found 55 positions (column score >90) with highly conserved residues (Additional file 2). In parallel, we also examined the MSA of plant AGOs in each of the four classes independently to determine class-wise signature residues (Figure 4). We identified 8 sites in the PAZ domains, 12 sites in the MID domains and 15 sites in the PIWI domains that show conservation in the four classes AGOs. In the MID domain, residues ‘K’, ‘Q’ and ‘C’ (alignment position 2485, 2497 and 2498, respectively), thought to directly bind to the 5’-phosphate of smRNAs [10], are conserved in all four classes (Figure 4). Similarly, ‘K and ‘S’ (alignment position 2834 and 2954) of PIWI domain are conserved in all the four classes (Figure 4).
Results of the MSA indicated that residue ‘R’, the popularly regarded signature of the PAZ domain ('R-F-Y', alignment positions 2002, 2034 and 2062, respectively), are only conserved in Class I AGOs (AGOs 1 and 10). ‘R’ has been largely replaced by ‘K’ (Figure 4) in AGO of Class II (AGO5) and IV (AGOs 4, 6, 8 and 9), whereas the consensus residue could not be determined for this position (Figure 4) in the PAZ domain of Class III AGOs (AGOs 2, 3 and 7). Further, ‘H’ at the alignment position 1985 (Figure 4) in the PAZ domain, thought to be important in the recognition of the 3’ ends of smRNAs [10], is conserved only in Classes I-III; conserved residues were not found at this position in the PAZ domain of Class IV genes (Figure 4).
Another residue relevant to the interaction of AGO with the 5’-phosphate of the smRNA in the ‘nucleotide specificity loop’ of the MID domain is 'T526' (in HsAGO2), which corresponds to alignment position 2447 in plants (Additional file 2). Classes I and IV genes harbor a conserved ‘N’ and ‘K’ respectively, whereas there is no consensus in Classes II and III at this position. Studies of HsAGO2 [10] suggest that the first oxygen atom of the 5'-phosphate of smRNAs also interacts with side-chain residue of ‘R812’ in the PIWI domain. Position 2980 corresponds to ‘R812’, and harbors a conserved ‘R’ in Classes I-III genes, while in the Class IV genes, PIWI has a conserved ‘Q’ instead (Figure 4). In the crystal structure of HsAGO2 in a complex with miR-20a, the 2nd nucleotide of smRNA interacts with ‘Q548’ of the MID domain and ‘Q757’ of the PIWI domain [10]. These residues correspond to positions 2500 and 2906 in MSA. An ‘L’ is present at position 2500 in Classes I and III, whereas Classes II and IV are highly variable, with ‘Q’ and ‘A’ being over-represented in these two classes respectively (Figure 4). The 5th nucleotide interacts with 'S798' and 'Y804' from the PIWI domain in HsAGO2 [10]. The first corresponding sites in plant AGOs contain 'S' (MSA position 2954) in all four classes, the second site harbors 'Y' (MSA position 2972) in Class I-III whereas 'C' is the over-represented residue in Class IV.
The ‘D-E-D-H/D’ signature has been associated with the catalytic activity of the PIWI domain [4,7]. The ‘D-E-D-H’ signature is apparent in Classes I, II, IV (and half of the Class III) of plant AGOs, whereas the D-E-D-D signature is present in AGO2 and AGO3 (Class III PIWIs; Figure 4, Additional file 2). In general, most of the functionally important sites of Class-I AGOs are conserved, while the converse seems true for Class-III AGOs (Figure 4).
Since the phylogenetic analysis indicates that the AGOs of unicellular forms such as Chlamydomonas and Volvox are highly divergent and evolved independently of those of the multicellular forms, we further investigated the occurrence of the above-mentioned residue signatures and predicted functionally important sites (Additional file 8). We found a high diversity across many important sites (Additional file 8). Similarly, the three Physcomitrella AGOs also have unique residues compared to AGOs in other lineages (Additional file 8).
Such patterns of occurrence of functionally important residues may have consequences for smRNA recruitment, their biochemical activities and the roles of AGOs in diverse physiological processes in both unicellular and multicellular life-forms. Indeed, our homology modeling and RNA docking studies clearly pointed towards differences in seed recognition and catalytic region of the four classes of AGOs (Additional file 9).
Evolution of AGO sequences
We next determined the 'position-by-position' ML-based relative evolutionary rates using a gamma (γ)-distribution based best substitution model. Of the total 620 sites in ‘AGO dataset II’ (Figure 1, Additional files 2 and 10), 218 sites have a relative rate <1 whereas 69 sites have relative rates >1 in all four classes (Additional file 10: Table S3A). Relatively small ML values of γ- shape parameter were observed for Class I (0.5881; Additional file 10: Table S3B), indicating that the majority of sites (405) in Class I AGOs (Additional file 10: Table S3B) are evolving at slow relative rates. These sites are more frequently found in the MID and PIWI domains (Figure 5). On the other hand, Class III AGOs show a large ML value of the γ- shape parameter (1.0174; Additional file 10: Table S3B), indicating that less number of sites (361 as compared to 405 for example in Class I) are evolving at slow relative rates (Figure 5, Additional file 10: Table S3B).
Residues involved in substrate recognition and catalysis show low relative rates of evolution (Figures 4 and 5), indicating such residues are conserved during the course of evolution. For instance, the ‘D-E-D-D/H’ signature involved in catalytic activity of the PIWI domain has low relative rates across all the four classes of AGOs. Overall, the seed recognizing MID-PIWI lobe of Classes I and II show a low relative rate (slow evolving; Figure 5). Moreover, other regions putatively involved in seed recognition and the nucleotide specificity loop show a low relative evolutionary rate in Class I AGOs as compared to other classes (Figure 5). For certain sites, substitution of residues along with variability in relative rates was noticed between different classes. For instance, at position 2000, located near the seed recognition pocket and implicated in the 3' overhang recognition of smRNA [10], substitution of K in Class I to E in Class IV was observed; both the residues are evolving at slow rates (Figure 4). Such changes may explain the capacity of AGO proteins to sort and load smRNAs with specific residues at their termini [23]. On the other hand, it was interesting to note that the N-terminal and the PAZ domains have several sites with high relative rates (fast evolving) across all four classes of AGOs (Figure 5).
These observations suggested the possibility that different classes of AGOs undergo site-specific rate shifts. We performed the likelihood ratio test by calculating the coefficient of Type I (θI) divergence and the posterior probability (PP) of a shift in substitution rate (Additional file 11). Rejection of the null hypothesis (θI > 0) indicates that after duplication, selection constraints may have altered many sites differently in different classes (thus shifts in substitution rates in different classes; θI values of 0.2814-0.6509 for pairwise comparisons; Additional file 11: Table S4A). Hence, as expected, large variations in site-specific profiles of PP among different classes were observed (Additional file 11: Table S4B). Maximum shifts were observed between Classes I and IV (Additional files 11: Table S4B, and Additional file 12). Also, the functional branch lengths (bF) of Class IV and Class III were nearly two times greater than the branch length of Class I and Class II (p <0.05; Additional file 13). Such results point to different evolutionary histories of different classes of AGOs that may have resulted in different structural and functional properties; Class I AGOs may have diverged functionally more than Class IV AGOs.
Context-dependent coevolution of amino acid residue
The evolution of protein residues is frequently context-dependent in that substitutions at a given site are affected by local structure, residues at the other sites, and related functions. Such context-dependent substitutions result in co-evolution of amino-acid residues that have implications for protein structure and function. We uncovered coevolving residues in plant AGOs by using Pearson correlation coefficient (r) as implemented in CAPS 2.0 (coevolution analysis using protein sequences) algorithm [34]. Only co-evolving sites with r ≥ 0.5 were considered significant (Figure 6, Additional file 14: Table S5A). Class III AGOs accounted for largest number of coevolving residues (Figure 6A, Additional file 14: Table S5A). Strong correlation of r > 0.9 was observed between the sites coevolving in the PAZ domain and PIWI domain of Class III AGOs (Figure 6A, Additional file 14: Table S5A). Four classes of AGOs displayed heterogenous coevolving groups of residues that are of different sizes. In Class III AGOs, PIWI domains displayed the largest number of coevolving residues (Figure 6A, Additional file 14: Table S5B). In general, the amino acid residue 'R' is the most frequently correlating residue in Class I and II, while residue ‘L’ is found most frequently correlating in Classes III and IV (Figure 6B, Additional file 14: Table S5C). In Class I, 'G' is the second most frequent residue that is significantly correlated mainly to 'G', 'Q', ‘R’ and 'H'. In Class II, ‘G’ is again the second most frequent residue that instead significantly correlates to 'V', 'S', ‘E’, 'K' and 'R' (Figure 6B, Additional file 14: Table S5C). In Class III and IV, ‘P’ is the second most frequent residue that significantly correlates frequently to ‘V’, ‘Q’ and ‘F’, and to ‘P’, ‘G’ and 'R' respectively (Figure 6B, Additional file 14: Table S5C).
Correlation patterns in the context of specific residues at a site in the sequence were observed. For instance, position 2002 (in MSA) in the PAZ domain (may play a role in wedging 14th and 15th nucleotide of loaded RNA duplex [10]), is overrepresented by the residues ‘R’ in Classes I and III, and ‘K’ in Classes II and IV AGOs respectively (Figure 4). This 'K' is highly correlated with two other residues in Classes II and IV (Figure 6C, Additional file 14: Table S5A). On the other hand, position 2002 in Classes I and III do not show any significant correlation coefficient with other residues in the protein. Similarly, the ‘H’ at position 2505 (Figure 4) in the MID domain of Class I AGO is highly correlated to residue ‘Q’ at position 2906 in PIWI domain (Figure 6C, Additional file 14: Table S5A). Residue corresponding to position 2505 (Figure 4) could bind to phosphate of 2nd nucleotide of smRNA, directing the 1st nucleotide into a deep binding pocket at the interface between MID and PIWI domain, whereas Q, corresponding to postion 2906 may coordinate with N2 and N3 on the minor groove side of the G5 base at seed sequence of smRNA [10] . In other classes, where ‘H’ is replaced, no significant correlation is observed. In HsAGO, ‘R’ corresponding to 2835 in the PIWI domain (MSA; Additional file 2) stacks between the U9 and U10 of miRNAs to result in a major kink [10]. In Classes I, II and III residues ‘R’ is conserved at position 2835 (Figure 4) and do not show any correlation with other residues, whereas in Class IV AGOs, this position is overrepresented by ‘N’ and shows significant correlation to two other residues (Figure 6C, Additional file 14: Table S5A).
Diverse correlation patterns were observed in the ‘nucleotide specificity loop’ across the four classes of AGOs (Figure 6). None of the five residues of nucleotide specificity loop of Class I AGO showed any significant correlation. The 5’-end of the smRNAs interact with peptide backbone of the HsAGO residues [10] corresponding to positions 2445 and 2447 (Additional file 2). ‘T’, as in HsAGO, was overrepresented only in Class II (AGO5) and correlated to residues in PIWI and PAZ domains (Figure 6C). On the other hand, in Class IV, ‘E’ (MID domain; position 2445; Figure 4) correlates with ‘R’ in MID domain and ‘V’, ‘S’ and ‘H’ in PIWI domain (positions 2446, 2567, 2972 and 2975 respectively; Figure 4 and 6C; Additional file 14: Table S5A). Such class specific coevolving residues may influence the functional diversification of AGOs.