WRKY transcription factors in legumes

Background WRKY transcription factors, so named because of the WRKYGQK heptapeptide at the N-terminal end, are widely distributed in plants and play an important role in physiological changes and response to biotic and abiotic stressors. Many previous studies have focused on the evolution of WRKY transcription factors in a given plant; however, little is known about WRKY evolution in legumes. The gene expression pattern of duplicated WRKY transcription factors remains unclear. Results We first identified the WRKY proteins in 12 legumes. We found that the WRKYGQK heptapeptide tended to mutate into WRKYGKK. The Q site in WRKYGQK preferentially mutated, while W, K, and Y were conserved. The phylogenetic tree shows that the WRKY proteins in legumes have multiple origins, especially group IIc. For example, WRKY64 from Lupinus angustifolius (LaWRKY64) contains three WRKY domains, of which the first two clustered together in the N-terminal WRKY domain of the group I WRKY protein, and the third WRKY domain grouped in the C-terminal WRKY domain of the group I WRKY protein. Orthologous WRKY genes have a faster evolutionary rate and are subject to constrained selective pressure, unlike paralogous WRKY genes. Different gene features were observed between duplicated WRKY genes and singleton WRKY genes. Duplicated Glycine max WRKY genes with similar gene features have gene expression divergence. Conclusions We analyzed the WRKY number and type in 12 legumes, concluding that the WRKY proteins have multiple origins. A novel WRKY protein, LaWRKY64, was found in L. angustifolius. The first two WRKY domains of LaWRKY64 have the same origin. The orthologous and paralogous WRKY proteins have different evolutionary rates. Duplicated WRKY genes have gene expression divergence under normal growth conditions in G. max. These results provide insight into understanding WRKY evolution and expression. Electronic supplementary material The online version of this article (10.1186/s12870-018-1467-2) contains supplementary material, which is available to authorized users.


Background
The WRKY gene family comprises a class of important transcription factors involved in physiological change and response to biotic and abiotic stress [1][2][3][4]. WRKY transcription factors contain a WRKYGQK heptapeptide at the N-terminal end and a zinc-finger motif (CX 4-5 CX 22-23 HXH or CX 7 CX 23 HXC) at the C-terminal end [1,2]. WRKY proteins can be classified into groups I-III based on the number of WRKY domains and the type of zinc-finger motif [1,2]. Group I WRKY contains two WRKY domains and a zinc-finger motif [1,2]. Group II WRKY contains a single WRKY domain and a CX 4-5 CX 22-23 HXH zinc-finger motif, and group II WRKY can be divided into the following five subgroups: IIa, IIb, IIc, IId, and IIe [1,2]. Group III WRKY has a single WRKY domain and a CX 7 CX 23 HXC zinc-finger motif [1,2]. WRKY can also be classified into two groups based on their intron type, R-type or V-type [2]. R-type introns are distributed in groups Ic, IIc, IId, and III, which contains an intron spliced immediately after the R (Arg) position [2,5,6]. V-type introns are distributed in groups IIa and IIb. One intron is located before the K (Val) position, which is at the sixth amino acid after the second C residue in the C 2 H 2 zinc finger motif [2,5,6].
Many studies have focused on the evolution of the WRKY gene family, but there is a debate about the origin of each type of WRKY in various plant species. According to a phylogenetic tree, the evolutionary relationships proposed revealed that the WRKY gene family can be classified into four clades including groups I + IIc, groups IIa + IIb, group IId, and group IIe [6]. Based upon phylogenetic analyses, researchers have proposed that the group II and III WRKY domains are descendants that have originated from the C-terminal WRKY domain of group I [2,6]. With the development of sequencing technology, an increasing number of complete genome-wide sequences have been reported for various plant species. Researchers have identified more WRKY gene families in various plants, obtaining results that contrasted the above conclusion. Specifically, Zhu et al. [31] found that the Triticum aestivum subgroup IIc WRKY domains originated from the N-terminal WRKY domain of group I. Wei et al. [32] demonstrated that group I WRKY proteins first appeared in monocotyledons, followed by groups III and II. Brand et al. [33] reported that group I and other WRKY proteins likely originated from subgroup IIc. Recently, Rinerson et al. [4] detected the number and type of WRKY gene families ranging from lower organisms to higher organisms without the use of phylogenetic trees. Rinerson et al. [4] proposed two alternative hypotheses of WRKY protein evolution: the "Group I Hypothesis" and the "IIa + b Separate Hypothesis" [4]. The "Group I Hypothesis" proposed that all WRKY proteins evolved from the C-terminal WRKY domains of group I proteins, whereas the "IIa + b Separate Hypothesis" suggested that groups IIa and IIb evolved directly from a single domain algal gene separated from a group I-derived lineage [4].
To date, genome-wide sequences have been reported for 12 legume species, and their phylogenetic relationships have been revealed based on genomic data [34][35][36] (Fig. 1). However, studies on the phylogenetic relationships of the WRKY gene family are limited in these legume species. In this study, we identified WRKY genes using the same method utilized by other studies and confirmed the number and type of WRKY genes in each genome. Then, we identified orthologs and paralogs and estimated their evolutionary rate. Further, we compared gene features between duplicated WRKY genes and singleton WRKY genes as well as the gene features and expression in WRKY paralogs. These results provide a greater depth of the understanding of WRKY evolution.

Phylogenetic analyses
The multiple sequence alignment of all WRKY domains and groups I, II, and III domains were executed using MAFFT 7.0 [52]. The maximum likelihood (ML) estimate used the best-fitting model of sequence evolution as determined by ProtTest [53]. The phylogenetic trees were constructed using IQ-TREE [54] and were estimated using an SH-aLRT test with 1000 random addition replicates and ultrafast bootstrap approximation set to 10,000.
For Arachis, homologous genes were identified in the transcriptome assembly using the similarity-based method [37,55]. Here, we used the same method to identify WRKY orthologs and paralogs in the 12 legumes. In brief, the multiple alignment of coding sequences (CDSs) was executed using each species with the local BLAST program [56]. The following evaluation criteria were used as thresholds to determine inclusion in subsequent analyses [55]: (1) length of aligned sequences > 80% of each sequence length; (2) identity > 80%; and (3) E-value ≤10 − 10 . MAFFT 7.0 [52] was used to align duplicated CDS and amino acid pairs. PAL2NAL [57] was used for the conversion of amino acid sequences into the corresponding CDSs. PAML 4.0 [58] was used to calculate the nonsynonymous substitution rates (K a ) and synonymous substitution rates (K s ). If the K s value was less than 0.01 or more than 3, and the K a value was nearly 0, these duplicated genes were excluded, because low sequence divergence could result in unknown estimates, and a high K s value indicated potential sequence saturation [58,59]. When K a / K s = 1 it indicated neutral selection, when it was > 1 it indicated positive selection, and when it was < 1 it indicated purifying selection.

Gene feature in homologs
To compare duplicated and singleton WRKY genes, we estimated the gene features between these genes. The gene features included polypeptide length, GC content at three codon positions (GC1, GC2, and GC3), and the frequency of optimal codons (Fop). The Fop and polypeptide length were calculated using the codon W program (version 1.4, http://codonw.sourceforge.net). The GC content was estimated using the in-house perl script.

WRKY genes in 12 legume species
A total of 75 AdWRKY, 77 AiWRKY, and 178 GmWRKY proteins with both WRKYGQK heptapeptide and zinc-finger motifs were identified in A. duranensis, A. ipaënsis, and G. max genome sequences, respectively [18,29]. Re-identification of WRKY proteins in C. arietinum, L. japonicus, and M. truncatula showed that 58 CaWRKY, 78 LjWRKY, and 98 MtWRKY proteins, respectively, were found using the HMMs method (Table 1). We detected 92 CcWRKY, 108 LaWRKY, 88 PvWRKY, 89 TpWRKY, 77 VaWRKY, and 76 VrWRKY proteins that contained both the WRKYGQK heptapeptide and zinc-finger motif using the HMMs method in C. cajan, L. angustifolius, P. vulgaris, T. pratense, V. angularis, and V. radiata genomes, respectively (Table 1). In addition, genomes with missing zinc-finger motifs and/or partial WRKY proteins are presented in Additional file 1: Table S1. We named WRKYs based on the order of genes located in chromosomes. CaWRKY, LjWRKY, and MtWRKY have been named in previous studies [47][48][49]. Newly detected WRKY genes from C. arietinum, L. japonicus, and M. truncatula were named based on the order of their location on the chromosome. If alternative splicing was observed for the WRKY genes, we retained the primary transcript. These WRKY genes can be classified into three groups: groups I, II, and III. A comparison of the number of WRKY genes among the 12 legumes showed that G. max contained the greatest number of WRKY genes, while C. arietinum contained the fewest WRKY genes.
WRKY genes can regulate downstream genes that are involved in physiological change and response to biotic and abiotic stress by WRKYGQK heptapeptide binding to the cis-acting element of the downstream gene [1]. In this study, we found that 139 WRKYGQK sequences were observed in at least one mutation site (Fig. 2). WRKYGKK tended to be mutated. Among these sequences, most WRKYGQK sequences from group II WRKY genes appeared to be mutated (Additional file 1: Table S1), indicating that the biological function of the group II WRKY genes was more diverse. Further, in this study, the Q in WRKYGQK preferentially mutated, while the W, K, and Y were conserved (Fig. 3). Previous studies have revealed that mutation of the K positions in WRKYGQK sequences disrupt protein-DNA interactions [62,63]. In this study, some WRKYGQK sequences were observed with a K mutation, suggesting that the mutation influences protein-DNA interactions.
A WRKY protein can form a fusion protein with the nucleotide binding site-leucine-rich repeat (NBS-LRR) protein [1,4], which is involved in plant response to pathogens [64]. We found two, one, and one WRKY-NBS fusion proteins in A. duranensis, A. ipaënsis, and G. max, respectively. We propose that there is no evolutionary relationship among WRKY proteins, NBS-LRR proteins, and WRKY-NBS fusion proteins due to the absence of WRKY-NBS proteins in close relatives to A. duranensis, A. ipaënsis, and G. max. WRKY-NBS fusion proteins can be classified into eight groups based on a phylogenetic tree [4]. As reported in a previous study, a G. max WRKY-NBS protein that belongs to the group RW4 was identified [4], but there is no record of the Arachis WRKY-NBS fusion protein. We constructed an ML phylogenetic tree using the JTT + I + G model using reported WRKY-NBS fusion proteins and Arachis WRKY-NBS fusion proteins. The phylogenetic tree showed that Arachis WRKY-NBS fusion proteins are a part of the group RW4 (Additional file 2: Figure S1). In addition, an ML phylogenetic tree using the JTT + G model based upon 12 legume WRKY domains was constructed ( Fig. 4 and Additional file 3: Figure S2). GmWRKY, AdWRKY, AiWRKY, MtWRKY, LjWRKY, and CaWRKY have been reported in previous studies [18,29,[47][48][49]. Accordingly, we used these WRKY genes as reference sequences to further classify other group II WRKY genes. The ML tree showed that the WRKY proteins can be classified into eight clusters: In, Ic, IIa, IIb, IIc, IId, IIe, and III (Fig. 4). However, we found that some clades contain different types of WRKY proteins. For example, some group II WRKY proteins clustered with group I WRKY proteins, and some group I WRKY   proteins mixed with groups IIa, IIc, IIe, and III (Fig. 4). Further, two WRKY domains of group I grouped with IIa, IIc, and III, but only N-terminal WRKY domains of group I clustered in group IIe. In addition, the WRKY proteins in groups In, IIa, IIe, and III clustered into separate clades, but the other types of WRKY proteins formed multiple mixed clades (Fig. 4). These results indicated that WRKY proteins have multiple origins, especially group IIc.
To reveal the phylogenetic relationships of group II WRKY proteins, we constructed an ML phylogenetic tree using the JTT + G model based on all group II WRKY proteins in the 12 legumes ( Fig. 5 and Additional file 4: Figure S3). The topological structure of the phylogenetic tree is similar between using solely group II WRKY proteins and using all WRKY proteins. For example, groups IIa and IIe clustered in a clade, and other WRKY proteins grouped into multiple clades. However, compared with using all WRKY proteins to construct the phylogenetic tree, the number of clusters of group IIc WRKY proteins is relatively lower. We found a mixed clade, IIm, including group IIb and IIc WRKY proteins. Further, we found that members of the clade IIm were observed in group II, including MtWRKY99 and VaWRKY2, but were not GmWRKY102 VrWRKY49 The number of WRKY domains showed that LaW-RKY64 contains three WRKY domains (Fig. 6), and we propose this newly identified sequence as a member of group I. In addition, LaWRKY96 and VrWRKY68 contain three WRKY domains, but the second and third WRKY domain of LaWRKY96 and VrWRKY68 are abnormal. We speculate that the original LaWRKY96 and VrWRKY68 contain two WRKY domains because there is a short sequence inserted into the second WRKY domain in LaW-RKY96 and VrWRKY68 (Fig. 6). The phylogenetic tree showed that the first two WRKY domains of LaWRKY64 clustered together in In, and the third WRKY domain grouped in Ic (Fig. 4). Accordingly, we proposed that the second WRKY domain possibly originated from the first WRKY domain in LaWRKY64. To verify this hypothesis, we constructed a phylogenetic tree using group I WRKY proteins. The results showed that the sequences were highly homologous between the first and second WRKY domains (Additional file 5: Figure S4).

Evolutionary rate and gene feature in homologous genes
In this study, we identified 317 orthologous pairs and 67 paralogous pairs in 12 legumes. We excluded one orthologous pair and six paralogous pairs due to the abnormality of the evolutionary rates of these seven homologous pairs. The evolutionary rate showed that the average K a , K s , and K a /K s of the paralogs was 0.06, 0.19, and 0.34, respectively. The average K a , K s , and K a / K s of the orthologs was 0.08, 0.34, and 0.28, respectively. A comparison of paralogs and orthologs revealed that the average K a and K s (a proxy for evolutionary rate) of the orthologs was higher than that of the paralogs, but the K a /K s (a proxy for selective pressure) for the orthologs was lower than that of the paralogs (Fig. 7). These results indicated that orthologs have a faster evolutionary rate and are subject to constrained selective pressure, unlike paralogs.
The correlation analyses showed that K a was positively correlated with K s in paralogs and orthologs (paralog: r = 0.73, P < 0.01; ortholog: r = 0.58, P < 0.01). The K s was negatively correlated with the K a /K s in paralogs and orthologs (paralog: r = − 0.27, P < 0.05; ortholog: r = − 0.42, P < 0.01). The K a was not correlated with the K a /K s in orthologs (ortholog: r = 0.07, P > 0.05), but the K a was significantly positively correlated with the K a /K s in paralogs (paralog: r = 0.38, P < 0.01). These results indicated that K a influenced the K a /K s in paralogs, while the K s determined the K a /K s in paralogs and orthologs.
Comparison of gene features between paralogs and orthologs showed that the average Fop and GC content of the orthologs was not significantly lower than that of paralogs (Additional file 6: Table S2).

Correlation of evolutionary rate and gene feature between orthologs and paralogs
There are correlations between evolutionary rate and gene feature [59,65,66]; however, different correlations have been found in various organisms. In this study, our results showed that the K a was significantly negatively correlated with Fop, GC1, and GC2, and the K a /K s was significantly negatively correlated with Fop, polypeptide length, GC1, and GC2 in orthologs ( Table 2). In paralogs, the K a was significantly negatively correlated with Fop, GC2, and GC3, and the K a /K s was significantly negatively correlated with Fop, GC1, GC2, and GC3 (Table 2). These results indicated that both Fop and GC2 influenced in K a , and Fop, GC1, and GC2 affected K a /K s in paralogs and orthologs. Fig. 7 Comparison of the evolutionary rate between orthologs and paralogs. PAL2NAL was used to convert amino acid sequences into the corresponding nucleotide sequences. PAML 4.0 was used to calculate the nonsynonymous/synonymous substitution (K a /K s ) rate. The figure was constructed using the ggpubr package in R script

Comparison of gene features between duplicated and singleton WRKY genes
Here, we addressed the gene features between duplicated WRKY genes and singleton WRKY genes in legumes. We found that the Fop (a proxy for codon usage bias) of duplicated WRKY genes was slightly higher than that of singleton WRKY genes ( Table 3). The polypeptide length of duplicated WRKY genes was longer than that of singleton WRKY genes ( Table 3). The GC content at the three codon positions of duplicated WRKY genes was higher than that of singleton WRKY genes (Table 3), but this is not statistically significant in GC1 content (Table 3). These results indicated that different gene features were observed between duplicated WRKY genes and singleton WRKY genes.

Glycine max WRKY paralogs
Compared with WRKY genes in other legumes, more duplicated GmWRKY (41 gene pairs) and LaWRKY (19 gene pairs) genes remained during the evolutionary process. Therefore, we investigated the gene features, evolutionary rates, and gene expression patterns in duplicated WRKY genes. To that end, we used duplicated GmWRKY because multiple tissue transcriptome datasets have been published for G. max, and it has the greatest number of duplicated WRKY genes. Our results showed that there is a slight difference in Fop (0-0.073), polypeptide length (0-14), GC1 (0-2.466), GC2 (0-3.087), and GC3 (0.015-4.664) among duplicated GmWRKY genes (Additional file 7: Table S3). However, the gene expression levels of 14 different tissues were observed to be largely different in duplicated GmWRKY genes. The differential gene expression of greater than 50% of the duplicated WRKY genes was observed as up to a two-fold difference, except for duplicated GmWRKY genes in seed tissue (Additional file 7: Table S3). These results indicated that the duplicated WRKY genes with similar gene features have gene expression divergence.
Negative correlations were found between the K a /K s and the gene expression level in 14 different tissues, but no statistical significance was found in young leaves, flowers, pod shells, and nodule tissues ( Table 4). The K a was significantly negatively correlated with the gene expression level in seeds 14 DAF and root tissues (Table 4). However, the K s had a coefficient of irregularity with the gene expression level, but it was not statistically significant ( Table 4). The correlation between gene features and gene expression level revealed that gene expression level was positively correlated with Fop, GC2, and GC3, but was not statistically significantly different in some tissues (Table 4). For example, the gene expression levels of leaves and pods 14 DAF were positively correlated with Fop, but were not statistically significantly different. There is correlation but no significance between the gene expression level of pods and GC2 (Table 4). In addition, the gene expression level was significantly positively correlated with GC3 in seeds 14 DAF, roots, and nodule tissues ( Table 4). The correlations among gene expression level, polypeptide length, and GC1 were irregular and not statistically significant (Table 4).

WRKY number, structure, and evolution in legumes
To date, genome-wide sequences obtained via different pipelines have been published for 12 legume species [35][36][37][38][39][40][41][42][43][44][45]. Accordingly, the sequencing depth of these legumes varies,      making it difficult to compare the number of WRKY genes. However, in this study, the largest number of GmWRKY genes was detected for G. max compared with the 11 other legume species. This is because G. max is an autotetraploid [34], and more than three whole-genome duplication (WGD) events have been identified for G. max [67]. Our previous study revealed that the number of WRKY genes was not correlated with genome size but positively correlated with the number of WGD events [18]. This is also consistent with our other study, in which we found that the number of GmLOX genes is greater than in other legumes [68]. Unexpectedly, L. angustifolius, a diploid species, has more WRKY genes than other diploid legumes in this study. One reasonable explanation is the retention rate of duplicated WRKY genes is higher than in other legumes. Normally, WRKY proteins contain one or two WRKY domains [1,2]. However, Mohanta et al. [69] found that WRKY proteins contained three WRKY domains in Gossypium raimondii and Linum usitatissimum and four WRKY domains in Aquilegia coerulea and Setaria italic. We speculate that WRKY proteins containing three or four WRKY domains might have recently evolved because these types of WRKY proteins have only been observed in higher plant species, not in lower plant species [69]. We also identified a WRKY protein containing three domains in L. angustifolius. Our results indicated that the middle WRKY domain possibly produced the N-terminal WRKY domain, indicating that WRKY domains can be replicated using gene duplication.
Genomic rearrangement plays a pivotal force in forming WRKY-NBS fusion proteins [4], and NBS-LRR is a resistance gene involved in response to pathogens [64]. WRKY genes located in central positions mediating fast and efficient activation of defense programs [21]. Accordingly, we speculated that the fusion of WRKY proteins and NBS-LRR proteins can increase disease resistance in plants. The effector PopP2 and AvrRps4 can bind to the WRKY domain of the Arabidopsis WRKY-NBS fusion protein, which activates another NBS-LRR protein involved in response to bacterial pathogens in Nicotiana benthamiana and Nicotiana tabacum [70].
It is hard to explain the origin of the WRKY gene family using only one hypothesis, such as the "Group I Hypothesis" and "IIa + b Separate Hypothesis" [4]. This is because an increasing number of studies has revealed that the WRKY gene family has multiple origins. The "Group I Hypothesis" proposed that all WRKY proteins evolved from C-terminal WRKY domains of group I proteins, whereas the "IIa + b Separate Hypothesis" stated that groups IIa and IIb evolved directly from a single domain algal gene separated from the group I-derived lineage [4]. Our results are consistence with these two hypotheses. Based on our phylogenetic analyses, we found that WRKY genes from different groups clustered into a clade. Furthermore, our study fills a gap in the knowledge on the evolution of WRKY genes by comparing gene features and evolutionary rates between WRKY orthologs and paralogs. We concluded that (1) K a influenced K a /K s in paralogs, while the K s determined the K a /K s in paralogs and orthologs; (2) orthologs have a faster evolutionary rate and are subject to constrained selective pressure, unlike paralogs; and (3) both Fop and GC2 influenced K a , and Fop, GC1, and GC2 affected K a /K s in paralogs and orthologs. However, more studies are required on legume WRKY genes because our current analysis does not clarify the evolutionary relationship among each group of WRKY genes. Therefore, further studies should not only focus on producing phylogenetic trees, but also work towards identifying novel WRKY genes, with a particular focus on WRKY proteins containing more than two domains or WRKY fusion proteins.

Gene expression in duplicated GmWRKY genes
The copy of duplicated genes will often be lost after WGD or small-scale duplication (SSD) [67]. However, many copies will be retained in the genome because they have novel molecular, structural, or adaptive traits functions [71]. Some researchers have proposed that duplicated genes with the same biological function will be lost due to fitness cost [72,73]. Others researchers hold that duplication allows further adaptive changes to accumulate [71,74,75]. In addition to these contrasting proposals, four mechanisms can explain the retention of duplicates: gene dosage increase [72]; duplication, degeneration, and complementation (DDC) [76]; gene balance [77]; and paralog interference [74]. In this study, we found that the gene features of duplicated gene were similar, but the gene expression patterns of duplicated genes were different in 14 different tissues. The duplicated GmWRKY genes might have been retained because copies had the asymmetric expression pattern when following the explanation of the DDC mechanism. The DDC model proposed that the mutations that cause subfunctionalization are explicitly neutral [67,76].
In this study, we found a negative correlation between K a / K s and gene expression level in 14 different tissues from G. max. This result is consistence with the expression-rate of sequence evolution anticorrelation model (E-R anticorrelation) [78]. The model proposed that the most highly expressed genes are also subject to the strongest selective constraint [79]. This can be explained by the expression cost hypothesis, the protein misfolding avoidance hypothesis, the protein misinteraction avoidance hypothesis, and the mRNA folding requirement hypothesis [80]. In addition, our results showed that Fop was positively correlated with gene expression levels, indicating that highly expressed genes have high codon usage bias. This is supported by natural selection in highly expressed genes preferentially using optimal codons [81]. Accordingly, we propose that natural selection plays a crucial role in codon usage bias of GmWRKY genes.